Message boards :
ATLAS application :
Did the idle time increase for each task?
Joined: 14 Sep 08 Posts: 52 Credit: 64,094,999 RAC: 17,187
AFAIC, all ATLAS native tasks have an idle period at the beginning. It used to be around 15-20 minutes even just a week ago, but recently it became 35-40 minutes on all my hosts. Coupled with the smaller WUs, which two cores can finish in under 2 hours, the machine sits idle pretty frequently. Anyone else seeing the same?
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 307
Today only one ATLAS task finished correctly here. Seeing the same problem. I've stopped all ATLAS tasks for now.
Joined: 14 Sep 08 Posts: 52 Credit: 64,094,999 RAC: 17,187
Thanks. At least I still see HITS files produced for most of my tasks, so hopefully they aren't completely broken...
Joined: 24 May 23 Posts: 43 Credit: 2,624,143 RAC: 5,537
AFAIC, all ATLAS native tasks have an idle period at the beginning. It used to be around 15-20 minutes even just a week ago, but recently it became 35-40 minutes on all my hosts. Coupled with the smaller WUs, which two cores can finish in under 2 hours, the machine sits idle pretty frequently. Anyone else seeing the same?

Yes, though mostly if you run only ATLAS tasks with BOINC on that machine. You could try these workarounds:

1) Assign only one virtual core to each WU. The idle time is independent of computational power, so it is followed by a longer compute phase and becomes a smaller fraction of the total runtime.

2) In /var/lib/boinc-client/cc_config.xml you can fake a higher number of virtual CPUs (<ncpus>NUMBER</ncpus>) so that more WUs run concurrently. In that case it's better to assign the highest possible number of virtual CPUs to each ATLAS WU (ideally as many as the machine's virtual cores, but never more than the number of events in the workunit). As long as the WUs are out of phase, so that only one, or just some, are in the computing stage at any moment, you're not wasting any CPU time. And if several WUs do compute together, their compute phases stretch out, giving you the same benefit as point 1).

I prefer the second option. Sometimes I have faked 60 virtual CPUs on a machine with 12, so as to run 5 WUs (each with 12 virtual cores assigned) at the same time. It worked. Not perfect, but it gave some gains. -- Bye
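For workaround 2), a minimal sketch of the cc_config.xml override (the path is the Debian/Ubuntu default; the specific numbers are just the example from the post):

```xml
<!-- /var/lib/boinc-client/cc_config.xml -->
<!-- Report more CPUs to the client than the machine actually has. -->
<cc_config>
  <options>
    <!-- Example from the post: 60 fake CPUs on a 12-core machine
         lets 5 WUs with 12 virtual cores each run concurrently. -->
    <ncpus>60</ncpus>
  </options>
</cc_config>
```

The client picks this up after a restart or after "Options > Read config files" in the BOINC Manager.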
Joined: 14 Sep 08 Posts: 52 Credit: 64,094,999 RAC: 17,187
1) works for small tasks, until a new batch of larger tasks hits; those will take days to finish, and without checkpointing too. Similarly for 2): even assuming perfect splay, the number of fake CPUs I should set depends on the compute-to-idle ratio, which in turn depends on task size. (I use avg_ncpus to control the number of WUs on the host instead of directly faking ncpus for the host; the effect is equivalent anyway.) Given how job sizes vary from time to time, I'm not keen on constantly monitoring and maintaining the config. So I've been sticking with a third option: limit ATLAS to half of the SMT threads and let other projects fill the rest. At worst I idle the SMT sibling of each core, rather than leaving a whole core idle. These are all workarounds though. I made the post mostly to confirm it's not a problem on my end. When the workload characteristics suddenly change this much, there is always a chance something actually broke; for example, perhaps some CVMFS config needs to be updated.
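The third option above can be expressed in the project's app_config.xml; a sketch for a 12-thread host, assuming the app name ATLAS and plan class native_mt (verify both against the entries in your client_state.xml, and adjust the project directory name to match yours):

```xml
<!-- projects/<project-url-dir>/app_config.xml -->
<app_config>
  <app>
    <name>ATLAS</name>
    <!-- Cap concurrent ATLAS WUs so other projects fill the remaining threads. -->
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>native_mt</plan_class>
    <!-- Half the SMT threads of a 12-thread host go to one ATLAS WU. -->
    <avg_ncpus>6</avg_ncpus>
  </app_version>
</app_config>
```

As with cc_config.xml, the client reads this after a restart or "Read config files".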
©2024 CERN