Message boards : ATLAS application : Did the idle time increase for each task?
wujj123456

Joined: 14 Sep 08
Posts: 52
Credit: 64,094,999
RAC: 17,187
Message 50996 - Posted: 2 Nov 2024, 17:10:34 UTC
Last modified: 2 Nov 2024, 17:12:14 UTC

AFAICT, all ATLAS native tasks have an idle period at the beginning. It used to be around 15-20 minutes as recently as a week ago, but it has now become 35-40 minutes on all my hosts. Coupled with the smaller WUs, which two cores can finish in under 2 hours, the machines are sitting idle pretty frequently. Is anyone else seeing the same?
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 307
Message 50997 - Posted: 2 Nov 2024, 17:16:33 UTC - in response to Message 50996.  

Only one ATLAS task finished correctly for me today.
I'm seeing the same problem. I have stopped all ATLAS work for the time being.
wujj123456

Joined: 14 Sep 08
Posts: 52
Credit: 64,094,999
RAC: 17,187
Message 50998 - Posted: 2 Nov 2024, 17:27:57 UTC - in response to Message 50997.  

Thanks. At least I still see HITS files produced for most of my tasks, so hopefully they aren't completely broken...
Lem Novantotto

Joined: 24 May 23
Posts: 43
Credit: 2,624,143
RAC: 5,537
Message 50999 - Posted: 2 Nov 2024, 18:01:53 UTC - in response to Message 50996.  

AFAICT, all ATLAS native tasks have an idle period at the beginning. It used to be around 15-20 minutes as recently as a week ago, but it has now become 35-40 minutes on all my hosts. Coupled with the smaller WUs, which two cores can finish in under 2 hours, the machines are sitting idle pretty frequently. Is anyone else seeing the same?

Yes.

However, especially if you run only ATLAS tasks with BOINC on that machine, you could try these workarounds:

1) Assign only one virtual core to each WU. The idle time is independent of computational power, so it is then followed by a longer computation phase and makes up a smaller portion of the total run time.

2) In /var/lib/boinc-client/cc_config.xml you can fake a higher number of virtual CPUs (<ncpus>NUMBER</ncpus>) in order to run more WUs concurrently. When doing this, it's better to assign the highest possible number of virtual CPUs to each ATLAS WU (ideally as many as the machine has virtual cores, but never more than the number of events in each workunit). As long as the WUs are out of phase, so that only one or a few are in the computing stage at any time, you're not wasting any CPU time. If instead several WUs compute together, their computation time increases, giving you the same benefit as point 1).

I prefer the second one. I have sometimes faked 60 virtual CPUs on a machine with 12, so as to be able to run 5 WUs (each with 12 virtual cores assigned) at the same time.
It worked. Not perfectly, but it gave some gains.
--
Bye
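
To put rough numbers on workaround 1), here is a minimal sketch of how the idle share of a task's wall time grows with the number of cores assigned. It assumes the idle phase is a fixed wall-clock duration and that the compute phase scales perfectly with core count; both are simplifications, and the function name and the 40 / 160 minute figures are purely illustrative, not measured values.

```python
def idle_fraction(idle_min, cpu_min, cores):
    """Fraction of a task's wall time spent in the idle start-up phase.

    idle_min: wall-clock minutes of the idle phase (assumed fixed)
    cpu_min:  total core-minutes of real computation in the task
    cores:    cores assigned to the task (perfect scaling assumed)
    """
    compute_wall = cpu_min / cores
    return idle_min / (idle_min + compute_wall)


# Illustrative numbers only: 40 min idle, 160 core-minutes of compute.
for cores in (1, 2, 4, 12):
    print(cores, round(idle_fraction(40, 160, cores), 2))
```

Under these assumptions the idle share rises steadily as more cores are assigned to one WU, which is why a single core per WU wastes the smallest portion of the run.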
wujj123456

Joined: 14 Sep 08
Posts: 52
Credit: 64,094,999
RAC: 17,187
Message 51000 - Posted: 2 Nov 2024, 18:53:40 UTC - in response to Message 50999.  
Last modified: 2 Nov 2024, 18:56:51 UTC

1) works for small tasks, until a new batch of larger tasks arrives; those would then take days to finish, with no checkpointing either. Similarly for 2): even assuming a perfect splay, the number of fake CPUs I should set depends on the compute:idle ratio, which in turn depends on task size. (I use avg_ncpus to control the number of WUs on the host instead of directly faking ncpus; the effect is equivalent anyway.) Given how much job size varies from time to time, I'm not keen on constantly monitoring and maintaining the config. So I've been sticking with a third option: limit ATLAS to half of the SMT threads and have other projects fill the rest. At worst I then idle one SMT thread of each core rather than leaving a whole core completely idle.
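
To illustrate why the fake CPU count depends on the compute:idle ratio, here is a rough sketch. It assumes perfectly staggered tasks, a fixed idle phase, and that each task uses all of its assigned threads while computing; the function names and the sample durations are hypothetical, not taken from real tasks.

```python
import math


def tasks_to_cover_idle(idle_min, compute_min):
    """Smallest number of evenly staggered tasks such that at least one
    is always in its compute phase.

    Each task cycles through idle_min of idling followed by compute_min
    of computing (wall-clock minutes when it has its threads to itself).
    """
    return math.ceil((idle_min + compute_min) / compute_min)


def fake_ncpus(idle_min, compute_min, threads_per_task):
    """Value to advertise as ncpus so that this many tasks run at once."""
    return tasks_to_cover_idle(idle_min, compute_min) * threads_per_task


# Illustrative: 40 min idle, 12 threads per task.
print(fake_ncpus(40, 20, 12))  # short compute phase: 3 tasks needed
print(fake_ncpus(40, 80, 12))  # longer compute phase: 2 tasks needed
```

A batch with a shorter compute phase needs more concurrent tasks to hide the same idle period, so the setting would have to be retuned whenever the task size changes.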

These are all workarounds though. I made the post mostly to confirm that the problem isn't on my side. When workload characteristics suddenly change that much, there is always a chance something actually broke. For example, perhaps some CVMFS config needs to be updated.



©2024 CERN