Did the idle time increase for each task?

wujj123456 · Joined: 14 Sep 08 · Posts: 52 · Credit: 73,339,376 · RAC: 20,884
Message 50996 - Posted: 2 Nov 2024, 17:10:34 UTC - Last modified: 2 Nov 2024, 17:12:14 UTC
AFAIC, all ATLAS native tasks have an idle period at the beginning. It used to be around 15-20 minutes even just a week ago, but recently it became 35-40 minutes for all my hosts. Coupled with the smaller WU where two cores can finish under 2 hours, the machine is sitting idle pretty frequently. Anyone else seeing the same?

maeax · Joined: 2 May 07 · Posts: 2267 · Credit: 175,671,719 · RAC: 110
Message 50997 - Posted: 2 Nov 2024, 17:16:33 UTC - in response to Message 50996.
Only one ATLAS task finished correctly for me today. Seeing the same problem. I have stopped all ATLAS tasks for the time being.

wujj123456 · Joined: 14 Sep 08 · Posts: 52 · Credit: 73,339,376 · RAC: 20,884
Message 50998 - Posted: 2 Nov 2024, 17:27:57 UTC - in response to Message 50997.
Thanks. At least I still see HITS files produced for most of my tasks, so hopefully they aren't completely broken...

Lem Novantotto · Joined: 24 May 23 · Posts: 52 · Credit: 4,469,843 · RAC: 127
Message 50999 - Posted: 2 Nov 2024, 18:01:53 UTC - in response to Message 50996.
Quote: "AFAIC, all ATLAS native tasks have an idle period at the beginning. It used to be around 15-20 minutes even just a week ago, but recently it became 35-40 minutes for all my hosts. Coupled with the smaller WU where two cores can finish under 2 hours, the machine is sitting idle pretty frequently. Anyone else seeing the same?"
Yes. However, it matters mostly if you run only ATLAS tasks with BOINC on that machine. You could try these workarounds:
1) Assign only one virtual core to each WU. The idle time, which is independent of computational power, is then followed by a longer computation time, so it makes up a smaller portion of the total.
2) In /var/lib/boinc-client/cc_config.xml you can fake a higher number of virtual CPUs (<ncpus>NUMBER</ncpus>) so as to run more WUs concurrently. When doing this, it's better to assign the highest possible number of virtual CPUs to each ATLAS WU (ideally as many as the machine has virtual cores, but never more than the number of events in each workunit). As long as the WUs are out of phase, so that only one or a few of them are in the computing stage at a time, you're not wasting any CPU time. If instead several WUs compute together, their computation time increases, giving you the same benefit as point 1).
I prefer the second one. I have sometimes faked 60 virtual CPUs on a machine with 12, so as to run 5 WUs (each with 12 virtual cores assigned) at the same time. It worked. Not perfect, but it gave some gains.
-- Bye
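To put rough numbers on the two workarounds above, here is a minimal Python sketch. It is not taken from the thread: the 40-minute idle phase and the 160 core-minutes of work per task are hypothetical figures chosen only to resemble the timings reported earlier, the perfect scaling across cores is an assumption, and the helper names are made up for illustration. The second helper only builds the `<ncpus>` fragment mentioned in workaround 2; a real cc_config.xml may carry other options that should be preserved.

```python
import xml.etree.ElementTree as ET


def idle_fraction(idle_min, core_minutes, cores):
    """Share of a task's wall time spent in the fixed idle phase.

    Assumes the idle phase does not shrink with more cores and that the
    compute phase scales perfectly with the number of cores assigned.
    """
    compute_min = core_minutes / cores
    return idle_min / (idle_min + compute_min)


def cc_config_with_ncpus(ncpus):
    """Build a minimal cc_config.xml body that overrides the CPU count."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "ncpus").text = str(ncpus)
    return ET.tostring(root, encoding="unicode")


# Hypothetical figures, not measurements from the thread.
IDLE_MIN = 40
CORE_MINUTES = 160

for cores in (1, 2):
    share = idle_fraction(IDLE_MIN, CORE_MINUTES, cores)
    print(f"{cores} core(s) per WU: {share:.0%} of wall time idle")

# Workaround 2 with the figures quoted above: 12 real threads faked as 60.
print(cc_config_with_ncpus(60))
```

Under these assumptions a single-core task spends a smaller share of its wall time idle than a two-core task, which is the effect workaround 1 relies on. For workaround 2, the BOINC client has to re-read its config files or be restarted before a changed cc_config.xml takes effect.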

wujj123456 · Joined: 14 Sep 08 · Posts: 52 · Credit: 73,339,376 · RAC: 20,884
Message 51000 - Posted: 2 Nov 2024, 18:53:40 UTC - in response to Message 50999. Last modified: 2 Nov 2024, 18:56:51 UTC
1) works for small tasks, until a new batch of larger tasks hits; those will then take days to finish, without checkpointing too. Similarly for 2), even if I assume perfect splay, the number of fake CPUs I should set depends on the compute:idle ratio, which in turn depends on task size. (I use avg_ncpus to control the number of WUs on the host instead of directly faking ncpus for my host. The effect is equivalent anyway.)
Given the variation in job size from time to time, I'm not very keen on constantly monitoring and maintaining the config. So I've been sticking with a third option: limit ATLAS to half of the SMT threads and have other projects fill the rest. That way, at worst I idle one SMT thread of each core rather than having a core sit completely idle.
These are all workarounds though. I made the post mostly to confirm it's not a problem on my end. When the workload characteristics suddenly change that much, there is always a chance something actually broke. For example, perhaps some cvmfs config needs to be updated, etc.
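The dependence on the compute:idle ratio can be sketched in a few lines of Python. This is a simplified model and not something posted in the thread: it assumes perfectly staggered ("splayed") WUs, each using every real core during its compute phase and none during its idle phase, and it ignores the slowdown when compute phases overlap. The function names and all timings are hypothetical.

```python
import math


def concurrent_wus(idle_min, compute_min):
    """WUs needed so that, with perfect staggering, one is always computing."""
    return math.ceil((idle_min + compute_min) / compute_min)


def fake_ncpus(real_cores, idle_min, compute_min):
    """Faked CPU count if every WU is assigned all real cores."""
    return real_cores * concurrent_wus(idle_min, compute_min)


# Hypothetical timings: a 40-minute idle phase with compute phases of
# different lengths on a 12-thread host.
for compute_min in (10, 80, 600):
    wus = concurrent_wus(40, compute_min)
    print(f"compute {compute_min:>3} min: {wus} WUs, "
          f"ncpus = {fake_ncpus(12, 40, compute_min)}")
```

With these made-up numbers the suggested setting ranges from 60 fake CPUs for very short tasks down to 24 for long ones, which illustrates why a fixed value would need retuning whenever the task size changes.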

LHC@home