Message boards : ATLAS application : Current batch contains tasks with abnormal runtimes

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 48120 - Posted: 23 May 2023, 6:59:41 UTC

The current batch has very short running tasks.
The downloaded pool.root file is only 887 kB instead of the 'normal' 300 - 500 MB.
Each event takes between 5 and 60 seconds to process on my system. The number of events is 500.
ID: 48120
kotenok2000
Joined: 21 Feb 11
Posts: 72
Credit: 570,086
RAC: 139
Message 48162 - Posted: 31 May 2023, 15:45:29 UTC - in response to Message 48120.  

Now they run for 8 hours.
ID: 48162
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 48163 - Posted: 31 May 2023, 16:54:46 UTC - in response to Message 48162.  

I only compare on one and the same system.
The runtime of the very short tasks on that system was 30 - 32 minutes, whereas the normal runtime is about 2 hours on average.
Tasks now have a download file of 390 MB, an upload result file of 370 MB, and runtimes of about 9 hours with 4 to 5 threads/task on the VM.
ID: 48163
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 48189 - Posted: 4 Jun 2023, 12:45:12 UTC - in response to Message 48188.  
Last modified: 4 Jun 2023, 19:21:56 UTC

.... and I haven't had much success in bringing them back to life again. Suspending and resuming worked for one, but the other two finished with 'computation error'.
Once the last event has been processed, a result file (HITS file) must be compiled. That needs no CPU and takes several minutes.
Another point is that you are using multi-core tasks. Towards the end of a task, not all CPUs are used anymore, because there are not enough events left from the initial 500.
So your CPU usage drops from 4 cores to 1, or from 8 cores to 1, during the last events. One of those last events could simply be a longer one, taking half an hour or so.
So don't touch a running task, and don't suspend and resume it.
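To illustrate the drop in CPU usage at the end of a multi-core task, here is a minimal Python sketch. It assumes a simple model that is not taken from the ATLAS code: all events sit in one queue and each worker picks up the next event as soon as it is free. The function names and the event durations are made up for the illustration.

```python
import heapq
import random


def worker_finish_times(durations, num_workers):
    """Return the time at which each worker finishes its last event.

    Assumed model (not taken from the ATLAS code): events sit in one
    queue and every worker picks up the next event as soon as it is free.
    """
    free_at = [0.0] * num_workers  # time at which each worker becomes free
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest free worker takes the event
        heapq.heappush(free_at, start + duration)
    return sorted(free_at)


def busy_workers_at(finish_times, t):
    """Count the workers that are still processing an event at time t."""
    return sum(1 for finish in finish_times if finish > t)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical per-event durations in seconds; real values vary per batch.
    durations = [rng.uniform(300, 1800) for _ in range(500)]
    finish = worker_finish_times(durations, num_workers=4)
    for t in finish:
        busy = busy_workers_at(finish, t - 1)
        print(f"t={t:8.0f}s  busy workers just before: {busy}")
```

In this model the workers run out of events one after another, so the last lines of output show fewer and fewer busy workers until only the one with the longest final event remains. That is normal and not a sign of a stuck task.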
ID: 48189
kotenok2000
Joined: 21 Feb 11
Posts: 72
Credit: 570,086
RAC: 139
Message 48192 - Posted: 4 Jun 2023, 16:01:57 UTC
Last modified: 4 Jun 2023, 16:15:51 UTC

One good thing is that we won't have unsent ATLAS tasks any more.
The server will now be able to keep up and generate new tasks.
ID: 48192
Harri Liljeroos
Joined: 28 Sep 04
Posts: 719
Credit: 48,219,747
RAC: 30,083
Message 48193 - Posted: 5 Jun 2023, 7:23:51 UTC

This morning some Atlas tasks have been cancelled by the server. Some of them had failed on other hosts but some were only sent to my hosts but are now cancelled.
ID: 48193
maeax
Joined: 2 May 07
Posts: 2228
Credit: 173,740,034
RAC: 20,313
Message 48194 - Posted: 5 Jun 2023, 7:55:05 UTC - in response to Message 48193.  

+1
ID: 48194
Saturn911
Joined: 3 Nov 12
Posts: 55
Credit: 138,867,104
RAC: 100,734
Message 48195 - Posted: 5 Jun 2023, 8:23:14 UTC - in response to Message 48193.  

This morning some Atlas tasks have been cancelled by the server. Some of them had failed on other hosts but some were only sent to my hosts but are now cancelled.

Same game here.
ID: 48195
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1411
Credit: 9,398,233
RAC: 13,272
Message 48196 - Posted: 5 Jun 2023, 8:56:46 UTC - in response to Message 48193.  

This morning some Atlas tasks have been cancelled by the server. Some of them had failed on other hosts but some were only sent to my hosts but are now cancelled.
To me it seems that those results are not needed anymore, so it is quite normal that the tasks are cancelled.
ID: 48196
maeax
Joined: 2 May 07
Posts: 2228
Credit: 173,740,034
RAC: 20,313
Message 48197 - Posted: 5 Jun 2023, 9:06:31 UTC - in response to Message 48196.  

Yes, this is often seen at the end of the month.
ID: 48197
Harri Liljeroos
Joined: 28 Sep 04
Posts: 719
Credit: 48,219,747
RAC: 30,083
Message 48215 - Posted: 13 Jun 2023, 21:38:14 UTC

I am annoyed that the server cancels tasks that have already accumulated 30-40 hours of CPU time. They should be allowed to finish and get credit, or at least get credit for the calculation they have already done. Now most of them are registered as being 0 seconds long, like this one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=394973218 although it had 40 CPU hours on the clock.
This cancelled task on the other hand shows calculation time correctly but still got no credit for it: https://lhcathome.cern.ch/lhcathome/result.php?resultid=394970727
ID: 48215
Harri Liljeroos
Joined: 28 Sep 04
Posts: 719
Credit: 48,219,747
RAC: 30,083
Message 48220 - Posted: 15 Jun 2023, 19:50:27 UTC

Again the server cancelled a bunch of tasks while they were running. This time three of them, all with over 30 hours of CPU time on the clock. Here's one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=395000335
Are these tasks faulty, and would they have failed anyway? I can't tell from the stderr what the reason for the cancellation is or what it results in.
ID: 48220
Erich56
Joined: 18 Dec 15
Posts: 1783
Credit: 116,951,995
RAC: 67,751
Message 48222 - Posted: 16 Jun 2023, 7:11:35 UTC - in response to Message 48220.  

Again the server cancelled a bunch of tasks while they were running. Now three of them, all with over 30 hours CPU time on the clock. ...
:-( :-( :-(
ID: 48222
Harri Liljeroos
Joined: 28 Sep 04
Posts: 719
Credit: 48,219,747
RAC: 30,083
Message 48226 - Posted: 18 Jun 2023, 16:07:08 UTC

Here we go again :-( Task cancelled by the server after 64 hours of CPU time: https://lhcathome.cern.ch/lhcathome/result.php?resultid=395312899 The consumed crunching time is not recorded on the server, but I can see it in BoincTasks on my computer.
ID: 48226
kotenok2000
Joined: 21 Feb 11
Posts: 72
Credit: 570,086
RAC: 139
Message 48227 - Posted: 18 Jun 2023, 16:08:45 UTC - in response to Message 48226.  

Will it be recorded when you sync boinc client with server?
ID: 48227
Harri Liljeroos
Joined: 28 Sep 04
Posts: 719
Credit: 48,219,747
RAC: 30,083
Message 48228 - Posted: 18 Jun 2023, 17:37:04 UTC - in response to Message 48227.  

Will it be recorded when you sync boinc client with server?

No, it is not.
ID: 48228
Harri Liljeroos
Joined: 28 Sep 04
Posts: 719
Credit: 48,219,747
RAC: 30,083
Message 48237 - Posted: 22 Jun 2023, 16:26:44 UTC

Here is another odd one, of a different type: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=212105058 Mine had a validation error, but look at the one that got the credit. Strange are the ways of the validator....
ID: 48237



©2024 CERN