Message boards :
ATLAS application :
Current batch contains tasks with abnormal runtimes
Author | Message |
---|---|
Joined: 14 Jan 10 Posts: 1411 Credit: 9,398,233 RAC: 13,272 |
The current batch has very short-running tasks. The downloaded pool.root file is only 887 kB instead of the 'normal' 300 - 500 MB. Each event takes between 5 and 60 seconds to process on my system. The number of events is 500. |
Joined: 21 Feb 11 Posts: 72 Credit: 570,086 RAC: 139 |
Now they run for 8 hours. |
Joined: 14 Jan 10 Posts: 1411 Credit: 9,398,233 RAC: 13,272 |
I only compare on one and the same system. The runtime of the very short tasks on that system was 30 - 32 minutes, whereas the normal runtime is about 2 hours on average. Tasks now have a download file of 390 MB and an upload result file of 370 MB, and the runtimes are about 9 hours with 4 to 5 threads/task on the VM. |
Joined: 14 Jan 10 Posts: 1411 Credit: 9,398,233 RAC: 13,272 |
.... and I haven't had much success in bringing them back to life again. Suspending and resuming worked for one, but the other two finished with 'computation error'. When the last event has been processed, a result file (HITS file) must be compiled. That needs no CPU and takes several minutes. Another point is that you are using multi-core tasks. Towards the end of a task, not all CPUs are used anymore, because there are not enough events left from the initial 500. So your CPU usage drops from 4 cores to 1, or from 8 cores to 1, during the last events. Such a last event could just be a longer one, half an hour or so. So don't touch a running task, and don't suspend and resume it. |
Joined: 21 Feb 11 Posts: 72 Credit: 570,086 RAC: 139 |
One good thing is that we won't have unsent ATLAS tasks any more. The server will now be able to keep up and generate new tasks. |
Joined: 28 Sep 04 Posts: 719 Credit: 48,219,747 RAC: 30,083 |
This morning some Atlas tasks have been cancelled by the server. Some of them had failed on other hosts but some were only sent to my hosts but are now cancelled. |
Joined: 2 May 07 Posts: 2228 Credit: 173,740,034 RAC: 20,313 |
+1 |
Joined: 3 Nov 12 Posts: 55 Credit: 138,867,104 RAC: 100,734 |
This morning some Atlas tasks have been cancelled by the server. Some of them had failed on other hosts but some were only sent to my hosts but are now cancelled. Same game here. |
Joined: 14 Jan 10 Posts: 1411 Credit: 9,398,233 RAC: 13,272 |
This morning some Atlas tasks have been cancelled by the server. Some of them had failed on other hosts but some were only sent to my hosts but are now cancelled. To me it seems that those results are not needed anymore, so it is quite normal that the tasks are cancelled. |
Joined: 2 May 07 Posts: 2228 Credit: 173,740,034 RAC: 20,313 |
Yes, this is often seen at the end of the month. |
Joined: 28 Sep 04 Posts: 719 Credit: 48,219,747 RAC: 30,083 |
I am annoyed that the server cancels tasks that have already accumulated 30-40 hours of CPU time. They should be allowed to finish and get credit, or at least get credit for the calculation they have already gone through. Now most of them are registered as being 0 seconds long, like this: https://lhcathome.cern.ch/lhcathome/result.php?resultid=394973218 although it had 40 CPU hours on the clock. This cancelled task, on the other hand, shows the calculation time correctly but still got no credit for it: https://lhcathome.cern.ch/lhcathome/result.php?resultid=394970727 |
Joined: 28 Sep 04 Posts: 719 Credit: 48,219,747 RAC: 30,083 |
Again the server cancelled a bunch of tasks while they were running. Now three of them, all with over 30 hours CPU time on the clock. Here's one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=395000335 Are these tasks faulty, and would they have failed anyway? I can't tell from the stderr what the reason for the cancellation was. |
Joined: 18 Dec 15 Posts: 1783 Credit: 116,951,995 RAC: 67,751 |
Again the server cancelled a bunch of tasks while they were running. Now three of them, all with over 30 hours CPU time on the clock. ... :-( :-( :-( |
Joined: 28 Sep 04 Posts: 719 Credit: 48,219,747 RAC: 30,083 |
Here we go again :-( Task cancelled by the server after 64 hours of CPU time: https://lhcathome.cern.ch/lhcathome/result.php?resultid=395312899 The consumed crunching time is not recorded on the server, but I can see it in BoincTasks on my computer. |
Joined: 21 Feb 11 Posts: 72 Credit: 570,086 RAC: 139 |
Will it be recorded when you sync boinc client with server? |
Joined: 28 Sep 04 Posts: 719 Credit: 48,219,747 RAC: 30,083 |
Will it be recorded when you sync boinc client with server? No, it is not. |
Joined: 28 Sep 04 Posts: 719 Credit: 48,219,747 RAC: 30,083 |
Here is another odd one, of a different type: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=212105058 Mine had a validation error, but look at the one that got the credit. Strange are the ways of the validator.... |
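
The point made earlier in the thread, that CPU usage drops towards the end of a multi-core task because too few of the initial 500 events are left, can be illustrated with a small simulation. This is only a rough sketch: the function names, the greedy "next free worker takes the next event" scheduling, and the example durations are assumptions for illustration, not how ATLAS actually dispatches events.

```python
import heapq


def simulate_event_pool(durations, workers):
    """Hypothetical model: each free worker takes the next unprocessed event.

    Returns (makespan, sorted_worker_finish_times), in the unit of `durations`.
    """
    # Min-heap of (time_when_free, worker_id); every worker is free at t = 0.
    free_at = [(0.0, w) for w in range(workers)]
    heapq.heapify(free_at)
    for d in durations:
        t, w = heapq.heappop(free_at)        # earliest-free worker
        heapq.heappush(free_at, (t + d, w))  # busy until t + d
    finish = sorted(t for t, _ in free_at)
    return finish[-1], finish


def busy_workers_at(finish_times, t):
    """Workers still processing at time t.

    Only meaningful once the event queue is empty, i.e. for t at or after
    the earliest worker finish time.
    """
    return sum(1 for f in finish_times if f > t)


if __name__ == "__main__":
    # Assumed workload: 499 events of 30 s, then one 30-minute straggler.
    events = [30] * 499 + [1800]
    makespan, finish = simulate_event_pool(events, workers=4)
    first_idle = finish[0]
    print("first core goes idle at", first_idle, "s")
    print("task ends at", makespan, "s")
    print("cores busy right after that point:", busy_workers_at(finish, first_idle))
```

Under these assumed numbers, three of the four cores sit idle for roughly the last half hour while a single long event finishes, which looks like a stalled task from the outside even though it is still progressing.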
©2024 CERN