Message boards : ATLAS application : WOW 1000 / 5000 events in one WU ?!
Joined: 14 Jan 10 · Posts: 1280 · Credit: 8,491,903 · RAC: 2,069
> ..., showed the 4400 MB default value, and not the higher value (7200 MB) from the app_config.xml.
Probably you still have "Max # of cores" in your project preferences set to 2.

> Is there any other way to increase the 7200 MB value for this current task?
Not for a running task. It's possible, but your task would have to start again from the very beginning, and I think that's not what you want.

> BTW, what somewhat bothers me is that console_2 still shows 305 as the highest event number, so no change since early morning. Could this mean that the task is running in some kind of endless loop?
If you still see in console_3 the athena.py processes running at almost 100%, then it's OK. console_2 does show an event from yesterday or the day before, as Yeti described. You could try locking/releasing the lock screen key very quickly to see the actual events.
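For reference, the memory override discussed above is normally done with an app_config.xml in the project directory, which pins the core count and passes the VM memory size on to vboxwrapper. A minimal sketch, assuming the usual ATLAS vbox plan class (check your client_state.xml if yours differs):

<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2</avg_ncpus>
    <!-- VM RAM in MB; 7200 is the value discussed in this thread -->
    <cmdline>--nthreads 2 --memory_size_mb 7200</cmdline>
  </app_version>
</app_config>

As noted above, this only takes effect for tasks that start (or restart) after the client re-reads the configuration, and the web preference for the maximum number of CPUs can still cap the cores actually used.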
Joined: 13 May 14 · Posts: 387 · Credit: 15,314,184 · RAC: 0
> It looks like this output is sorted by hours as for me it is 23:59:xx
Well spotted, Yeti! This is indeed a bug in the script generating the console output and I will fix it soon.

> I assume (I don't know if that calculation is actually correct):
The monster WUs process 1000 events. The task 12236583 was badly configured and no new WUs should be submitted for it until the experts have sorted it out. Assuming the disk values are OK now, the current WU should complete eventually. The memory usage should not increase significantly, even after running for many days.

> Edit: What I forgot to mention - although this is a 2-core task, 3 athena.py are running (as seen in console_3)
People have asked this a few times, so I added a section to the "info on ATLAS tasks" thread. Basically, there is one master process which controls the other processes that do the real simulation. The master process should use very little CPU on average.
Joined: 2 May 07 · Posts: 2100 · Credit: 159,816,975 · RAC: 134,993
Is this such a task? It ran with the native app and completed successfully: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158748026
Joined: 14 Jan 10 · Posts: 1280 · Credit: 8,491,903 · RAC: 2,069
> Is this such a task? It ran with the native app and completed successfully:
It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.
Joined: 2 May 07 · Posts: 2100 · Credit: 159,816,975 · RAC: 134,993
Duplicate, sorry.
Joined: 2 May 07 · Posts: 2100 · Credit: 159,816,975 · RAC: 134,993
> It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.
Thank you, Crystal. I had a look in BigPanda; the task failed there. It looks like an iceberg: only 1% is above the water.
Joined: 14 Jan 10 · Posts: 1280 · Credit: 8,491,903 · RAC: 2,069
> It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.
Saw that, but strangely enough, BOINC's stderr.txt shows that a HITS file was uploaded. So uploading a HITS file is no guarantee of success.
Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,823,448 · RAC: 121,753
> ... So uploading a HITS file is no guarantee of success.
:-( :-( :-(
Joined: 13 May 14 · Posts: 387 · Credit: 15,314,184 · RAC: 0
> It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.
The problem with this task is related to the fact that it was restarted: "This is trying to run the run_atlas wrapper for the 2nd time, but it is not an Event Service job, so will restart the job". It seems we do not clean up enough after the restart, which led to the eventual failure. We will work on improving this.
Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,823,448 · RAC: 121,753
> The situation of the "monster task" at this moment is as follows:
Just an update on the current situation:

The task has now run for 3 days and 6 hours. I was able to increase the rsc_disk_bound value, so there should be no problem from that side (the size of the image right now is 7.41 GB).

What still worries me is the free memory shown in console_3: of the 7276828k which I had made available via the app_config.xml, the "free" value had jumped up to some 2.4 GB yesterday, but right now it has dropped again to 144184k, which is very low, so I might run out of memory at some point.

There is still no way to see the number of events processed (out of the total 1000) in console_2, regardless of what I try.

Furthermore, the deadline for the task is Oct. 17. So my question: any idea how much time a 1000-event task would roughly take? Is there a chance at all to get it finished within the 1-week deadline?
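For reference, the rsc_disk_bound adjustment mentioned above amounts to editing the task's <workunit> entry in client_state.xml (in the BOINC data directory) while the client is fully shut down; the value is in bytes. A rough sketch, with a made-up task name and rounded numbers:

<workunit>
    <name>EXAMPLE_ATLAS_monster_task</name>
    <app_name>ATLAS</app_name>
    <rsc_memory_bound>7549747200.000000</rsc_memory_bound>
    <!-- disk limit in bytes; ~16 GB, well above the ~7.4 GB image mentioned above -->
    <rsc_disk_bound>16000000000.000000</rsc_disk_bound>
</workunit>

Edit only while the client is stopped; a running client periodically rewrites client_state.xml and would overwrite the change with the old value.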
Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,823,448 · RAC: 121,753
When I just came back to the PC, the monster task was no longer running :-(
Stderr tells the following:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
aborted by project - no longer usable
</message>

Why did "the project" abort my task? 3 1/2 days with 2 cores for nothing :-( Rather frustrating :-(
Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,823,448 · RAC: 121,753
> When I just came back to the PC, the monster task was no longer running :-(
Does nobody have any idea why, or who exactly from the "project" killed my task? I am still quite annoyed about that, to say the least :-(
Joined: 13 May 14 · Posts: 387 · Credit: 15,314,184 · RAC: 0
> When I just came back to the PC, the monster task was no longer running :-(
These monster WUs were caused by a badly configured batch of tasks, and the people responsible cancelled them all. I thought the WUs already running would be allowed to complete, so that at least you would get the credit (and the satisfaction of completing it), but everything was cancelled. Sorry for wasting your CPU time.