Message boards : ATLAS application : WOW 1000 / 5000 events in one WU ?!
Joined: 14 Jan 10 · Posts: 1280 · Credit: 8,491,903 · RAC: 2,069
> ..., showed the 4400 MB default value, and not the higher value (7200 MB) from the app_config.xml.
Probably you still have "Max # of cores" in your project preferences set to 2.

> Is there any other way to increase the 7200 MB value for this current task?
Not for a running task. It's possible, but your task would have to start again from the very beginning, and I think that's not what you want.

> BTW, what somewhat bothers me is that console_2 still shows 305 as the highest event number, so no change since early morning. Could this mean that the task is running in some kind of endless loop?
If you still see in console_3 the athena.py processes running at almost 100%, then it's OK. console_2 does show an event from yesterday or the day before, as Yeti described. You could try locking/releasing the lock screen key very quickly to see the actual events.
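For reference, the memory override discussed above is normally done with an app_config.xml in the project directory, which pins the core count and passes the VM memory size on to vboxwrapper. A minimal sketch, assuming the usual ATLAS vbox plan class (check your client_state.xml if yours differs):

<app_config>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>2</avg_ncpus>
    <!-- VM RAM in MB; 7200 is the value discussed in this thread -->
    <cmdline>--nthreads 2 --memory_size_mb 7200</cmdline>
  </app_version>
</app_config>

As noted above, this only takes effect for tasks that start (or restart) after the client re-reads the configuration, and the web preference for the maximum number of CPUs can still cap the cores actually used.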
Joined: 13 May 14 · Posts: 387 · Credit: 15,314,184 · RAC: 0
> It looks like this output is sorted by hours as for me it is 23:59:xx
Well spotted, Yeti! This is indeed a bug in the script generating the console output and I will fix it soon.

> I assume (I don't know if that calculation is actually correct):
The monster WUs process 1000 events. The task 12236583 was badly configured and no new WUs should be submitted for it until the experts have sorted it out. Assuming the disk values are OK now, the current WU should complete eventually. The memory usage should not increase significantly, even after running for many days.

> Edit: What I forgot to mention - although this is a 2-core task, 3 athena.py are running (as seen in console_3)
People have asked this a few times, so I added a section to the "info on ATLAS tasks" thread. Basically, there is one master process which controls the other processes that do the real simulation. The master process should use very little CPU on average.
Joined: 2 May 07 · Posts: 2100 · Credit: 159,816,975 · RAC: 134,993
Is this such a task? It ran with the native app and completed successfully: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158748026
Joined: 14 Jan 10 · Posts: 1280 · Credit: 8,491,903 · RAC: 2,069
> Is this such a task? It ran with the native app and completed successfully:
It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.
Joined: 2 May 07 · Posts: 2100 · Credit: 159,816,975 · RAC: 134,993
Duplicate, sorry.
Joined: 2 May 07 · Posts: 2100 · Credit: 159,816,975 · RAC: 134,993
> It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.
Thank you, Crystal. I had a look in BigPanda; the task failed there. It looks like an iceberg: only 1% is above the water.
Joined: 14 Jan 10 · Posts: 1280 · Credit: 8,491,903 · RAC: 2,069
> It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.
Saw that, but strangely enough, BOINC's stderr.txt shows that a HITS file was uploaded. So uploading a HITS file is no guarantee of success.
Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,823,448 · RAC: 121,753
> ... So uploading a HITS file is no guarantee of success.
:-( :-( :-(
Joined: 13 May 14 · Posts: 387 · Credit: 15,314,184 · RAC: 0
> It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.
The problem with this task is related to the fact that it was restarted: "This is trying to run the run_atlas wrapper for the 2nd time, but it is not an Event Service job, so will restart the job". It seems we do not clean up enough after the restart, which led to the eventual failure. We will work on improving this.
Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,823,448 · RAC: 121,753
> The situation of the "monster task" at this moment is as follows:
Just an update on the current situation:

The task has now run for 3 days and 6 hours. I was able to increase the rsc_disk_bound value, so there should be no problem from that side (the size of the image right now is 7.41 GB).

What still worries me is the free memory shown in console_3: of the 7276828k which I had made available via the app_config.xml, the "free" value had jumped up to some 2.4 GB yesterday, but right now it has dropped again to 144184k, which is very low, so I might run out of memory at some point.

There is still no way to see the number of events processed (out of the total 1000) in console_2, regardless of what I try.

Furthermore, the deadline for the task is Oct. 17. So my question: any idea how much time a 1000-event task would roughly take? Is there a chance at all to get it finished within the 1-week deadline?
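For reference, the rsc_disk_bound adjustment mentioned above amounts to editing the task's <workunit> entry in client_state.xml (in the BOINC data directory) while the client is fully shut down; the value is in bytes. A rough sketch, with a made-up task name and rounded numbers:

<workunit>
    <name>EXAMPLE_ATLAS_monster_task</name>
    <app_name>ATLAS</app_name>
    <rsc_memory_bound>7549747200.000000</rsc_memory_bound>
    <!-- disk limit in bytes; ~16 GB, well above the ~7.4 GB image mentioned above -->
    <rsc_disk_bound>16000000000.000000</rsc_disk_bound>
</workunit>

Edit only while the client is stopped; a running client periodically rewrites client_state.xml and would overwrite the change with the old value.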
Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,823,448 · RAC: 121,753
When I just came back to the PC, the monster task was no longer running :-(
Stderr tells the following:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
aborted by project - no longer usable
</message>

Why did "the project" abort my task? 3 1/2 days with 2 cores for nothing :-( Rather frustrating :-(
Joined: 18 Dec 15 · Posts: 1688 · Credit: 103,823,448 · RAC: 121,753
> When I just came back to the PC, the monster task was no longer running :-(
Does nobody have any idea why, or who exactly from the "project" killed my task? I am still quite annoyed about that, to say the least :-(
Joined: 13 May 14 · Posts: 387 · Credit: 15,314,184 · RAC: 0
> When I just came back to the PC, the monster task was no longer running :-(
These monster WUs were caused by a badly configured batch of tasks, and the people responsible cancelled them all. I thought the WUs already running would be allowed to complete, so that at least you would get the credit (and the satisfaction of completing it), but everything was cancelled. Sorry for wasting your CPU time.