Message boards : ATLAS application : WOW 1000 / 5000 events in one WU ? !
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1280
Credit: 8,491,903
RAC: 2,069
Message 32780 - Posted: 11 Oct 2017, 9:43:01 UTC - in response to Message 32778.  

..., showed the 4.400MB default value, and not the higher value (7.200MB) from the app_config.xml.

Probably you still have Max # of cores in your project preferences set to 2.

Is there any other way to increase the 7.200MB value for this current task?

Not for a running task. It's possible, but your task would have to start again from the very beginning, and I think that's not what you want.

BTW, what somewhat bothers me is that console_2 still shows 305 as the highest event number, so no change since early morning. Could this mean that the task is running in some kind of endless loop?

If Console3 still shows the athena processes running at almost 100%, then it's OK.
Console2 shows events from yesterday or the day before, as Yeti described.
You could try locking and releasing the lock screen key very quickly to see the actual events.
ID: 32780
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 32781 - Posted: 11 Oct 2017, 10:48:27 UTC

It looks like this output is sorted by hours as for me it is 23:59:xx


Well spotted Yeti! This is indeed a bug in the script generating the console output and I will fix it soon.

I assume (I don't know if that calculation is actually correct):
If you look here:
https://bigpanda.cern.ch/task/?jeditaskid=12236583 you can see that there are 1000000 events to process with 60 input files, hence 1000000/60=16666 events/wu.
You can compare that with the tasks id 12236561 for example ("normal tasks"):
https://bigpanda.cern.ch/task/?jeditaskid=12236561. Here 9965000/199300=50 events/wu.


The monster WUs process 1000 events. Task 12236583 was badly configured, and no new WUs should be submitted for that task until the experts have sorted it out. Assuming the disk values are OK now, the current WU should complete eventually. The memory usage should not increase significantly even after running for many days.
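
As a rough check of the quoted estimate, the division can be reproduced in a few lines of Python. This is only a sketch: the helper name `events_per_wu` is made up for illustration, and it assumes, as the quoted post does, that the total event count is split evenly over the listed files or jobs. For the monster task that assumption does not hold, since each WU actually processes 1000 events.

```python
def events_per_wu(total_events: int, n_units: int) -> float:
    """Average events per unit, assuming an even split (hypothetical helper)."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return total_events / n_units


# Figures quoted above from the two bigpanda task pages.
monster_estimate = events_per_wu(1_000_000, 60)      # task 12236583
normal_estimate = events_per_wu(9_965_000, 199_300)  # task 12236561

print(f"task 12236583: about {monster_estimate:.0f} events per input file")
print(f"task 12236561: {normal_estimate:.0f} events per WU")
```

The second figure matches the usual 50-event WUs; the first only shows what an even split over 60 input files would imply.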

Edit: What I forgot to mention - although this is a 2-core task, 3 athena.py are running (as seen in console_3)


People have asked this a few times so I added a section to the "info on ATLAS tasks" thread. Basically there is one master process which controls the other processes which do the real simulation. The master process should use very little CPU on average.
ID: 32781
maeax
Joined: 2 May 07
Posts: 2100
Credit: 159,816,975
RAC: 134,993
Message 32783 - Posted: 11 Oct 2017, 11:21:32 UTC

If this is such a task, it ran successfully with the native app:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=158748026
ID: 32783
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1280
Credit: 8,491,903
RAC: 2,069
Message 32785 - Posted: 11 Oct 2017, 11:47:29 UTC - in response to Message 32783.  

If this is such a task, it ran successfully with the native app:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=158748026

It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.
ID: 32785
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,569,815
RAC: 10,128
Message 32786 - Posted: 11 Oct 2017, 12:29:57 UTC

I had to abort mine because there seems to have been some kind of reset inside the VM; the actual time was back again (at 14:30) and the events were down to something like 30 or above.


Supporting BOINC, a great concept !
ID: 32786
maeax
Joined: 2 May 07
Posts: 2100
Credit: 159,816,975
RAC: 134,993
Message 32787 - Posted: 11 Oct 2017, 12:40:46 UTC - in response to Message 32785.  
Last modified: 11 Oct 2017, 12:44:25 UTC

duplicate, sorry.
ID: 32787
maeax
Joined: 2 May 07
Posts: 2100
Credit: 159,816,975
RAC: 134,993
Message 32788 - Posted: 11 Oct 2017, 12:40:50 UTC - in response to Message 32785.  
Last modified: 11 Oct 2017, 12:42:23 UTC

It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.

Thank you, Crystal. I had a look in bigpanda.
This task failed there. It looks like an iceberg: only 1% is above the water.
ID: 32788
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1280
Credit: 8,491,903
RAC: 2,069
Message 32789 - Posted: 11 Oct 2017, 13:06:47 UTC - in response to Message 32788.  

It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.

Thank you, Crystal. I had a look in bigpanda.
This task failed there. It looks like an iceberg: only 1% is above the water.

Saw that, but strangely enough, BOINC's stderr.txt shows that a HITS file was uploaded. So uploading a HITS file is no guarantee of success.
ID: 32789
Erich56
Joined: 18 Dec 15
Posts: 1688
Credit: 103,823,448
RAC: 121,753
Message 32793 - Posted: 11 Oct 2017, 15:16:26 UTC - in response to Message 32789.  

... So uploading a HITS file is no guarantee of success.

:-( :-( :-(
ID: 32793
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 32804 - Posted: 12 Oct 2017, 9:51:50 UTC - in response to Message 32789.  

It was one of batch 12189412 (a 99% ready batch) with 'only' 50 events.

Thank you, Crystal. I had a look in bigpanda.
This task failed there. It looks like an iceberg: only 1% is above the water.

Saw that, but strangely enough, BOINC's stderr.txt shows that a HITS file was uploaded. So uploading a HITS file is no guarantee of success.


The problem with this task is related to the fact it was restarted:

"This is trying to run the run_atlas wrapper for the 2nd time,but it is not an Event Service job, so will restart the job"

It seems we do not clean up enough after the restart, which led to the eventual failure. We will work on improving this.
ID: 32804
Erich56
Joined: 18 Dec 15
Posts: 1688
Credit: 103,823,448
RAC: 121,753
Message 32820 - Posted: 13 Oct 2017, 5:36:40 UTC - in response to Message 32772.  
Last modified: 13 Oct 2017, 6:23:58 UTC

The situation of the "monster task" at this moment is as follows:

Runtime 1 day and 7 hours.

In console_2, the event count somehow got stuck at 305; this is the highest value which has been shown since early this morning. No increase since.

In console_3, under "mem", from the reserved 7.276.828k some 129.000k are shown as "free".

In the Windows Explorer, the image size shown in the slot directory is 4.576.256kb.

So, any advice from the experts as to how I should proceed? Is there a chance that the task can complete? Or will the reserved memory be exhausted soon, and/or will I run into the rsc_disk_bound problem?


Just an update to report about the current situation:

The task has now run for 3 days and 6 hours. I was able to increase the rsc_disk_bound value, so there should be no problem from this side (the size of the image right now is 7,41GB)

What still makes me worry is the free memory shown in console_3:
from the 7276828k (which I had made available in the app_config.xml) yesterday the "free" value jumped up to some 2,4GB, but right now it's dropped again to read 144184k, which is very low - so I might run out of memory at some point.

Still no way to see the number of events processed (out of the total 1000) in console_2 (regardless of what I try).
Further, the deadline for the task is Oct. 17. So my question: any idea how much time a 1000 events task would roughly take? Is there a chance at all to get it finished within the 1-week-deadline?
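
One very rough way to approach the deadline question is a linear extrapolation from the figures reported earlier in this post (305 events after 1 day and 7 hours). This is a sketch under strong assumptions: that the event rate stays constant, and that the 305 shown in console_2 was current at that time, which the replies in this thread suggest it may not have been. The function name is hypothetical.

```python
def estimate_total_hours(events_done: int, hours_elapsed: float, total_events: int) -> float:
    """Linear extrapolation of total runtime; assumes a constant event rate."""
    if events_done <= 0:
        raise ValueError("need at least one processed event to extrapolate")
    return hours_elapsed / events_done * total_events


# Values reported in this thread: 305 events after 1 day 7 hours, 1000 events in total.
elapsed_hours = 24 + 7
total_hours = estimate_total_hours(305, elapsed_hours, 1000)

print(f"estimated total runtime: {total_hours:.1f} h ({total_hours / 24:.1f} days)")
```

If the counter was stale and more events had already been processed, the real rate would be higher and the actual runtime shorter than this estimate.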
ID: 32820
Erich56
Joined: 18 Dec 15
Posts: 1688
Credit: 103,823,448
RAC: 121,753
Message 32823 - Posted: 13 Oct 2017, 10:14:41 UTC

when I just came back to the PC, the Monster Task was no longer running :-(

Stderr tells the following:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
aborted by project - no longer usable
</message>

Why did "the project" abort my task?
3 1/2 days with 2 cores for nothing :-( Rather frustrating :-(
ID: 32823
Erich56
Joined: 18 Dec 15
Posts: 1688
Credit: 103,823,448
RAC: 121,753
Message 32826 - Posted: 14 Oct 2017, 7:42:48 UTC - in response to Message 32823.  

when I just came back to the PC, the Monster Task was no longer running :-(

Stderr tells the following:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
aborted by project - no longer usable
</message>

Why did "the project" abort my task?
3 1/2 days with 2 cores for nothing :-( Rather frustrating :-(


Does no one have any idea why, or who exactly from the "project", killed my task?
I am still quite annoyed about that, to say the least :-(
ID: 32826
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 32844 - Posted: 16 Oct 2017, 10:20:00 UTC - in response to Message 32826.  

when I just came back to the PC, the Monster Task was no longer running :-(

Stderr tells the following:

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
aborted by project - no longer usable
</message>

Why did "the project" abort my task?
3 1/2 days with 2 cores for nothing :-( Rather frustrating :-(


Does no one have any idea why, or who exactly from the "project", killed my task?
I am still quite annoyed about that, to say the least :-(


These monster WUs were caused by a badly configured batch of tasks, and the people responsible cancelled them all. I thought that the running WUs would be allowed to complete, so at least you would get the credit (and the satisfaction of completing it), but everything was cancelled. Sorry for wasting your CPU.
ID: 32844