Message boards : ATLAS application : WOW 1000 / 5000 events in one WU ? !
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 340
Credit: 46,697,229
RAC: 32,944
Message 32652 - Posted: 6 Oct 2017, 13:28:44 UTC

Hi,

I just found this:

It is a 5-core task, but I can see 6 athena.py processes?

How many events are really in it?


Supporting BOINC, a great concept !
ID: 32652
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 469
Credit: 3,348,239
RAC: 876
Message 32655 - Posted: 6 Oct 2017, 15:58:44 UTC - in response to Message 32652.  

How many events are really in it ?

Knowing the taskID (in stderr.txt) would help.
ID: 32655
Erich56

Joined: 18 Dec 15
Posts: 783
Credit: 5,529,459
RAC: 9,405
Message 32657 - Posted: 6 Oct 2017, 16:39:20 UTC

Yeti, how long did this task run so far?
ID: 32657
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 469
Credit: 3,348,239
RAC: 876
Message 32659 - Posted: 6 Oct 2017, 17:34:23 UTC - in response to Message 32657.  

Yeti, how long did this task run so far?

The athena processes in his picture show 3358 minutes ≈ 56 hours.
ID: 32659
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 340
Credit: 46,697,229
RAC: 32,944
Message 32660 - Posted: 6 Oct 2017, 18:03:58 UTC

It is/was this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158522058

It has already finished with: Disk usage limit exceeded


Supporting BOINC, a great concept !
ID: 32660
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 469
Credit: 3,348,239
RAC: 876
Message 32661 - Posted: 6 Oct 2017, 19:46:52 UTC - in response to Message 32660.  

It is/was this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158522058

It has already finished with: Disk usage limit exceeded

Yeah, that's one of the batch 12236583 tasks I mentioned in this thread.
The task should have finished after 1000 events.
I got the disk limit error after I suspended the task and a 2GB snapshot file was written into the slot directory.
In your case it seems the disk limit was exceeded while the task was still running.
It looks like only the native application will succeed in finishing those tasks or one has to manipulate the rsc_disk_bound in an early stage of such a task.
ID: 32661
HerveUAE
Joined: 18 Dec 16
Posts: 120
Credit: 8,350,237
RAC: 4,561
Message 32663 - Posted: 7 Oct 2017, 4:44:58 UTC - in response to Message 32661.  

It looks like only the native application will succeed in finishing those tasks or one has to manipulate the rsc_disk_bound in an early stage of such a task.

So should I abort the tasks that I have been running for more than 3 days? I use VirtualBox and not the native app.
We are the product of random evolution.
ID: 32663
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 340
Credit: 46,697,229
RAC: 32,944
Message 32759 - Posted: 10 Oct 2017, 15:23:57 UTC - in response to Message 32661.  

... or one has to manipulate the rsc_disk_bound in an early stage of such a task.

I just gave this a try: I changed rsc_disk_bound from 6xxxxxxxx to 60xxxxxxxx.

The WU is now up and running again; I'm looking forward to seeing what happens.


Supporting BOINC, a great concept !
ID: 32759
Erich56

Joined: 18 Dec 15
Posts: 783
Credit: 5,529,459
RAC: 9,405
Message 32760 - Posted: 10 Oct 2017, 15:52:19 UTC
Last modified: 10 Oct 2017, 16:19:26 UTC

I have a 2-core task that has been running for 16:20 hours now (console 2 shows 221 tasks so far). The 4th line from the top in console 3, "mem", shows the following values:

7.276.828k total - 7.133.176k used - 143.900k free - 67.416k buffers

The "used" value is increasing continuously, hence the "free" value is falling continuously. What will happen once the free memory is used up? I guess the task will abort.
I was hoping that the memory value of 7300MB in app_config.xml would be sufficient in any case, which is obviously not so this time :-(
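
For anyone who wants to track these numbers over time, here is a minimal Python sketch that turns a "mem" line like the one above into numbers. It assumes the dots are thousands separators, as in the values quoted here; the function name is made up for illustration.

```python
import re


def parse_mem_line(line: str) -> dict:
    """Parse '<number>k <label>' pairs; dots are treated as thousands separators."""
    fields = {}
    for number, label in re.findall(r"([\d.]+)k (\w+)", line):
        fields[label] = int(number.replace(".", ""))
    return fields


# The line quoted above, copied from console 3.
mem = parse_mem_line(
    "7.276.828k total - 7.133.176k used - 143.900k free - 67.416k buffers"
)
free_percent = 100 * mem["free"] / mem["total"]
print(f"{free_percent:.1f}% of the VM's memory is shown as free")
```

Comparing the percentage from two readings taken some time apart would show how quickly the "free" value is falling.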
ID: 32760
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 183
Credit: 4,472,111
RAC: 9,563
Message 32764 - Posted: 10 Oct 2017, 19:20:13 UTC

Hi all,

Something has gone wrong in our task submission; such long WUs are not intentional! I'll consult with the experts tomorrow, but in the meantime I've increased the disk requirement to 8GB so that if you get one of these monster WUs it has a chance to complete.
ID: 32764
Erich56

Joined: 18 Dec 15
Posts: 783
Credit: 5,529,459
RAC: 9,405
Message 32765 - Posted: 10 Oct 2017, 19:31:41 UTC

David, can you tell how many events task 158823734 is scheduled to contain?
Right now, the count is at 267. If the total is 1000, then I guess it wouldn't make sense to continue, because I would very likely run into the "disc-bound" problem, right?
Besides that, I would most probably run out of memory, too.
ID: 32765
gyllic

Joined: 9 Dec 14
Posts: 151
Credit: 1,678,907
RAC: 2,591
Message 32770 - Posted: 11 Oct 2017, 3:57:44 UTC - in response to Message 32765.  

David, can you tell how many events task 158823734 is scheduled to contain?

Where do you see tasks with the ID you mentioned in the queue right now? I assume you are talking about the 12236583 tasks, because these are the "monster tasks".

My assumption (I don't know whether this calculation is actually correct):
If you look here:
https://bigpanda.cern.ch/task/?jeditaskid=12236583 you can see that there are 1000000 events to process with 60 input files, hence 1000000/60 ≈ 16667 events/WU.
You can compare that with task ID 12236561, for example ("normal tasks"):
https://bigpanda.cern.ch/task/?jeditaskid=12236561. Here 9965000/199300 = 50 events/WU.
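
The same arithmetic as a small Python sketch. It rests on the same unverified assumption as the estimate above, namely that each input file corresponds to one WU; the function name is made up for illustration.

```python
def events_per_wu(total_events: int, input_files: int) -> float:
    """Average events per work unit, ASSUMING one input file per WU."""
    return total_events / input_files


# Totals and file counts as read from the two bigpanda task pages above.
monster = events_per_wu(1_000_000, 60)       # batch 12236583
normal = events_per_wu(9_965_000, 199_300)   # batch 12236561

print(f"monster batch: about {monster:.0f} events/WU")
print(f"normal batch:  {normal:.0f} events/WU")
```

If the one-file-per-WU assumption does not hold, the per-WU figure would be different, so this is only a rough cross-check.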
ID: 32770
Erich56

Joined: 18 Dec 15
Posts: 783
Credit: 5,529,459
RAC: 9,405
Message 32771 - Posted: 11 Oct 2017, 4:39:13 UTC - in response to Message 32770.  

Where do you see tasks with the id you mentioned in the queue right now?

I see this in my tasks list.
If in that list I click on this task, the following page opens:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=158823734
ID: 32771
Erich56

Joined: 18 Dec 15
Posts: 783
Credit: 5,529,459
RAC: 9,405
Message 32772 - Posted: 11 Oct 2017, 6:43:45 UTC
Last modified: 11 Oct 2017, 6:46:23 UTC

The situation of the "monster task" at this moment is as follows:

Runtime 1 day and 7 hours.

In console_2, the event count seems stuck at 305; this is the highest value shown since early this morning, with no increase since.

In console_3, under "mem", some 129.000k of the reserved 7.276.828k are shown as "free".

In Windows Explorer, the image size shown in the slot directory is 4.576.256 KB.

So, any advice from the experts as to how I should proceed? Is there a chance that the task can complete? Or will the reserved memory be exhausted soon, and/or will I run into the rsc_disk_bound problem?

I would hate to block 2 CPU cores for another day or so, only to end up with an invalid task.

Edit: What I forgot to mention - although this is a 2-core task, 3 athena.py are running (as seen in console_3)
ID: 32772
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 340
Credit: 46,697,229
RAC: 32,944
Message 32773 - Posted: 11 Oct 2017, 7:00:03 UTC
Last modified: 11 Oct 2017, 7:00:22 UTC

Erich56,

No one knows what will happen with your task.

If you are here for the science, let it go

Did you change rsc_disk_bound? If not, your task will most likely fail.

But even if your task really does finish, it is not guaranteed that the validator will mark your result as valid.


Supporting BOINC, a great concept !
ID: 32773
gyllic

Joined: 9 Dec 14
Posts: 151
Credit: 1,678,907
RAC: 2,591
Message 32774 - Posted: 11 Oct 2017, 7:01:21 UTC - in response to Message 32771.  

Where do you see tasks with the id you mentioned in the queue right now?

I see this in my tasks list.
If in that list I click on this task, the following page opens:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=158823734

OK. We are talking about two different things. The ID you mentioned is a "result ID", and from it you cannot see which actual task ID you are running.
To see the task ID (which determines the duration of the WU) before the job has finished, you have to look into the logs. Once it is done, you can see parts of the log files at the link you posted.
ID: 32774
Erich56

Joined: 18 Dec 15
Posts: 783
Credit: 5,529,459
RAC: 9,405
Message 32775 - Posted: 11 Oct 2017, 7:12:35 UTC - in response to Message 32773.  

Erich56,

No one knows what will happen with your task.

If you are here for the science, let it go

Did you change rsc_disk_bound? If not, your task will most likely fail.

But even if your task really does finish, it is not guaranteed that the validator will mark your result as valid.

Thanks, Yeti, for your reply. That's about what I am guessing, too. I will let it go and see what happens.

Please let me know how the change to rsc_disk_bound can be made (although, for this task, it may be too late?)
ID: 32775
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 469
Credit: 3,348,239
RAC: 876
Message 32777 - Posted: 11 Oct 2017, 7:40:12 UTC
Last modified: 11 Oct 2017, 7:51:50 UTC

@Erich:

You can find the taskID in stderr.txt in the slot where your BOINC task is running, but with the PandaID you can look up the number of events as a job parameter.
E.g.: https://bigpanda.cern.ch/job?pandaid=3650613998
Replace the number in the URL with your PandaID and search for --maxEvents=
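
If you copy the job parameters from that page into a text file or string, the search can also be done with a few lines of Python. This is only a sketch: the sample parameter string below is made up, and only the --maxEvents= flag itself comes from the job page.

```python
import re
from typing import Optional


def max_events(job_parameters: str) -> Optional[int]:
    """Return the value given after --maxEvents=, or None if the flag is absent."""
    match = re.search(r"--maxEvents=(\d+)", job_parameters)
    return int(match.group(1)) if match else None


# Hypothetical parameter string, for illustration only.
sample = "--someOtherOption=abc --maxEvents=1000 --anotherOption=0"
print(max_events(sample))
```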


Changing <rsc_disk_bound>6000000000.000000</rsc_disk_bound> to <rsc_disk_bound>60000000000.000000</rsc_disk_bound> (one extra 0) is tricky, but not impossible.

If you have several VMs running, suspend them one after another (with "Leave in memory" off).
Wait until the suspending VM has reached the state 'Saved' before suspending the next one.
Once all are saved, stop the BOINC client (not only BOINC Manager) and edit client_state.xml.
Increase rsc_disk_bound for your problematic task (or for all ATLAS tasks) and save the file. Make sure you use a plain-text editor such as Notepad.
Restart BOINC and resume the VMs at intervals.
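
The edit itself could also be scripted instead of done by hand. The following Python sketch works on the text of client_state.xml and multiplies every rsc_disk_bound value by 10. Note the assumptions: it changes every workunit in the file (all projects, not only ATLAS), the function name is made up, and it must only be applied while the BOINC client is stopped, as in the steps above.

```python
import re


def raise_disk_bound(client_state_xml: str, factor: float = 10.0) -> str:
    """Multiply every <rsc_disk_bound> value found in the text by `factor`."""
    pattern = re.compile(r"(<rsc_disk_bound>)([0-9.]+)(</rsc_disk_bound>)")

    def bump(m):
        # Keep the six-decimal formatting used in the example above.
        new_value = float(m.group(2)) * factor
        return f"{m.group(1)}{new_value:.6f}{m.group(3)}"

    return pattern.sub(bump, client_state_xml)


# The line from the example above, before the change.
sample = "<rsc_disk_bound>6000000000.000000</rsc_disk_bound>"
print(raise_disk_bound(sample))
```

To use it on the real file, you would read client_state.xml from your BOINC data directory, keep a backup copy, and write the returned text back. Limiting the change to a single task would need extra logic to match that workunit's entry first.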
ID: 32777
Erich56

Joined: 18 Dec 15
Posts: 783
Credit: 5,529,459
RAC: 9,405
Message 32778 - Posted: 11 Oct 2017, 9:28:03 UTC

An hour ago, I changed rsc_disk_bound as described by Crystal Pellet (thanks for that). In fact, it was not really complicated.

What I saw in the client_state.xml right under the rsc_disk_bound line was the rsc_memory_bound entry, which, interestingly enough, showed the 4.400MB default value, and not the higher value (7.200MB) from the app_config.xml. So I decided not to make any change (=increase) there, as this would most probably not have any effect.

Is there any other way to increase the 7.200MB value for this current task?

BTW, what somewhat bothers me is that console_2 still shows 305 as the highest event number, so no change since early morning. Could this mean that the task is running in some kind of endless loop?
ID: 32778
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 340
Credit: 46,697,229
RAC: 32,944
Message 32779 - Posted: 11 Oct 2017, 9:37:14 UTC - in response to Message 32778.  

BTW, what somewhat bothers me is that console_2 still shows 305 as the highest event number, so no change since early morning. Could this mean that the task is running in some kind of endless loop?

Nope !

It looks like this output is sorted by hours, since for me it shows 23:59:xx. But when the screen refreshes I can see higher event numbers scrolling through very fast. My guess is that the time is now 1 day, xx hours and that this causes a sorting problem on the current screen.


Supporting BOINC, a great concept !
ID: 32779


©2018 CERN