WOW 1000 / 5000 events in one WU ? !
Author | Message |
---|---|
Joined: 2 Sep 04 Posts: 450 Credit: 171,060,635 RAC: 25,360 |
|
Joined: 14 Jan 10 Posts: 1168 Credit: 7,217,203 RAC: 2,072 |
How many events are really in it ? Knowing the taskID (in stderr.txt) would help. |
Joined: 18 Dec 15 Posts: 1571 Credit: 66,640,424 RAC: 166,060 |
Yeti, how long did this task run so far? |
Joined: 14 Jan 10 Posts: 1168 Credit: 7,217,203 RAC: 2,072 |
Yeti, how long did this task run so far? The athena processes in his picture show 3358 minutes = 56 hours. |
Joined: 2 Sep 04 Posts: 450 Credit: 171,060,635 RAC: 25,360 |
It is/was this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158522058 It has already finished with: Disk usage limit exceeded. Supporting BOINC, a great concept! |
Joined: 14 Jan 10 Posts: 1168 Credit: 7,217,203 RAC: 2,072 |
It is/was this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158522058 Yeah, that's one of the tasks from batch 12236583 that I mentioned in this thread. The task should have finished after 1000 events. I got the disk limit error after I suspended the task and a 2GB snapshot file was written into the slot directory. In your case it seems the disk limit was exceeded while the task was still running. It looks like only the native application will manage to finish those tasks, unless one manipulates the rsc_disk_bound at an early stage of such a task. |
Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0 |
It looks like only the native application will manage to finish those tasks, unless one manipulates the rsc_disk_bound at an early stage of such a task. So should I abort the tasks that I have been running for more than 3 days? I use VirtualBox and not the native app. We are the product of random evolution. |
Joined: 2 Sep 04 Posts: 450 Credit: 171,060,635 RAC: 25,360 |
|
Joined: 18 Dec 15 Posts: 1571 Credit: 66,640,424 RAC: 166,060 |
I have a 2-core task that has been running for 16:20 hours now (console 2 shows 221 tasks so far). The 4th line from the top in console 3, "mem", shows the following values: 7.276.828k total - 7.133.176k used - 143.900k free - 67.416k buffers. The "used" value is increasing continuously, hence the "free" value is falling continuously. What will happen if the free memory is used up? I guess the task will break off. I was hoping that the memory value of 7300MB in the app_config.xml would be sufficient in any case, which is obviously not so this time :-( |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 20 |
Hi all, Something has gone wrong in our task submission; it's not intentional to have such long WUs! I'll consult with the experts tomorrow, but in the meantime I've increased the disk requirement to 8GB so that if you get one of these monster WUs it has a chance to complete. |
Joined: 18 Dec 15 Posts: 1571 Credit: 66,640,424 RAC: 166,060 |
David, can you tell how many events task 158823734 is scheduled to contain? Right now, the count is at 267. If the total is 1000, then I guess it wouldn't make sense to continue, because I would very likely run into the "disk-bound" problem, right? Besides that, I would most probably run out of memory, too. |
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
David, can you tell how many events task 158823734 is scheduled to contain? Where do you see tasks with the id you mentioned in the queue right now? I assume you are talking about the 12236583 tasks, because these are the "monster tasks". My assumption (I don't know if this calculation is actually correct): if you look here: https://bigpanda.cern.ch/task/?jeditaskid=12236583 you can see that there are 1000000 events to process with 60 input files, hence 1000000/60 ≈ 16667 events/WU. You can compare that with task id 12236561, for example ("normal tasks"): https://bigpanda.cern.ch/task/?jeditaskid=12236561. Here 9965000/199300 = 50 events/WU. |
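The estimate in the post above can be reproduced with a few lines of Python. This is only a sketch of that poster's own assumption (one work unit per input file, events spread evenly); the function name `events_per_wu` is made up for illustration, and the thread itself leaves open whether the assumption holds.

```python
def events_per_wu(total_events: int, input_files: int) -> float:
    """Average events per work unit, ASSUMING one WU per input file."""
    if input_files <= 0:
        raise ValueError("input_files must be positive")
    return total_events / input_files


# Figures quoted in the post for the two batches.
monster = events_per_wu(1_000_000, 60)       # batch 12236583
normal = events_per_wu(9_965_000, 199_300)   # batch 12236561

print(f"batch 12236583: about {monster:.0f} events per WU")
print(f"batch 12236561: about {normal:.0f} events per WU")
```

Under that assumption the two batches differ by a factor of more than 300 in events per work unit.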
Joined: 18 Dec 15 Posts: 1571 Credit: 66,640,424 RAC: 166,060 |
Where do you see tasks with the id you mentioned in the queue right now? I see this in my tasks list. If in that list I click on this task, the following page opens: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158823734 |
Joined: 18 Dec 15 Posts: 1571 Credit: 66,640,424 RAC: 166,060 |
The situation of the "monster task" at this moment is as follows: Runtime 1 day and 7 hours. In console_2, the event count somehow got stuck at 305; this is the highest value that has been shown since early this morning. No increase since. In console_3, under "mem", of the reserved 7.276.828k some 129.000k are shown as "free". In Windows Explorer, the image size shown in the slot directory is 4.576.256 KB. So, any advice from the experts as to how I should proceed? Is there a chance that the task can complete? Or will the reserved memory be exhausted soon, and/or will I run into the rsc_disk_bound problem? I would hate to block 2 CPU cores for another day or so and end up with an invalid task. Edit: What I forgot to mention: although this is a 2-core task, 3 athena.py processes are running (as seen in console_3). |
Joined: 2 Sep 04 Posts: 450 Credit: 171,060,635 RAC: 25,360 |
Erich56, no one knows what will happen with your task. If you are here for the science, let it go. Did you make a change regarding rsc_disk_bound? If not, your task will most likely fail. But even if your task really finishes, it is not guaranteed that the validator will mark your result as valid. |
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
Where do you see tasks with the id you mentioned in the queue right now? OK. We are talking about two different things. The task id you mentioned is a "result id", and from it you cannot see which actual task id you are running. To see the task id (which determines the duration of the WU) before the job has finished, you have to look into the logs. When it's done you can see parts of the log files at the link you posted. |
Joined: 18 Dec 15 Posts: 1571 Credit: 66,640,424 RAC: 166,060 |
Erich56, Thanks, Yeti, for your reply. That's about what I am guessing, too. I will let it go and see what happens. Please let me know how the change regarding rsc_disk_bound can be made (although, for this task, it may be too late?) |
Joined: 14 Jan 10 Posts: 1168 Credit: 7,217,203 RAC: 2,072 |
@Erich: You can find the taskID in stderr.txt in the slot where your BOINC task is running, but with the PandaID you can find the number of events as a job parameter. E.g.: https://bigpanda.cern.ch/job?pandaid=3650613998 Replace the number in the URL with your PandaID and search for --maxEvents= Changing <rsc_disk_bound>6000000000.000000</rsc_disk_bound> to <rsc_disk_bound>60000000000.000000</rsc_disk_bound> (extra 0) is tricky, but not impossible. When you have several VMs running, suspend them one after another (with "Leave in memory" off): wait until the one being suspended has the state 'Saved' before you suspend the next. After all have been saved, stop the BOINC client (not only BOINC Manager) and edit client_state.xml. Increase the rsc_disk_bound for your problematic task (or for all ATLAS tasks) and save the file. Make sure you use a plain ASCII text editor like Notepad. Restart BOINC and resume the VMs at intervals. |
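The two text manipulations described in the post above can be sketched in Python. This is a hedged illustration, not an official BOINC tool: the helper names `raise_disk_bound` and `find_max_events` are invented here, the code works on plain strings only, and it assumes the `<rsc_disk_bound>` element looks exactly as quoted in the post. Note that it changes every `rsc_disk_bound` in the text it is given, not only those of ATLAS tasks. As the post says, client_state.xml should only be touched while the BOINC client is stopped, and keeping a backup copy is prudent.

```python
import re

# Matches the element exactly as quoted in the post above.
_DISK_BOUND = re.compile(r"(<rsc_disk_bound>)([0-9.]+)(</rsc_disk_bound>)")


def raise_disk_bound(xml_text, factor=10.0):
    """Return xml_text with every rsc_disk_bound value multiplied by factor."""
    def bump(match):
        new_value = float(match.group(2)) * factor
        # Keep the six-decimal formatting seen in the quoted element.
        return f"{match.group(1)}{new_value:.6f}{match.group(3)}"
    return _DISK_BOUND.sub(bump, xml_text)


def find_max_events(job_parameters):
    """Return the integer after --maxEvents=, or None if the flag is absent."""
    match = re.search(r"--maxEvents=(\d+)", job_parameters)
    return int(match.group(1)) if match else None


# The element quoted in the post; a factor of 10 adds the "extra 0".
sample = "<rsc_disk_bound>6000000000.000000</rsc_disk_bound>"
print(raise_disk_bound(sample))

# Hypothetical parameter string; real job parameters are much longer.
print(find_max_events("some-command --maxEvents=1000"))
```

To apply the first helper to a real file, one would read the file's text, pass it through `raise_disk_bound`, and write the result back, all with the client stopped.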
Joined: 18 Dec 15 Posts: 1571 Credit: 66,640,424 RAC: 166,060 |
1 hour ago, I made the change in rsc_disk_bound as described by Crystal Pellet (thanks for that). In fact, it was not really complicated. What I saw in the client_state.xml right under the rsc_disk_bound line was the rsc_memory_bound entry, which, interestingly enough, showed the 4.400MB default value, and not the higher value (7.200MB) from the app_config.xml. So I decided not to make any change (=increase) there, as this would most probably not have any effect. Is there any other way to increase the 7.200MB value for this current task? BTW, what somewhat bothers me is that console_2 still shows 305 as the highest event number, so no change since early morning. Could this mean that the task is running in some kind of endless loop? |
Joined: 2 Sep 04 Posts: 450 Credit: 171,060,635 RAC: 25,360 |
BTW, what somewhat bothers me is that console_2 still shows 305 as the highest event number, so no change since early morning. Could this mean that the task is running in some kind of endless loop? Nope! It looks like this output is sorted by hours, as for me it is 23:59:xx. But when the screen refreshes I can see higher event numbers scrolling through very fast. My guess is that the time is now 1 day, xx hours and that this is a sorting problem on the actual screen. |
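The guess in the post above, that the display orders entries by their time text, can be illustrated with a small Python sketch. Everything here is assumed for illustration: the thread does not show the real console format, so the timestamp strings, the event numbers after 305, and the helper `elapsed_seconds` are hypothetical. The point is only that comparing elapsed times as text puts "23:59:10" after "1 day, ..." because the character '2' sorts after '1'.

```python
# Hypothetical (elapsed time, event number) pairs; the real console
# format is not shown in the thread.
rows = [
    ("23:59:10", 305),
    ("1 day, 00:05:00", 306),
    ("1 day, 07:00:00", 420),
]


def elapsed_seconds(stamp):
    """Convert 'HH:MM:SS' or 'N day(s), HH:MM:SS' to seconds."""
    days = 0
    if "day" in stamp:
        day_part, stamp = stamp.split(",", 1)
        days = int(day_part.split()[0])
    hours, minutes, seconds = (int(part) for part in stamp.strip().split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Comparing the raw text picks the stale entry from before midnight.
latest_by_text = max(rows, key=lambda row: row[0])

# Comparing the parsed duration picks the genuinely latest entry.
latest_by_time = max(rows, key=lambda row: elapsed_seconds(row[0]))

print("latest by text:", latest_by_text)
print("latest by time:", latest_by_time)
```

If the console behaves like the text comparison, a counter that appears stuck at 305 would be consistent with the task still making progress.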
©2023 CERN