Thread 'WOW 1000 / 5000 events in one WU ? !'

Author	Message
Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 490	Message 32652 - Posted: 6 Oct 2017, 13:28:44 UTC Hi, just found this: it is a 5-core task, but I can see 6 athena.py processes. How many events are really in it? Supporting BOINC, a great concept!

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1561 Credit: 10,121,513 RAC: 1,559	Message 32655 - Posted: 6 Oct 2017, 15:58:44 UTC - in response to Message 32652. "How many events are really in it?" Knowing the taskID (in stderr.txt) would help.

Erich56 Joined: 18 Dec 15 Posts: 1993 Credit: 164,075,306 RAC: 115,365	Message 32657 - Posted: 6 Oct 2017, 16:39:20 UTC Yeti, how long has this task been running so far?

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1561 Credit: 10,121,513 RAC: 1,559	Message 32659 - Posted: 6 Oct 2017, 17:34:23 UTC - in response to Message 32657. "Yeti, how long did this task run so far?" The athena.py processes in his picture show 3358 minutes, i.e. about 56 hours.

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 490	Message 32660 - Posted: 6 Oct 2017, 18:03:58 UTC It is/was this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158522058 It has already finished with the error "Disk usage limit exceeded". Supporting BOINC, a great concept!

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1561 Credit: 10,121,513 RAC: 1,559	Message 32661 - Posted: 6 Oct 2017, 19:46:52 UTC - in response to Message 32660. "It is/was this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158522058 It has already finished with: Disk usage limit exceeded" Yeah, that's one of the tasks from batch 12236583 that I mentioned in this thread. The task should have finished after 1000 events. I had the disk limit error after I suspended the task and a 2GB snapshot file was written into the slot directory. In your case it seems the disk limit was exceeded while the task was still running. It looks like only the native application will manage to finish those tasks, or one has to manipulate rsc_disk_bound at an early stage of such a task.

HerveUAE Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 32663 - Posted: 7 Oct 2017, 4:44:58 UTC - in response to Message 32661. "It looks like only the native application will manage to finish those tasks, or one has to manipulate rsc_disk_bound at an early stage of such a task." So should I abort the tasks that I have been running for more than 3 days? I use VirtualBox and not the native app. We are the product of random evolution.

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 490	Message 32759 - Posted: 10 Oct 2017, 15:23:57 UTC - in response to Message 32661. "... or one has to manipulate rsc_disk_bound at an early stage of such a task." I just gave this a try: I changed rsc_disk_bound from 6xxxxxxxx to 60xxxxxxxx. The WU is now up and running again; I'm looking forward to seeing what happens. Supporting BOINC, a great concept!

Erich56 Joined: 18 Dec 15 Posts: 1993 Credit: 164,075,306 RAC: 115,365	Message 32760 - Posted: 10 Oct 2017, 15:52:19 UTC Last modified: 10 Oct 2017, 16:19:26 UTC I have a 2-core task that has been running for 16:20 hours now (console 2 shows 221 events so far). The 4th line from the top in console 3, "mem", shows the following values: 7.276.828k total - 7.133.176k used - 143.900k free - 67.416k buffers. The "used" value is increasing continuously, hence the "free" value is falling continuously. What will happen if the free memory is used up? I guess the task will break off. I was hoping that the memory value of 7300MB in the app_config.xml would be sufficient in any case, which is obviously not so this time :-(
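To put the console figures above in proportion, here is a minimal Python sketch that converts the dotted values from the "mem" line into plain numbers and computes the share of memory shown as free. The helper name `parse_k` is hypothetical, and the dots are assumed to be thousands separators, as written in the post.

```python
def parse_k(value: str) -> int:
    """Convert a value such as '7.276.828k' to an integer number of k.

    Assumption: the dots are thousands separators, as typed in the post.
    """
    return int(value.rstrip("k").replace(".", ""))


# Figures quoted from console 3 in the post above.
total = parse_k("7.276.828k")
used = parse_k("7.133.176k")
free = parse_k("143.900k")

free_share = free / total
print(f"free: {free} k of {total} k ({free_share:.1%})")
```

The values were typed by hand from a live display, so they need not add up exactly; the sketch only shows how small the free share already is.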

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 32764 - Posted: 10 Oct 2017, 19:20:13 UTC Hi all, Something has gone wrong in our task submission; it's not intentional to have such long WUs! I'll consult with the experts tomorrow, but in the meantime I've increased the disk requirement to 8GB so that if you get one of these monster WUs it has a chance to complete.

Erich56 Joined: 18 Dec 15 Posts: 1993 Credit: 164,075,306 RAC: 115,365	Message 32765 - Posted: 10 Oct 2017, 19:31:41 UTC David, can you tell how many events task 158823734 is scheduled to contain? Right now, the count is at 267. If the total is 1000, then I guess it wouldn't make sense to continue, because I would very likely run into the "disk bound" problem, right? Besides that, I would most probably run out of memory, too.

gyllic Joined: 9 Dec 14 Posts: 202 Credit: 2,660,212 RAC: 0	Message 32770 - Posted: 11 Oct 2017, 3:57:44 UTC - in response to Message 32765. "David, can you tell how many events task 158823734 is scheduled to contain?" Where do you see tasks with the ID you mentioned in the queue right now? I assume you are talking about the 12236583 tasks, because these are the "monster tasks". My assumption (I don't know if this calculation is actually correct): if you look here: https://bigpanda.cern.ch/task/?jeditaskid=12236583 you can see that there are 1000000 events to process with 60 input files, hence 1000000/60 ≈ 16666 events/WU. You can compare that with task ID 12236561, for example ("normal tasks"): https://bigpanda.cern.ch/task/?jeditaskid=12236561 Here 9965000/199300 = 50 events/WU.
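The estimate in the post above can be reproduced with a few lines of Python. This is only a sketch of the same arithmetic; it carries over the poster's own unconfirmed assumption that each input file corresponds to one work unit, and the function name is hypothetical.

```python
def events_per_wu(total_events: int, input_files: int) -> float:
    """Average events per work unit, assuming one work unit per input file."""
    return total_events / input_files


# Figures quoted in the post for the two BigPanda tasks.
monster = events_per_wu(1_000_000, 60)      # jeditaskid 12236583
normal = events_per_wu(9_965_000, 199_300)  # jeditaskid 12236561

print(f"monster batch: about {monster:.0f} events/WU")
print(f"normal batch:  about {normal:.0f} events/WU")
print(f"ratio: about {monster / normal:.0f}x")
```

If the assumption holds, a "monster" work unit would carry a few hundred times as many events as a normal one, which is consistent with the multi-day runtimes reported in this thread.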

Erich56 Joined: 18 Dec 15 Posts: 1993 Credit: 164,075,306 RAC: 115,365	Message 32771 - Posted: 11 Oct 2017, 4:39:13 UTC - in response to Message 32770. "Where do you see tasks with the ID you mentioned in the queue right now?" I see this in my tasks list. If I click on this task in that list, the following page opens: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158823734

Erich56 Joined: 18 Dec 15 Posts: 1993 Credit: 164,075,306 RAC: 115,365	Message 32772 - Posted: 11 Oct 2017, 6:43:45 UTC Last modified: 11 Oct 2017, 6:46:23 UTC The situation of the "monster task" at this moment is as follows: Runtime is 1 day and 7 hours. In console_2, the event count seems stuck at 305; this is the highest value shown since early this morning, with no increase since. In console_3, under "mem", some 129.000k of the reserved 7.276.828k are shown as "free". In Windows Explorer, the image size shown in the slot directory is 4.576.256 KB. So, any advice from the experts as to how I should proceed? Is there a chance that the task can complete? Or will the reserved memory be exhausted soon, and/or will I run into the rsc_disk_bound problem? I would hate to block 2 CPU cores for another day or so and end up with an invalid task. Edit: What I forgot to mention: although this is a 2-core task, 3 athena.py processes are running (as seen in console_3).

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 490	Message 32773 - Posted: 11 Oct 2017, 7:00:03 UTC Last modified: 11 Oct 2017, 7:00:22 UTC Erich56, no one knows what will happen with your task. If you are here for the science, let it run. Did you make a change regarding rsc_disk_bound? If not, your task will most likely fail. But even if your task really finishes, it is not guaranteed that the validator will mark your result as valid. Supporting BOINC, a great concept!

gyllic Joined: 9 Dec 14 Posts: 202 Credit: 2,660,212 RAC: 0	Message 32774 - Posted: 11 Oct 2017, 7:01:21 UTC - in response to Message 32771. "Where do you see tasks with the ID you mentioned in the queue right now?" "I see this in my tasks list. If I click on this task in that list, the following page opens: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158823734" OK. We are talking about two different things. The ID you mentioned is a "result ID", and from it you cannot see which actual task ID you are running. To see the task ID (which decides the duration of the WU) before the job has finished, you have to look into the logs. When it's done, you can see parts of the log files at the link you posted.

Erich56 Joined: 18 Dec 15 Posts: 1993 Credit: 164,075,306 RAC: 115,365	Message 32775 - Posted: 11 Oct 2017, 7:12:35 UTC - in response to Message 32773. "Erich56, no one knows what will happen with your task. If you are here for the science, let it run. Did you make a change regarding rsc_disk_bound? If not, your task will most likely fail. But even if your task really finishes, it is not guaranteed that the validator will mark your result as valid." Thanks, Yeti, for your reply. That's about what I am guessing, too. I will let it run and see what happens. Please let me know how the change regarding rsc_disk_bound can be made (although, for this task, it may be too late?)

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1561 Credit: 10,121,513 RAC: 1,559	Message 32777 - Posted: 11 Oct 2017, 7:40:12 UTC Last modified: 11 Oct 2017, 7:51:50 UTC @Erich: You can find the taskID in stderr.txt in the slot where your BOINC task is running, but with the PandaID you can find the number of events as a job parameter. E.g.: https://bigpanda.cern.ch/job?pandaid=3650613998 Replace the number in the URL with your PandaID and search for --maxEvents= Changing <rsc_disk_bound>6000000000.000000</rsc_disk_bound> to <rsc_disk_bound>60000000000.000000</rsc_disk_bound> (one extra 0) is tricky, but not impossible. If you have several VMs running, suspend them one after another (with "Leave in memory" off): wait until the one being suspended has reached the state 'Saved' before you suspend the next. After all of them have been saved properly, stop the BOINC client (not only BOINC Manager) and edit client_state.xml. Increase rsc_disk_bound for your problematic task (or for all ATLAS tasks) and save the file. Make sure you use a plain-text editor such as Notepad. Restart BOINC and resume the VMs one at a time, with some interval between them.
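The two manual steps described above (looking up --maxEvents= for a PandaID and adding a zero to rsc_disk_bound) can be sketched in Python. This is an illustration, not a tool used in this thread: the function names are hypothetical, the sample parameter string and XML fragment are made up for the demonstration, and the code works on text in memory rather than on a real client_state.xml. As in the post, the BOINC client would have to be stopped before the real file is touched.

```python
import re

# URL pattern taken from the example link in the post above.
PANDA_JOB_URL = "https://bigpanda.cern.ch/job?pandaid={panda_id}"
MAX_EVENTS = re.compile(r"--maxEvents=(\d+)")
DISK_BOUND = re.compile(r"(<rsc_disk_bound>)([0-9.]+)(</rsc_disk_bound>)")


def job_url(panda_id: int) -> str:
    """Build the BigPanda job page URL for a given PandaID."""
    return PANDA_JOB_URL.format(panda_id=panda_id)


def max_events(job_parameters: str):
    """Return the --maxEvents value found in a job parameter string, or None."""
    match = MAX_EVENTS.search(job_parameters)
    return int(match.group(1)) if match else None


def raise_disk_bound(xml_text: str, factor: float = 10.0) -> str:
    """Multiply every rsc_disk_bound value in the given text by `factor`.

    Only the numbers are rewritten; the rest of the text is left untouched.
    """
    def scale(match):
        new_value = float(match.group(2)) * factor
        return f"{match.group(1)}{new_value:.6f}{match.group(3)}"

    return DISK_BOUND.sub(scale, xml_text)


# Hypothetical inputs, for illustration only.
sample_params = "--maxEvents=1000"
sample_xml = (
    "<workunit>\n"
    "    <rsc_disk_bound>6000000000.000000</rsc_disk_bound>\n"
    "</workunit>\n"
)

print(job_url(3650613998))
print(max_events(sample_params))
print(raise_disk_bound(sample_xml))
```

Working on a copy of the text first makes it easy to compare the result with the original before anything is written back.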

Erich56 Joined: 18 Dec 15 Posts: 1993 Credit: 164,075,306 RAC: 115,365	Message 32778 - Posted: 11 Oct 2017, 9:28:03 UTC One hour ago, I made the change in rsc_disk_bound as described by Crystal Pellet (thanks for that). In fact, it was not really complicated. What I saw in client_state.xml right under the rsc_disk_bound line was the rsc_memory_bound entry, which, interestingly enough, showed the 4.400MB default value and not the higher value (7.200MB) from app_config.xml. So I decided not to make any change (i.e. increase) there, as this would most probably not have any effect. Is there any other way to increase the 7.200MB value for this current task? BTW, what bothers me somewhat is that console_2 still shows 305 as the highest event number, so no change since early morning. Could this mean that the task is running in some kind of endless loop?

Yeti Volunteer moderator Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 490	Message 32779 - Posted: 11 Oct 2017, 9:37:14 UTC - in response to Message 32778. "BTW, what bothers me somewhat is that console_2 still shows 305 as the highest event number, so no change since early morning. Could this mean that the task is running in some kind of endless loop?" Nope! It looks like this output is sorted by hours, as for me it shows 23:59:xx. But when the screen refreshes I can see higher event numbers scrolling through very fast. My guess is that the time is now 1 day, xx hours, and that this causes a sorting problem on the screen. Supporting BOINC, a great concept!