Message boards : Sixtrack Application : Very long Tasks
Joined: 2 May 07 Posts: 2243 Credit: 173,902,375 RAC: 1,652
1.9 PetaFLOPS and a duration time of more than 24 hours. Is this OK? https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=116598728 w-c1_job.B2topenergy.b6offIRon_c1.1707__3__s__62.31_60.32__14.1_16.1__7__79.5_1_sixvf_boinc1174 or workspace1_HEL_Qp_2_MO_m150_2t_3s__14__s__62.31_60.32__4_6__6__35_1_sixvf_boinc4954
Joined: 28 Sep 04 Posts: 728 Credit: 49,144,463 RAC: 29,814
I've got a couple of those as well, with an estimated runtime of about 33 hours. One has been running for 50 minutes at <5% progress, which extrapolates to about 18.8 hours if progress keeps steady.
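(For reference, the kind of extrapolation used above is just a linear projection from elapsed time and the progress fraction reported by BOINC Manager. A minimal sketch; the figures below are illustrative, not taken from any particular task:)

```python
# Linear extrapolation of a task's total runtime from elapsed time and the
# progress fraction shown in BOINC Manager. Figures are illustrative only.
elapsed_minutes = 50.0
progress_fraction = 0.044   # just under 5%

total_minutes = elapsed_minutes / progress_fraction
remaining_minutes = total_minutes - elapsed_minutes

print(f"estimated total runtime: {total_minutes / 60:.1f} h "
      f"(remaining: {remaining_minutes / 60:.1f} h)")
```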
Joined: 28 Sep 04 Posts: 728 Credit: 49,144,463 RAC: 29,814
Looks like these aren't that long in the end. Current progress indicates about 7.5 hours of runtime.
Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0
The longest one for me was 15 hrs 29 mins on the 2.6 GHz i5; it ran 8 hrs 19 mins on the wingman's 2.8 GHz i7. I've also had a few 14+ hr tasks. This is all good, as it shows a stable (and therefore more useful) beam configuration rather than short runners which hit the wall prematurely.

I just followed the link in the original post and it points to a WU of only 19 seconds. Is the link wrong? Did the WU eventually finish?
Joined: 2 May 07 Posts: 2243 Credit: 173,902,375 RAC: 1,652
The first long-runner ended after 10 hours with error 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED: https://lhcathome.cern.ch/lhcathome/result.php?resultid=232584711

Sat 15 Jun 2019 09:16:02 CEST | | OS: Linux CentOS Linux: CentOS Linux 7 (Core) [3.10.0-693.el7.x86_64|libc 2.17 (GNU libc)]
Sat 15 Jun 2019 09:16:02 CEST | | Memory: 8.17 GB physical, 1.90 GB virtual
Sat 15 Jun 2019 09:16:02 CEST | | Disk: 16.08 GB total, 11.11 GB free
Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0
Hi all, thanks for noticing this and giving feedback.

The WU named "w-c1_job.B2topenergy.b6offIRon_c1.1707__7__s__62.31_60.32__8.1_10.1__7__21_1_sixvf_boinc2610.zip" belongs to a study where tracking is performed for 10^7 turns, corresponding to 800 s of beam in the LHC. In general we simulate 10^5 or 10^6 turns; 10^7 is pretty unusual, even though it is a time scale we would like to hit sooner or later.

I think the reason for the EXIT_DISK_LIMIT_EXCEEDED error is that the user requested a dump of the beam coordinates every 50k turns, and the file collecting those data grows until the total disk space we request (~200 MB) is filled up. I fear that we have to kill those tasks and resubmit them with an updated result template file - I am running the same task locally to better estimate the requirement.

For the other task, I cannot spot anything odd at first sight - I have downloaded it and am running it locally to see if there is anything wrong with it.

I will keep you posted. Cheers, A.
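(The disk-limit problem described above can be sized up with simple arithmetic: number of dumps times the data written per dump, compared against the disk bound in the result template. A minimal sketch; the bytes-per-dump figure is an assumed, illustrative value, not one taken from the actual study:)

```python
# Rough estimate of how large the coordinate-dump file of a 10^7-turn study
# could grow, compared with the disk space requested in the result template.
# The bytes-per-dump value is an assumed, illustrative figure.
turns_total = 10**7            # turns tracked in this study
dump_every = 50_000            # beam coordinates dumped every 50k turns
bytes_per_dump = 2 * 1024**2   # assumed size written per dump (illustrative)
disk_bound_bytes = 200 * 10**6 # ~200 MB requested in the result template

dump_count = turns_total // dump_every
dump_bytes = dump_count * bytes_per_dump
print(f"{dump_count} dumps, roughly {dump_bytes / 1e6:.0f} MB in total")
if dump_bytes > disk_bound_bytes:
    print("the dump file alone would exceed the requested disk bound")
```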
Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0
Hi Alessio, I hadn't spotted the ... _7_...

I have w-c2_job.B2topenergy.b6onIRon_c2.1707__8__s__62.31_60.32__4.1_6.1__7__84_1_sixvf_boinc2947, which is currently 5% in after 5.5 hrs, with a little over 4 days remaining. My hosts run 24/7, so that's fine. The slot size is 11.9 MB, the same as other "normal" WUs, so no size issue yet. Is there a duration limit as well as a size limit? Is there anything I could edit at this end to increase the limits before they are reached?

[Also, hiding on the other machine: w-c1_job.B2topenergy.b6offIRon_c1.1707__10__s__62.31_60.32__8.1_10.1__7__69_1_sixvf_boinc3881 at 22% after 15 hrs.]

Found what I was looking for, hopefully: init_data. Would it be possible to exit BOINC and simply edit <rsc_disk_bound>200000000.000000</rsc_disk_bound> to place an extra zero in there? Simple answer: NO, it gets reset to the original value, but at least I didn't kill it. I did the same edit while the task was active and it accepted that, but I don't know whether init_data gets read again while the job is active, so it might not have any effect.
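(For reference, the experiment described above amounts to rewriting the <rsc_disk_bound> value in the slot's init_data.xml. A minimal sketch of such an edit; the slot path is an assumption, and, as reported above, the BOINC client may simply reset the file, so this is not a supported way to raise the limit:)

```python
# Sketch: multiply the <rsc_disk_bound> value in a slot's init_data.xml by 10
# ("place an extra zero in there"). The slot path is an assumption, and the
# BOINC client may rewrite this file, so the edit may not stick.
import re
from pathlib import Path

init_data = Path("/var/lib/boinc-client/slots/0/init_data.xml")  # assumed path
text = init_data.read_text()

def bump(match: re.Match) -> str:
    value = float(match.group(1)) * 10
    return f"<rsc_disk_bound>{value:.6f}</rsc_disk_bound>"

init_data.write_text(
    re.sub(r"<rsc_disk_bound>([\d.]+)</rsc_disk_bound>", bump, text)
)
```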
Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0
Hi Ray, interesting discovery, but honestly I don't have direct experience with this, so I cannot tell you much. The only meaningful info I have found around is not super-encouraging: https://boinc.mundayweb.com/wiki/index.php?title=Maximum_disk_space_exceeded Maybe the IT guys can give some further insight.

Concerning the max time, we give a week of maximum return time before a task is declared obsolete and a brand new one is issued to another volunteer. If the task lasts less than a day, the limit is fine. If it takes 4 days, it is a bit short, evidently...

It is somewhat strange that your task at 5% takes so little disk space - it should grow by roughly 10 MB every 1% ...
Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0
I may have gotten the slot size wrong: it was up to 55 MB when I had to leave, and it shows a peak disk usage of 60 MB now that it has been cancelled, but the other one I cited shows a peak usage of only 11.98 MB after its 25?% and 18 hrs. The other topenergy ones I have had were of higher amplitude and therefore finished much sooner.

I knew there was a return deadline, but I didn't know whether there was also a job-duration deadline (it would appear not), such as is in place for Theory jobs, which have a return deadline set but will self-terminate after 18 hrs of runtime - a limit the user can increase by editing the Theory xml.
Joined: 8 Aug 05 Posts: 1 Credit: 637,785 RAC: 0
Thanks for killing the task w-c2_job.B2topenergy.b6onIRon_c2.1707__5__s__62.31_60.32__8.1_10.1__7__58.5_1_sixvf_boinc1809_0 that ended normally on 18 Jun 2019, 20:04:07 UTC after 18.5 hrs of CPU time.
Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0
I am terribly sorry for that - it was not a decision taken easily or instantly, but I wanted to avoid having another 5 volunteers per long WU complaining.
Joined: 6 Sep 13 Posts: 5 Credit: 1,286,288 RAC: 0
Is there still a recurring problem? I just aborted 5 tasks with an estimated compute time of 155 days. All my other tasks have an ETA of roughly 2 hours and 30 minutes. At first I thought my memory settings might have become unstable due to a BCLK modification, but I ran Google's stressapptest for 3 hours to confirm stability, and Prime95 didn't show any errors either.
Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0
Could you point me to the tasks concerned? I cannot see them - your computers are 'hidden'.
Joined: 6 Sep 13 Posts: 5 Credit: 1,286,288 RAC: 0
Sorry about that - I PMed you the workunits in question. I keep my PCs hidden for security reasons; I figure that admins can use the backend.
Joined: 15 Jun 08 Posts: 2534 Credit: 254,137,209 RAC: 54,451
"I keep PCs hidden for security reasons"

Of course it's your decision whether to hide your hosts or not, but you shouldn't tell anybody that hiding them increases security. That's just a myth.
Joined: 6 Sep 13 Posts: 5 Credit: 1,286,288 RAC: 0
Going completely off-topic, but please tell me how it is a myth. If someone wants to find the host, they can match credits if you use a single host, but that takes extra steps. I've run this app on at least 5 different machines, so that doesn't apply to me as much. Putting a host up that shows your exact kernel / glibc / OS version is less obscure, especially if you're not running in a VM. And even if you are running in a VM, Intel hardware hasn't been fully patched yet.
Joined: 14 Feb 17 Posts: 1 Credit: 351,918 RAC: 0
I got a very long task yesterday that failed after reaching 200 MB on the local HDD. The name of the WU is w-c4_job.B1topenergy.b6offIRon_c4.1707__5__s__62.31_60.32__12.1_14.1__7__9_1_sixvf_boinc1894_3: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=116651922
Joined: 2 May 07 Posts: 2243 Credit: 173,902,375 RAC: 1,652
A long-runner in sixtracktest with 40 hours of CPU time and 1.9 PetaFLOPS finished successfully! https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=117930714
Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0
Hello maex and Win10,

I apologise for that - that task belongs to the series of extremely long jobs (10^7 turns) which were submitted with a wrong request of disk space. This task in particular was not caught by the sudden kill I announced on the MB: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5064 The amplitude range covered by your job was outside the range targeted for killing - sorry, I tried to kill as many as possible while trying to minimise the number of upset volunteers...

That is a new batch of extremely long jobs, this time with the correct request of disk space. I got one as well and crunched it correctly: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=117930647 I asked the user to submit a few of these jobs on sixtracktest, just to check that everything was correctly set up before making (again) a big mess. I think he can proceed.

Thanks for the feedback, and keep up the good work! Happy crunching, A.
Joined: 9 Aug 05 Posts: 36 Credit: 7,698,293 RAC: 0
Aren't those long tasks more suited to GPUs, given the long running time on CPUs?