Message boards : Sixtrack Application : Very long Tasks
maeax

Joined: 2 May 07
Posts: 2243
Credit: 173,902,375
RAC: 1,652
Message 39139 - Posted: 17 Jun 2019, 15:57:51 UTC

1.9 PetaFLOPs and more than 24 hours duration time. Is this ok?
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=116598728
w-c1_job.B2topenergy.b6offIRon_c1.1707__3__s__62.31_60.32__14.1_16.1__7__79.5_1_sixvf_boinc1174 or
workspace1_HEL_Qp_2_MO_m150_2t_3s__14__s__62.31_60.32__4_6__6__35_1_sixvf_boinc4954

Harri Liljeroos
Joined: 28 Sep 04
Posts: 728
Credit: 49,144,463
RAC: 29,814
Message 39142 - Posted: 17 Jun 2019, 17:58:01 UTC

I've got a couple of those as well; estimated runtime about 33 hours. One has been running for 50 minutes at <5% progress, which works out to about 18.8 hours if progress keeps steady.

Harri Liljeroos
Joined: 28 Sep 04
Posts: 728
Credit: 49,144,463
RAC: 29,814
Message 39143 - Posted: 17 Jun 2019, 19:55:30 UTC - in response to Message 39142.  

Looks like these aren't that long in the end. Current progress indicates about 7.5 hours of runtime.

Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 39144 - Posted: 17 Jun 2019, 20:28:31 UTC
Last modified: 17 Jun 2019, 20:40:19 UTC

Longest one for me was 15 hrs 29 mins on the 2.6 GHz i5, which ran 8 hrs 19 mins on the wingman's 2.8 GHz i7. I've also had a few 14+ hr tasks. This is all good, as it shows a stable (therefore more useful) beam configuration rather than short runners which hit the wall prematurely.

Just followed the link in the original post which points to a WU of only 19 seconds?
Is the link wrong? Did the WU eventually finish?

maeax
Joined: 2 May 07
Posts: 2243
Credit: 173,902,375
RAC: 1,652
Message 39145 - Posted: 18 Jun 2019, 5:22:00 UTC
Last modified: 18 Jun 2019, 5:35:11 UTC

First long runner ended after 10 hours with
196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED
https://lhcathome.cern.ch/lhcathome/result.php?resultid=232584711
Sat 15 Jun 2019 09:16:02 CEST | | OS: Linux CentOS Linux: CentOS Linux 7 (Core) [3.10.0-693.el7.x86_64|libc 2.17 (GNU libc)]
Sat 15 Jun 2019 09:16:02 CEST | | Memory: 8.17 GB physical, 1.90 GB virtual
Sat 15 Jun 2019 09:16:02 CEST | | Disk: 16.08 GB total, 11.11 GB free

Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39147 - Posted: 18 Jun 2019, 8:15:50 UTC - in response to Message 39145.  

Hi all,
thanks for noticing this and giving feedback.

The WU named "w-c1_job.B2topenergy.b6offIRon_c1.1707__7__s__62.31_60.32__8.1_10.1__7__21_1_sixvf_boinc2610.zip" belongs to a study where tracking is performed for 10^7 turns - corresponding to 800 s of beam in the LHC.
In general we simulate 10^5 or 10^6; 10^7 is pretty unusual, even though it is a time scale we would like to hit sooner or later.
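As a rough cross-check of the "10^7 turns ~ 800 s" figure (a sketch only, assuming the nominal LHC circumference of 26.659 km and protons at essentially the speed of light):

```python
# Cross-check: how long 10^7 turns corresponds to in real beam time.
c = 299_792_458.0                 # speed of light, m/s
circumference = 26_659.0          # nominal LHC circumference, m
t_rev = circumference / c         # revolution time, ~89 microseconds
beam_seconds = 10**7 * t_rev      # ~889 s, same order as the 800 s quoted
print(round(beam_seconds))
```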
I think the reason for the EXIT_DISK_LIMIT_EXCEEDED error is that the user requested a dump of the beam coordinates every 50k turns, and the file collecting those data grows until the total disk space we request (~200 MB) is filled up.
I fear we will have to kill those tasks and resubmit them with an updated result template file - I am running the same task locally to better estimate the requirement.
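A back-of-the-envelope sketch of how such a dump file can exhaust the whole allowance (the per-dump size below is an assumed figure for illustration, not a measured one):

```python
# Why a dump every 50k turns can exhaust a ~200 MB disk allowance.
# mb_per_dump is an assumed figure for illustration only.
turns = 10**7                   # total turns tracked in this study
dump_interval = 50_000          # user requested a dump every 50k turns
dumps = turns // dump_interval  # 200 dumps over the full run
mb_per_dump = 1.0               # assumed average size of one dump, MB
total_mb = dumps * mb_per_dump  # ~200 MB: right at the requested limit
print(dumps, total_mb)
```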

For the other task, I cannot spot anything odd at first sight - I have downloaded it and am running it locally to see if there is anything wrong with it.
I will keep you posted,
Cheers,
A.

Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 39148 - Posted: 18 Jun 2019, 12:13:56 UTC
Last modified: 18 Jun 2019, 13:03:08 UTC

Hi Alessio,
I hadn't spotted the ... _7_...
I have w-c2_job.B2topenergy.b6onIRon_c2.1707__8__s__62.31_60.32__4.1_6.1__7__84_1_sixvf_boinc2947, which is currently 5% in after 5.5 hrs with a little over 4 days remaining. My hosts run 24/7, so that's fine. Slot size is 11.9 MB, the same as other "normal" WUs, so no size issue yet. Is there a duration limit as well as a size limit? Is there anything I could edit at this end to increase the limits before they are reached?

[also, hiding on the other machine w-c1_job.B2topenergy.b6offIRon_c1.1707__10__s__62.31_60.32__8.1_10.1__7__69_1_sixvf_boinc3881 at 22% after 15hrs]

Found what I was looking for, hopefully:
init_data
Would it be possible to exit BOINC and simply edit
<rsc_disk_bound>200000000.000000</rsc_disk_bound> to place an extra zero in there?

Simple answer: NO - it gets reset to the original value, but at least I didn't kill the task. I did the same edit while the task was active and that was accepted, but I don't know whether init_data gets read again while the job is active, so it might not have any effect.
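The edit in question could be scripted along these lines (a sketch only: the sample string is made up to mirror the value above, and as noted the BOINC client rewrites the real file from its own state, so the change does not stick):

```python
# Sketch of the edit discussed above: multiply <rsc_disk_bound> by 10,
# i.e. add one zero. Hypothetical content; the client resets the real file.
import re

sample = "<rsc_disk_bound>200000000.000000</rsc_disk_bound>"

def bump_disk_bound(xml_text, factor=10.0):
    def repl(match):
        new_value = float(match.group(1)) * factor
        return f"<rsc_disk_bound>{new_value:.6f}</rsc_disk_bound>"
    return re.sub(r"<rsc_disk_bound>([0-9.]+)</rsc_disk_bound>", repl, xml_text)

print(bump_disk_bound(sample))  # one extra zero: 2000000000.000000
```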

Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39149 - Posted: 18 Jun 2019, 13:39:56 UTC - in response to Message 39148.  

Hi Ray,

Interesting discovery, but honestly I don't have direct experience on this, so I cannot tell you much.
The only meaningful info I have found so far is not super-encouraging:
https://boinc.mundayweb.com/wiki/index.php?title=Maximum_disk_space_exceeded
Maybe the IT guys can give some further insights.

Concerning the max time: we give a week of maximum return time before a task is declared obsolete and a brand-new copy is issued to another volunteer. If the task lasts less than a day, the limit is fine. If it takes 4 days, it is evidently a bit short...
It is somewhat strange that your task at 5% takes such a small amount of disk space - it should grow roughly 10 MB for every 1% ...
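If that ~10 MB per 1% figure holds, the 200 MB bound would run out long before completion:

```python
# At ~10 MB of dump data per 1% progress, a 200 MB disk bound is
# exhausted at about 20% progress, well before the task finishes.
mb_per_percent = 10.0
disk_bound_mb = 200.0
progress_at_limit = disk_bound_mb / mb_per_percent
print(progress_at_limit)  # 20.0 (% progress when the limit is hit)
```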

Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 39151 - Posted: 18 Jun 2019, 20:20:20 UTC
Last modified: 18 Jun 2019, 20:22:42 UTC

I may have gotten the slot size wrong, as it was up to 55 MB when I had to leave, and it shows a peak disk usage of 60 MB when it was cancelled; but the other one cited shows a peak usage of only 11.98 MB after its 25?% and 18 hrs.
The other topenergy ones I have had were of higher amplitude and therefore finished much sooner.

I knew there was a return deadline, but I didn't know if there was a job-duration deadline (it would appear not), such as is in place for Theory jobs: those have a return deadline set but will also self-terminate after 18 hrs of runtime, a limit the user can increase by editing the Theory xml.

SAHJ@H
Joined: 8 Aug 05
Posts: 1
Credit: 637,785
RAC: 0
Message 39152 - Posted: 19 Jun 2019, 7:42:15 UTC
Last modified: 19 Jun 2019, 7:44:44 UTC

Thanks for killing the task w-c2_job.B2topenergy.b6onIRon_c2.1707__5__s__62.31_60.32__8.1_10.1__7__58.5_1_sixvf_boinc1809_0 that ended normally on 18 Jun 2019, 20:04:07 UTC after 18.5 hrs of CPU time.

Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39153 - Posted: 19 Jun 2019, 12:40:50 UTC - in response to Message 39152.  

I sincerely apologize for that - it was not a decision taken easily or instantly, but I wanted to avoid having another 5 volunteers per long WU complaining.

AlphaC
Joined: 6 Sep 13
Posts: 5
Credit: 1,286,288
RAC: 0
Message 39198 - Posted: 26 Jun 2019, 20:52:42 UTC - in response to Message 39153.  
Last modified: 26 Jun 2019, 20:56:03 UTC

Is there still a recurring problem?

Just aborted 5 tasks with an estimated 155 days of compute time.

All other tasks have an ETA of roughly 2 hours and 30 minutes.

At first I thought my memory settings might have become unstable due to the BCLK modification, but I ran Google's stressapptest for 3 hours to confirm stability, and Prime95 didn't have any errors either.

Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39199 - Posted: 27 Jun 2019, 7:22:24 UTC - in response to Message 39198.  

Could you point me to the tasks concerned?
I cannot look them up - your computers are 'hidden'.

AlphaC
Joined: 6 Sep 13
Posts: 5
Credit: 1,286,288
RAC: 0
Message 39206 - Posted: 27 Jun 2019, 17:09:27 UTC - in response to Message 39199.  

Sorry about that - I PMed you the workunits in question.

I keep my PCs hidden for security reasons; I figure that admins can use the backend.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 254,137,209
RAC: 54,451
Message 39207 - Posted: 27 Jun 2019, 17:38:46 UTC - in response to Message 39206.  

I keep PCs hidden for security reasons

Of course it's your decision whether to hide your hosts or not, but you shouldn't tell anybody that hiding them increases security.
That's just a myth.

AlphaC
Joined: 6 Sep 13
Posts: 5
Credit: 1,286,288
RAC: 0
Message 39211 - Posted: 27 Jun 2019, 19:33:43 UTC - in response to Message 39207.  

Going completely off-topic, but please tell me how it is a myth. If someone wants to find the host, they can match credits when you use only one host - it just takes extra steps.

I've run this app on at least 5 different machines, so that doesn't apply to me as much.

Putting a host up with your exact kernel / glibc / OS version is less obscure, especially if you're not running in a VM. And even if you are running in a VM, Intel hardware hasn't been fully patched yet.

Win10
Joined: 14 Feb 17
Posts: 1
Credit: 351,918
RAC: 0
Message 39214 - Posted: 27 Jun 2019, 21:52:57 UTC
Last modified: 27 Jun 2019, 22:10:43 UTC

I got a very long task yesterday that failed after reaching 200 MB on the local HDD.

The name of the WU is
w-c4_job.B1topenergy.b6offIRon_c4.1707__5__s__62.31_60.32__12.1_14.1__7__9_1_sixvf_boinc1894_3
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=116651922

maeax
Joined: 2 May 07
Posts: 2243
Credit: 173,902,375
RAC: 1,652
Message 39233 - Posted: 30 Jun 2019, 15:49:56 UTC

Long runner in sixtracktest with 40 hours of CPU time and 1.9 PetaFLOPs - successful!
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=117930714

Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39239 - Posted: 1 Jul 2019, 13:43:03 UTC - in response to Message 39233.  

Hello maeax and Win10,

I got a very long task yesterday, that failed after reaching 200 MB size on the local HDD.

The name of the WU is
w-c4_job.B1topenergy.b6offIRon_c4.1707__5__s__62.31_60.32__12.1_14.1__7__9_1_sixvf_boinc1894_3
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=116651922

I apologize for that - that task belongs to the series of extremely long jobs (10^7 turns) which were submitted with a wrong disk space request. This particular task was not caught by the sudden kill I announced on the MB:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5064
The amplitude range covered by your job was outside the range selected for killing - sorry, I tried to kill as many as possible while trying to minimise the number of upset volunteers...


Longrunner in sixtracktest with 40 hours Cpu and 1.9 PetaFlops successful!.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=117930714

That is a new batch of extremely long jobs, with the correct disk space request. I got one as well and crunched it correctly:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=117930647
I asked the user to submit a few of these jobs to sixtracktest, just to check that everything was correctly set up before making (again) a big mess.
I think he can proceed.

Thanks for the feedback, and keep up the good work!
Happy crunching,
A.

Filipe
Joined: 9 Aug 05
Posts: 36
Credit: 7,698,293
RAC: 0
Message 39246 - Posted: 3 Jul 2019, 15:19:28 UTC

Aren't those long tasks more suited to GPUs?

Long running times on CPUs.


©2024 CERN