Message boards : Number crunching : Tasks "completing" at random percentages

MechaToaster
Joined: 17 Aug 17
Posts: 15
Credit: 179,253
RAC: 0
Message 36931 - Posted: 1 Oct 2018, 11:22:59 UTC

i apologize if this is the wrong place to ask or if this is a common thing; i've been away from these projects for a year and i forgot a lot of what is normal and what is not.
i'm having this problem(?) where some tasks (Theory/ATLAS/LHCb) run fine to anywhere from 10% to 50% to 75% done, then disappear from the BOINC Manager task window. when i check my tasks on this website, it shows them as completed and successfully validated with no errors (that i can see).
is this normal? the most recent one happened just now: a Theory job that stopped after maybe an hour or so of work. i was actually looking at the task window while eating breakfast and it just disappeared; it didn't pause or say ready to submit or anything, just vanished. the task is:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=207090669
https://lhcathome.cern.ch/lhcathome/result.php?resultid=207090946 (a different one than described above, but same scenario)

whether or not an error occurred here, it does seem to be different from other errors i've had. i'm on a newer PC than the one i was on before and i've been getting a lot more errors on this one than i did on my old one. the other tasks that fail always show an error message or appear on the website under "error" or "invalid". i thought it might be related to clock speeds/unstable overclocks; i remember back on my old PC (overclocked FX-8350), i was told it's better to run at stock, as an unstable overclock could potentially return incorrect values and basically waste everyone's time.
i tried several settings on my new PC (AMD Ryzen 5 1600X): stock clock speed/voltage with the CPU's built-in boosting, a small OC (tested for stability), and full stock settings with no boost at all, but the issue persists. i have AMD virtualization enabled, i've checked memory for errors and everything seems fine. i've not run into many hardware issues on this build so i guess i'm a bit confused as to where i should start with troubleshooting this.
one last thing i'm curious about: does clock speed matter while a task is in progress? the power plan on Windows 10 and Ryzen PCs is a little weird and sometimes my clock speed will fluctuate, going from maybe 3.6 GHz down to 3.4 GHz. should i take steps to prevent that from happening?
thanks and sorry for the wall of text.
ID: 36931
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36932 - Posted: 1 Oct 2018, 14:49:52 UTC - in response to Message 36931.  

The random percent-complete numbers at completion are normal. The grossly inaccurate time-remaining numbers are also normal.

Tasks just disappearing with no "uploading" or "ready to submit" status is also normal. The reason is that all the uploading happens inside the VBox.

What is abnormal is a task that runs for only about 1,100 seconds. Normally Theory and LHCb tasks run for a minimum of 12 hours and a maximum of 18. Over those 12 to 18 hours the task downloads and completes a number of jobs (sub-tasks) under Condor. For the past few weeks Condor has, on occasion, been unable to send jobs. If a task doesn't get a job within ~1,100 seconds of starting, it quits and you get "Error while computing" in your list of tasks. In the stderr output for such tasks you'll see the "EXIT_NO_SUB_TASKS" error currently under discussion on the Test4Theory message board. This is not caused by anything on your end, such as shifting clock speed. It is entirely a problem on their end.
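
If you save a task's stderr output to a text file, a rough check for this failure could look like the sketch below. Only the "EXIT_NO_SUB_TASKS" string and the ~1,100 second figure come from this thread; the function name, the 20% slack, and the idea of passing the run time in by hand are assumptions made for illustration, not part of BOINC or the project's tooling.

```python
# Error string named in this thread.
NO_JOB_MARKER = "EXIT_NO_SUB_TASKS"
# Approximate figure quoted above; not an official constant.
NO_JOB_TIMEOUT_SECS = 1100


def looks_like_no_job_failure(stderr_text: str, run_time_secs: float) -> bool:
    """Return True if a task looks like it quit because no sub-job arrived.

    Heuristic only (hypothetical helper): the marker must appear in the
    stderr text and the run time must be close to the ~1,100 s figure
    (20% slack is an arbitrary choice).
    """
    short_run = run_time_secs <= NO_JOB_TIMEOUT_SECS * 1.2
    return NO_JOB_MARKER in stderr_text and short_run
```

A task that ran for 12 hours, or one whose stderr lacks the marker, would not be flagged by this check.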
ID: 36932
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 991
Credit: 6,426,616
RAC: 480
Message 36933 - Posted: 1 Oct 2018, 15:04:30 UTC - in response to Message 36931.  

To extend bronco's answer:

Theory and LHCb both run sub-jobs within the virtual machine. When a sub-job ends after the 12-hour mark, the task ends successfully and no new sub-jobs are requested.
Every now and then there are no sub-jobs available. If this happens with the first sub-job requested, your task will generate an error.
If it happens after you have done one or more sub-jobs, the VM will be shut down gracefully after 10 minutes and you will return a valid task, but with a run time shorter than 12 hours.
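
The Theory/LHCb behaviour described above can be summarised as a small decision function. This is only a model of the description in this post, not the project's actual wrapper code; the function name, its parameters, and the returned strings are made up for illustration.

```python
TWELVE_HOURS_SECS = 12 * 60 * 60


def task_outcome(sub_jobs_done: int, job_available: bool, elapsed_secs: int) -> str:
    """Model what happens at the moment the VM would ask for another sub-job."""
    if elapsed_secs >= TWELVE_HOURS_SECS:
        # Past the 12-hour mark: finish normally, request nothing more.
        return "valid: 12 hours reached, no new sub-job requested"
    if job_available:
        return "running: next sub-job starts"
    if sub_jobs_done == 0:
        # Nothing available for the very first request.
        return "error: no sub-job for the first request"
    # At least one sub-job finished; VM shuts down gracefully after ~10 minutes.
    return "valid: graceful shutdown, run time shorter than 12 hours"
```

For example, `task_outcome(3, False, 4 * 3600)` models a task that finished three sub-jobs, found no more work after four hours, and still came back valid, which matches the short "completed" tasks in the first post.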

ATLAS runs only 1 job and gets its input on your host, where it is placed in a directory shared with the VM.
The (big) result file is also uploaded from your host. The run times depend on the type of job and how many cores you are using for the task.
ID: 36933
MechaToaster
Joined: 17 Aug 17
Posts: 15
Credit: 179,253
RAC: 0
Message 36934 - Posted: 1 Oct 2018, 16:56:37 UTC - in response to Message 36932.  

thanks guys, good to know. didn't wanna have to tweak a million settings to get things working.
ID: 36934
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36935 - Posted: 1 Oct 2018, 18:02:58 UTC - in response to Message 36934.  

Looking through the stderr output of a few of your LHCb and Theory tasks, I would say you pretty much have the settings tweaked properly. The ideal scenario is that they run from start to finish uninterrupted, which means no suspensions due to CPU busy, no preemptions by other projects, no shutdown for OS updates, etc.

Several of your tasks showed 1 start and 1 stop. If you can run them all that way you'll have a near 100% success rate. A few showed 2 starts, which is very good compared to some other users' tasks I see that are restarting more than 20 times. Obviously a restart is not a guarantee of a failed task, but the more restarts, the higher the likelihood the task will fail.

From the length of time between the stops and restarts I would guess maybe your tasks are being preempted by other projects. In that case you might consider boosting the "switch between tasks every __ minutes" setting to just over 1085, which is 5 minutes greater than the max LHCb/Theory task length of 18 hours.
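
The 1085 figure is just the 18-hour maximum converted to minutes plus a 5-minute margin. A quick sketch of the arithmetic (the function and constant names are only for illustration):

```python
MAX_TASK_HOURS = 18   # max LHCb/Theory task length mentioned above
MARGIN_MINUTES = 5    # safety margin mentioned above


def switch_interval_minutes(max_task_hours: int = MAX_TASK_HOURS,
                            margin_minutes: int = MARGIN_MINUTES) -> int:
    """Convert the longest expected task length to minutes and add a margin."""
    return max_task_hours * 60 + margin_minutes


print(switch_interval_minutes())  # 1085
```

If the maximum task length ever changes, the same calculation gives the new value to enter in the "switch between tasks every __ minutes" box.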
ID: 36935
MechaToaster
Joined: 17 Aug 17
Posts: 15
Credit: 179,253
RAC: 0
Message 37019 - Posted: 14 Oct 2018, 15:08:24 UTC - in response to Message 36935.  

Looking through the stderr output of a few of your LHCb and Theory tasks, I would say you pretty much have the settings tweaked properly. The ideal scenario is that they run from start to finish uninterrupted, which means no suspensions due to CPU busy, no preemptions by other projects, no shutdown for OS updates, etc.

Several of your tasks showed 1 start and 1 stop. If you can run them all that way you'll have a near 100% success rate. A few showed 2 starts, which is very good compared to some other users' tasks I see that are restarting more than 20 times. Obviously a restart is not a guarantee of a failed task, but the more restarts, the higher the likelihood the task will fail.

From the length of time between the stops and restarts I would guess maybe your tasks are being preempted by other projects. In that case you might consider boosting the "switch between tasks every __ minutes" setting to just over 1085, which is 5 minutes greater than the max LHCb/Theory task length of 18 hours.


what do you mean by my tasks are being preempted by other projects? i am only running LHC@home. i ran another thing on my phone before but i've always stuck to just LHC@home; i have one PC and only 16 GB of memory so i cannot do a lot of things at once.
ID: 37019
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 37021 - Posted: 14 Oct 2018, 17:50:50 UTC - in response to Message 37019.  

Deplorable misquote. LOL
ID: 37021


©2021 CERN