Message boards : Theory Application : Tasks run 4 days and finish with error

NOGOOD

Joined: 18 Nov 17
Posts: 119
Credit: 51,772,832
RAC: 22,157
Message 42240 - Posted: 19 Apr 2020, 20:13:38 UTC - in response to Message 42168.  

Does it mean that the project no longer enforces the 100-hour runtime limit?
Without manipulation the limit is 100 hours, but because of the restart from scratch the 100 hours started over.
See your result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=271085743
2020-04-05 17:07:18 (11100): Status Report: Job Duration: '360000.000000'
2020-04-05 17:07:18 (11100): Status Report: Elapsed Time: '6000.000000'
and 2.5 days later
2020-04-08 08:12:08 (1276): Status Report: Job Duration: '360000.000000'
2020-04-08 08:12:08 (1276): Status Report: Elapsed Time: '6000.065889'


Crystal Pellet, could you please take a look at this task:
https://yadi.sk/i/xTXicptxLPea4A

It is running on a PC with "manipulations". The generator is pythia8. Is it still fine?
ID: 42240
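The two Status Report values quoted in this post are in seconds. A minimal Python sketch, assuming the log lines keep exactly the format quoted above (the function name is made up for illustration), converts them to hours:

```python
import re

# Matches lines such as:
# 2020-04-05 17:07:18 (11100): Status Report: Job Duration: '360000.000000'
STATUS_RE = re.compile(r"Status Report: (Job Duration|Elapsed Time): '([0-9.]+)'")


def parse_status_report(lines):
    """Return {field name: seconds} for the Status Report lines found."""
    values = {}
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            values[match.group(1)] = float(match.group(2))
    return values


# The two lines logged 2.5 days after the first report, copied from the post.
log = [
    "2020-04-08 08:12:08 (1276): Status Report: Job Duration: '360000.000000'",
    "2020-04-08 08:12:08 (1276): Status Report: Elapsed Time: '6000.065889'",
]

report = parse_status_report(log)
print("limit:   %.1f hours" % (report["Job Duration"] / 3600))
print("elapsed: %.1f hours" % (report["Elapsed Time"] / 3600))
```

With the quoted lines the job duration comes out at 100 hours, while the elapsed time is still under 2 hours even though 2.5 days have passed, which fits the explanation that the counter started over after the restart.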
NOGOOD

Joined: 18 Nov 17
Posts: 119
Credit: 51,772,832
RAC: 22,157
Message 42241 - Posted: 19 Apr 2020, 20:19:38 UTC - in response to Message 42240.  

It keeps the CPU loaded the whole time it runs:
https://yadi.sk/i/jOYvQvscaOTYOw

Looks like it is fine.
ID: 42241
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1269
Credit: 8,478,478
RAC: 2,509
Message 42242 - Posted: 20 Apr 2020, 6:05:47 UTC - in response to Message 42241.  

Crystal Pellet, could you please take a look at this task:
https://yadi.sk/i/xTXicptxLPea4A

It is running on a PC with "manipulations". The generator is pythia8. Is it still fine?
It keeps the CPU loaded the whole time it runs:
https://yadi.sk/i/jOYvQvscaOTYOw

Looks like it is fine.
You may also have a look in the VM Console. You need the VirtualBox Extension Pack installed for that.
In the VM Console you can watch the progress with the key combination Alt-F2.
ID: 42242
Erich56

Joined: 18 Dec 15
Posts: 1687
Credit: 102,713,706
RAC: 122,535
Message 42243 - Posted: 20 Apr 2020, 6:16:42 UTC - in response to Message 42198.  

Is there any way we can ask not to be sent the long sherpa tasks? I've been keeping an eye on my system, and for every good sherpa task I get 5 bad ones that have to be aborted.

Some time ago a script was posted here that would abort all sherpa tasks if you happened to download one. I just can't find that post now.
It was a post by Crystal Pellet, around mid-February.
Unfortunately, it did not work on my machines, but it could nevertheless work on others.
I no longer remember in which thread exactly he posted it.
ID: 42243
Erich56

Joined: 18 Dec 15
Posts: 1687
Credit: 102,713,706
RAC: 122,535
Message 42244 - Posted: 20 Apr 2020, 12:28:25 UTC - in response to Message 42243.  

I looked for it and found it:

https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5275&postid=41547#41547

Good luck! Please let me (us) know if it works for you.
ID: 42244
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,502,196
RAC: 15,953
Message 42245 - Posted: 20 Apr 2020, 17:57:50 UTC - in response to Message 42244.  

It does work for me. Just put the bat file in the \ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome directory (normally a hidden directory in Windows) and start it from there.
ID: 42245
Erich56

Joined: 18 Dec 15
Posts: 1687
Credit: 102,713,706
RAC: 122,535
Message 42246 - Posted: 20 Apr 2020, 18:44:48 UTC - in response to Message 42245.  

It does work for me. Just put the bat file in the \ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome directory (normally a hidden directory in Windows) and start it from there.
Thanks, Harri.
Just one question: can this be done while a Theory task is running, or should there be no running Theory tasks, or should BOINC even be closed?
ID: 42246
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,502,196
RAC: 15,953
Message 42249 - Posted: 21 Apr 2020, 10:20:06 UTC - in response to Message 42246.  
Last modified: 21 Apr 2020, 10:24:10 UTC

It does work for me. Just put the bat file in the \ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome directory (normally a hidden directory in Windows) and start it from there.
Thanks, Harri.
Just one question: can this be done while a Theory task is running, or should there be no running Theory tasks, or should BOINC even be closed?

This can be done at any time. The bat file keeps looping and checking all downloaded LHC tasks until it is closed manually. It aborts every downloaded sherpa task (file name ending with _0, _1 or _2), normally before the task starts.
The loop repeats every 120 seconds. It looks a bit odd because the screen stays blank until the script finds something to abort; it then reports the abort on the screen. So just leave the script running.
ID: 42249
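The abort step described above can be sketched in Python as follows. This is only an illustration, not the bat file itself: it reuses the boinccmd call that appears in Crystal Pellet's script in this thread, the install path is the one used there, and the task base name in the example is hypothetical.

```python
import subprocess

# Path and project URL as used by the bat file posted in this thread.
BOINCCMD = r"C:\Program Files\BOINC\boinccmd"
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome"


def abort_commands(task_base, suffixes=("_0", "_1", "_2")):
    """Build one boinccmd abort command per possible task-name suffix."""
    return [
        [BOINCCMD, "--task", PROJECT_URL, task_base + suffix, "abort"]
        for suffix in suffixes
    ]


def abort_all(task_base, run=subprocess.run):
    """Run every abort command; names that do not exist simply fail quietly."""
    for command in abort_commands(task_base):
        run(command, check=False)


if __name__ == "__main__":
    # "Theory_example" is a placeholder, not a real task name.
    for command in abort_commands("Theory_example"):
        print(" ".join(command))
```

Only one of the three suffixes normally exists on a given machine, which is why the commands are allowed to fail.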
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1269
Credit: 8,478,478
RAC: 2,509
Message 42254 - Posted: 23 Apr 2020, 9:55:28 UTC - in response to Message 42249.  

Some time ago I rewrote the script so that it no longer aborts sherpa tasks that will probably succeed (most of them); the search is also done the other way around to be faster.

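@echo off
rem Abort downloaded Theory tasks whose job description is listed in results0.txt.
rem Run from the LHC project folder. The script loops until the window is closed.
:START
rem Collect the run= line of every job spec file in this folder.
findstr "run=" *.run >runspecs.txt
setlocal EnableDelayedExpansion

for /f "delims=" %%a in (runspecs.txt) do (
    set str=%%a
    set task=%%a
    rem Drop the first 31 characters and the last one, leaving the job description.
    set str=!str:~31,-1!
    rem Look the description up in the user-maintained list.
    find "!str!" results0.txt >nul
    if !ERRORLEVEL! equ 0 (
        rem Build the three possible task names from the first 14 characters of the line.
        set task0=Theory_!task:~0,14!_0
        set task1=Theory_!task:~0,14!_1
        set task2=Theory_!task:~0,14!_2
        "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome !task0! abort 2>NUL
        "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome !task1! abort 2>NUL
        "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome !task2! abort 2>NUL
        timeout /t 1 /NOBREAK >NUL
        rem Ask BOINC to contact the project so the aborts are reported.
        "C:\Program Files\BOINC\boinccmd" --project https://lhcathome.cern.ch/lhcathome update
        rem Log date, time, task and job description. The date offsets depend on the Windows locale.
        set logstr="!date:~3,10! !time:~0,8!: !task0:~0,-2! run=!str!"
        echo !logstr:~1,-1!
        echo !logstr:~1,-1! >>aborted_tasks.txt
    )
)

endlocal
rem Wait 5 minutes before the next pass.
timeout /t 300 /NOBREAK >NUL
GOTO START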

To run this script you have to place the bat file in the project folder, together with a file called "results0.txt".
Fill this file with the job descriptions that you think will not succeed or may run too long.
You can find those jobs here:
http://mcplots-dev.cern.ch/production.php?view=runs&rev=2378&display=unk for only unknown outcomes and here http://mcplots-dev.cern.ch/production.php?view=runs&rev=2378&display=fail for failed (and unknown) tasks.

Example results0.txt file:
ee zhad 133 - - herwig++ 2.7.1 UE-EE-5-CTEQ6L1
ee zhad 133 - - sherpa 1.2.2p default
ee zhad 133 - - sherpa 1.2.3 default
ee zhad 133 - - sherpa 1.3.0 default
ee zhad 133 - - sherpa 1.3.1 default
ee zhad 133 - - sherpa 1.4.0 default
ee zhad 133 - - sherpa 1.4.1 default

Normally there will be thousands of entries in this file, but you may decide to put only sherpa entries in it.
ID: 42254
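Trimming a long list down to sherpa entries only can be done by hand, or with a short Python sketch like the one below. It assumes that every entry has the same layout as the example file above, with the generator name as the sixth whitespace-separated field; the function names are made up for illustration.

```python
def generator_of(entry):
    """Return the generator name, assumed to be the 6th field of an entry."""
    fields = entry.split()
    return fields[5] if len(fields) > 5 else None


def only_generator(entries, generator="sherpa"):
    """Keep only the entries produced by the given generator."""
    return [entry for entry in entries if generator_of(entry) == generator]


# The example results0.txt content from the post above.
entries = [
    "ee zhad 133 - - herwig++ 2.7.1 UE-EE-5-CTEQ6L1",
    "ee zhad 133 - - sherpa 1.2.2p default",
    "ee zhad 133 - - sherpa 1.2.3 default",
    "ee zhad 133 - - sherpa 1.3.0 default",
    "ee zhad 133 - - sherpa 1.3.1 default",
    "ee zhad 133 - - sherpa 1.4.0 default",
    "ee zhad 133 - - sherpa 1.4.1 default",
]

for entry in only_generator(entries):
    print(entry)
```

The output can be saved as results0.txt; the herwig++ line from the example is dropped and the six sherpa lines are kept.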
NOGOOD

Joined: 18 Nov 17
Posts: 119
Credit: 51,772,832
RAC: 22,157
Message 42255 - Posted: 23 Apr 2020, 21:47:57 UTC

I got a successful result from the PC with "manipulations" after almost 10 days of runtime:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=271467702

CPU time is also almost 10 days.
The generator is pythia8.
It seems the 100-hour limit is not reasonable.
ID: 42255
NOGOOD

Joined: 18 Nov 17
Posts: 119
Credit: 51,772,832
RAC: 22,157
Message 42257 - Posted: 23 Apr 2020, 23:52:32 UTC - in response to Message 42255.  

The manipulations remove the 100-hour limit.
ID: 42257
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 7
Message 42262 - Posted: 24 Apr 2020, 12:25:30 UTC - in response to Message 41963.  

Should I just let them run until they fail, or should I abort any task with an estimated time of 4 days?
BOINC doesn't know how long the tasks will run. The 100 hours is just a placeholder to show something, but in fact useless.
You can see whether a job is making real progress by highlighting the task in BOINC Manager and clicking Show Graphics on the left.
You need the VirtualBox Extension Pack installed for that.


What should I see with show graphics? I've never looked at that before. I tried my Theory that's been running for 22 hours, and one that's just started, and both got:
"Test4Theory simulations
Waiting for some nice figures to show you.
Please, reload again in a few minutes
Meanwhile you can check the logs"
ID: 42262
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1269
Credit: 8,478,478
RAC: 2,509
Message 42264 - Posted: 24 Apr 2020, 14:17:40 UTC - in response to Message 42262.  

....
Meanwhile you can check the logs"
Check the logs ... or use the [VM Console] button in BOINC Manager to start a remote desktop and see what's going on.
Anyway, 'check the logs' means that the event-generating part of the job has not started yet, so no graphs can be created.
This mostly happens with 'sherpa' jobs, because before generating events they go through a pre-processing phase of initialization and integration four times.
ID: 42264
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 7
Message 42267 - Posted: 24 Apr 2020, 16:14:59 UTC - in response to Message 42264.  

Check the logs ... or use the [VM Console] button in BOINC Manager to start a remote desktop and see what's going on.


Looking at VM Console.... The one that's run longer is indeed a Sherpa, but I don't know how to interpret what I see.

Anyway, 'check the logs' means that the event-generating part of the job has not started yet, so no graphs can be created.
This mostly happens with 'sherpa' jobs, because before generating events they go through a pre-processing phase of initialization and integration four times.


Even after 22 hours? I'll just let them keep going. Most of my tasks seem to be validated ok on this machine, so I'll let them be.

I did have them running on some slower computers as well, and got a lot of errors on anything but sixtrack. Is there a minimum requirement for running VirtualBox WUs? They seemed to run OK, but over 50% failed. The slower machines have 8GB RAM, but ancient CPUs (as in DDR2 and DDR3 era).
ID: 42267
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 7
Message 42268 - Posted: 24 Apr 2020, 16:25:56 UTC - in response to Message 42264.  
Last modified: 24 Apr 2020, 16:27:10 UTC

....
Meanwhile you can check the logs"
Check the logs ... or use the [VM Console] button in BOINC Manager to start a remote desktop and see what's going on.
Anyway, 'check the logs' means that the event-generating part of the job has not started yet, so no graphs can be created.
This mostly happens with 'sherpa' jobs, because before generating events they go through a pre-processing phase of initialization and integration four times.


Sherpa log ends in "integration time: ( 20h 54m 18s elapsed / 2316d 6h 59m 20s left )"
That's more than 6 years remaining.

Is there a way for me to tell whether it's going to be a fruitless effort? If so, can't they abort themselves?
ID: 42268
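The "left" figure in that Sherpa log line can be converted to years with a few lines of Python. This is only a sketch: it assumes the line always has the form quoted above, and the function names are hypothetical.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def duration_seconds(text):
    """Convert a string such as '2316d 6h 59m 20s' to seconds."""
    parts = re.findall(r"(\d+)\s*([dhms])\b", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


def parse_integration_time(line):
    """Return (elapsed_seconds, left_seconds), or None if the line does not match."""
    match = re.search(r"\(\s*(.*?)\s+elapsed\s*/\s*(.*?)\s+left\s*\)", line)
    if match is None:
        return None
    return duration_seconds(match.group(1)), duration_seconds(match.group(2))


line = "integration time: ( 20h 54m 18s elapsed / 2316d 6h 59m 20s left )"
elapsed, left = parse_integration_time(line)
print("elapsed: %.1f hours" % (elapsed / 3600))
print("left:    %.1f years" % (left / (365 * 86400)))
```

For the quoted line this gives roughly 20.9 hours elapsed and about 6.3 years left, so the estimate alone already shows that the task cannot finish within any runtime limit discussed in this thread.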
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1269
Credit: 8,478,478
RAC: 2,509
Message 42269 - Posted: 24 Apr 2020, 17:17:17 UTC - in response to Message 42268.  

Sherpa log ends in "integration time: ( 20h 54m 18s elapsed / 2316d 6h 59m 20s left )"
That's more than 6 years remaining.

Is there a way for me to tell whether it's going to be a fruitless effort? If so, can't they abort themselves?

The only way we volunteers have is to compare the job description (1st line of running.log, visible via Show Graphics or VM Console ALT-F1) with the Failed jobs list.
If it is in the list, it has not succeeded so far, which, however, does not mean that it will never succeed.
It's up to you what to do with such a task: abort it or give it a try.
ID: 42269
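The comparison described above can also be done with a small Python helper instead of by eye. This is a sketch under stated assumptions: the job description is copied by hand from the first line of running.log, the failed list has been saved as plain text with one description per line, and all function names are hypothetical. It does an exact match after collapsing whitespace, which is stricter than the substring search used by the bat file in this thread.

```python
def normalize(description):
    """Collapse runs of whitespace so spacing differences do not matter."""
    return " ".join(description.split())


def load_failed_list(lines):
    """Build a set of normalized job descriptions, skipping empty lines."""
    return {normalize(line) for line in lines if line.strip()}


def is_on_failed_list(description, failed):
    """True if the job description appears in the failed list."""
    return normalize(description) in failed


failed = load_failed_list([
    "ee zhad 133 - - sherpa 1.2.2p default",
    "ee zhad 133 - - sherpa 1.4.1 default",
    "",
])

# Extra spaces in the copied description do not prevent a match.
print(is_on_failed_list("ee  zhad 133 - -  sherpa 1.4.1 default", failed))
# A description that is not in the list (made up for this example).
print(is_on_failed_list("ee zhad 133 - - sherpa 9.9.9 default", failed))
```

A match only says that the job has not succeeded so far, so the decision to abort or keep the task still rests with the volunteer.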
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 7
Message 42270 - Posted: 24 Apr 2020, 18:33:45 UTC - in response to Message 42269.  

The only way we volunteers have is to compare the job description (1st line of running.log, visible via Show Graphics or VM Console ALT-F1) with the Failed jobs list.
If it is in the list, it has not succeeded so far, which, however, does not mean that it will never succeed.
It's up to you what to do with such a task: abort it or give it a try.


I'll let it do what the project team want it to do. If they think it should keep trying for 4 days, then that's what it gets.

I assume VirtualBox is OK with being suspended a lot? The machine I run them on does Rosetta as well, and BOINC tends to switch about a lot. I have BOINC set to switch between applications every 60 minutes. Not sure why it's switching so much, as I have 4 cores allocated to BOINC, with Rosetta set to 3 times the weight of LHC, so you'd think it would just leave 1 core permanently on LHC and 3 on Rosetta.
ID: 42270
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,502,196
RAC: 15,953
Message 42272 - Posted: 24 Apr 2020, 21:45:14 UTC - in response to Message 42270.  

I assume VirtualBox is OK with being suspended a lot? The machine I run them on does Rosetta as well, and BOINC tends to switch about a lot. I have BOINC set to switch between applications every 60 minutes. Not sure why it's switching so much, as I have 4 cores allocated to BOINC, with Rosetta set to 3 times the weight of LHC, so you'd think it would just leave 1 core permanently on LHC and 3 on Rosetta.


Actually, some of the VirtualBox tasks do not allow suspension at all. They will just start from the beginning again. I think that Atlas and Theory tasks aren't so picky about short suspensions. Remember to select 'Leave non-GPU tasks in memory while suspended' in your computing preferences. Suspended tasks will then have a better chance of finishing successfully.
ID: 42272
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 7
Message 42274 - Posted: 24 Apr 2020, 22:01:07 UTC - in response to Message 42272.  
Last modified: 24 Apr 2020, 22:06:46 UTC

I assume VirtualBox is OK with being suspended a lot? The machine I run them on does Rosetta as well, and BOINC tends to switch about a lot. I have BOINC set to switch between applications every 60 minutes. Not sure why it's switching so much, as I have 4 cores allocated to BOINC, with Rosetta set to 3 times the weight of LHC, so you'd think it would just leave 1 core permanently on LHC and 3 on Rosetta.


Actually, some of the VirtualBox tasks do not allow suspension at all. They will just start from the beginning again. I think that Atlas and Theory tasks aren't so picky about short suspensions. Remember to select 'Leave non-GPU tasks in memory while suspended' in your computing preferences. Suspended tasks will then have a better chance of finishing successfully.


I had that unticked so I didn't run out of RAM, but I guess Windows can always page it.

Why does it only have the option for non-GPU tasks? Because Windows can't page those?

I see the manual says "If checked, suspended tasks stay in memory, and resume with no work lost. If unchecked, suspended tasks are removed from memory, and resume from their last checkpoint." Is this the default? I can't remember. And it should mean if it's unchecked you just lose a small amount of work, but it shouldn't break the task.
ID: 42274
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,502,196
RAC: 15,953
Message 42283 - Posted: 25 Apr 2020, 21:39:53 UTC - in response to Message 42274.  

I think the GPU apps cannot keep in memory what was happening on the GPU if the calculation is interrupted, or at least they could not use it correctly and it caused some problems. So the GPU's memory is released if computation is suspended.

The VirtualBox applications have a double saving system: they checkpoint in BOINC and save the virtual machine state in VirtualBox Manager, but I don't know if or how the BOINC checkpoint transfers to VirtualBox when the calculation resumes. I know that if you have several VirtualBox machines running from a traditional hard drive and you start or stop BOINC, you can swamp the hard drive I/O, and some of the virtual machines can fail with an unrecoverable error. SSDs can handle this disk I/O better.
ID: 42283