Tasks run 4 days and finish with error
Author | Message |
---|---|
Send message Joined: 18 Nov 17 Posts: 124 Credit: 52,221,300 RAC: 26,637 |
Does it mean that the project no longer enforces the 100-hour runtime limit? Without manipulation the limit is 100 hours, but because the task restarted from scratch, the 100 hours began anew. Crystal Pellet, could you please take a look at this task: https://yadi.sk/i/xTXicptxLPea4A It is running on a PC with "manipulations". The generator is pythia8. Is it still fine? |
Send message Joined: 18 Nov 17 Posts: 124 Credit: 52,221,300 RAC: 26,637 |
Does it mean that the project no longer enforces the 100-hour runtime limit? Without manipulation the limit is 100 hours, but because the task restarted from scratch, the 100 hours began anew. It keeps the CPU loaded the whole time it runs: https://yadi.sk/i/jOYvQvscaOTYOw It looks like it is fine. |
Send message Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,710 RAC: 2,852 |
You may also have a look in the VM Console. You need the VirtualBox Extension Pack installed for that. "Crystal Pellet, could you please take a look at this task:" "It loads CPU all the time running:" In the VM Console you can watch the progress with the key combination Alt-F2. |
Send message Joined: 18 Dec 15 Posts: 1691 Credit: 104,594,022 RAC: 116,425 |
It was a post by Crystal Pellet, about mid-February. "Is there any way we can ask not to be sent the long sherpa tasks? I've been keeping an eye on my system and for every good sherpa task I get 5 bad ones that have to be aborted." Unfortunately, it did not work on my machines, but it could still work on others. I no longer remember exactly which thread he posted it in. |
Send message Joined: 18 Dec 15 Posts: 1691 Credit: 104,594,022 RAC: 116,425 |
It was a post by Crystal Pellet, about mid-February. "Is there any way we can ask not to be sent the long sherpa tasks? I've been keeping an eye on my system and for every good sherpa task I get 5 bad ones that have to be aborted." I looked for it and found it: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5275&postid=41547#41547 Good luck! Please let me (us) know if it works for you. |
Send message Joined: 28 Sep 04 Posts: 677 Credit: 43,746,821 RAC: 15,199 |
It does work for me. Just put the bat-file in the \ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome directory (normally a hidden directory in Windows) and start it from there. |
Send message Joined: 18 Dec 15 Posts: 1691 Credit: 104,594,022 RAC: 116,425 |
"It does work for me. Just put the bat-file in the \ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome directory (normally a hidden directory in Windows) and start it from there." Thanks, Harri. Just one question: can this be done while a Theory task is running, or should there be no running Theory tasks, or should BOINC even be closed? |
Send message Joined: 28 Sep 04 Posts: 677 Credit: 43,746,821 RAC: 15,199 |
"It does work for me. Just put the bat-file in the \ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome directory (normally a hidden directory in Windows) and start it from there." "Thanks, Harri." This can be done at any time. The bat-file keeps looping and checking all downloaded LHC tasks until it is closed manually. It aborts every downloaded sherpa task (filename ending with _0, _1 or _2), normally before the task starts. The loop repeats every 120 seconds. It looks a bit odd because the screen stays blank until the script finds something to abort; it then reports the abort on the screen. So just leave the script running. |
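The abort step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the bat-file itself: the `boinccmd` path and the `--task <url> <name> abort` form are the ones used by the script posted in this thread, while the function name `abort_commands` and the example task name are hypothetical.

```python
BOINCCMD = r"C:\Program Files\BOINC\boinccmd"  # assumed default install path
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome"


def abort_commands(task_base, suffixes=("_0", "_1", "_2")):
    """Return one boinccmd argument list per possible task-name suffix."""
    return [
        [BOINCCMD, "--task", PROJECT_URL, task_base + suffix, "abort"]
        for suffix in suffixes
    ]


if __name__ == "__main__":
    # Hypothetical task name; real names come from the downloaded task files.
    for cmd in abort_commands("Theory_example"):
        print(" ".join(cmd))
```

The sketch only builds the command lines; passing each list to `subprocess.run` would execute them, which is left out here so that nothing is aborted by accident.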
Send message Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,710 RAC: 2,852 |
Some time ago I rewrote the script so that it no longer aborts most of the sherpa tasks that will probably succeed; the search is also done the other way around, which is faster.

    @echo off
    :START
    rem Collect every "run=" line from the *.run files in this folder
    findstr "run=" *.run >runspecs.txt
    setlocal EnableDelayedExpansion
    for /f "delims=" %%a in (runspecs.txt) do (
      set str=%%a
      set task=%%a
      rem Keep the text from offset 31 up to, but not including, the last character
      set str=!str:~31,-1!
      rem Abort only if that text appears in results0.txt
      find "!str!" results0.txt >nul
      if !ERRORLEVEL! equ 0 (
        set task0=Theory_!task:~0,14!_0
        set task1=Theory_!task:~0,14!_1
        set task2=Theory_!task:~0,14!_2
        "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome !task0! abort 2>NUL
        "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome !task1! abort 2>NUL
        "C:\Program Files\BOINC\boinccmd" --task https://lhcathome.cern.ch/lhcathome !task2! abort 2>NUL
        timeout /t 1 /NOBREAK >NUL
        "C:\Program Files\BOINC\boinccmd" --project https://lhcathome.cern.ch/lhcathome update
        set logstr="!date:~3,10! !time:~0,8!: !task0:~0,-2! run=!str!"
        echo !logstr:~1,-1!
        echo !logstr:~1,-1! >>aborted_tasks.txt
      )
    )
    endlocal
    rem Wait 300 seconds, then scan again
    timeout /t 300 /NOBREAK >NUL
    GOTO START
    exit

To run this script, place the bat-file in the project folder together with a file called "results0.txt". Fill this file with the descriptions of jobs that you think will not succeed or may run too long. You can find those jobs here: http://mcplots-dev.cern.ch/production.php?view=runs&rev=2378&display=unk (unknown outcomes only) and here: http://mcplots-dev.cern.ch/production.php?view=runs&rev=2378&display=fail (failed and unknown tasks).

Example results0.txt file:

    ee zhad 133 - - herwig++ 2.7.1 UE-EE-5-CTEQ6L1
    ee zhad 133 - - sherpa 1.2.2p default
    ee zhad 133 - - sherpa 1.2.3 default
    ee zhad 133 - - sherpa 1.3.0 default
    ee zhad 133 - - sherpa 1.3.1 default
    ee zhad 133 - - sherpa 1.4.0 default
    ee zhad 133 - - sherpa 1.4.1 default

Normally there will be thousands of entries in this file, but you may decide to put only the sherpa entries in it. |
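If you want only the sherpa entries in results0.txt, a small filter along these lines could do the trimming. It is a minimal sketch that assumes every job description has the generator name as its sixth whitespace-separated field, as in the example entries above; the function name `keep_generator` and the input file name `all_jobs.txt` are hypothetical.

```python
def keep_generator(lines, generator="sherpa"):
    """Keep job descriptions whose sixth field equals `generator`."""
    kept = []
    for line in lines:
        fields = line.split()
        # Assumed layout: the generator name is the sixth field (index 5).
        if len(fields) > 5 and fields[5] == generator:
            kept.append(line.strip())
    return kept


if __name__ == "__main__":
    # "all_jobs.txt" is a hypothetical file holding the copied job list.
    with open("all_jobs.txt", encoding="utf-8") as src:
        selected = keep_generator(src)
    with open("results0.txt", "w", encoding="utf-8") as dst:
        dst.write("\n".join(selected) + "\n")
```

The lines are written back unchanged apart from stripped leading and trailing whitespace, so the substring search done by the batch script should still match them.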
Send message Joined: 18 Nov 17 Posts: 124 Credit: 52,221,300 RAC: 26,637 |
I got a successful result from the PC with "manipulations" after almost 10 days of runtime: https://lhcathome.cern.ch/lhcathome/result.php?resultid=271467702 The CPU time is also almost 10 days. The generator is pythia8. It seems the 100-hour limit is not reasonable. |
Send message Joined: 18 Nov 17 Posts: 124 Credit: 52,221,300 RAC: 26,637 |
Regarding the successful result from the PC with "manipulations" after almost 10 days of runtime: the manipulations remove the 100-hour limit. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 1 |
Should I just let them run until they fail, or should I abort any task with an estimated time of 4 days? BOINC doesn't know how long the tasks will run. The 100 hours is just a placeholder to show something, but is in fact useless. What should I see with Show graphics? I've never looked at that before. I tried it on my Theory task that's been running for 22 hours and on one that's just started, and both showed: "Test4Theory simulations Waiting for some nice figures to show you. Please, reload again in a few minutes Meanwhile you can check the logs" |
Send message Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,710 RAC: 2,852 |
Check the logs ... or use the button [VM Console] in BOINC Manager to start a remote desktop to see what's going on. Anyway, 'check the logs' means that the event-generating part of the job has not started yet, so no graphs can be created. This mostly happens with 'sherpa' jobs, because before generating events they go through a pre-processing stage of initialization and integration four times. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 1 |
"Check the logs ... or use the button [VM Console] in BOINC Manager to start a remote desktop to see what's going on." Looking at the VM Console... The one that's run longer is indeed a Sherpa, but I don't know how to interpret what I see. "Anyway 'check the logs' means that the event generating part of the job is not started yet and cannot create graphs." Even after 22 hours? I'll just let them keep going. Most of my tasks seem to validate OK on this machine, so I'll let them be. I did have them running on some slower computers as well, and got a lot of errors on anything but sixtrack. Is there a minimum requirement for running VirtualBox WUs? They seemed to run OK, but over 50% failed. The slower machines have 8GB RAM, but ancient CPUs (as in the DDR2 and DDR3 era). |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 1 |
"Check the logs ... or use the button [VM Console] in BOINC Manager to start a remote desktop to see what's going on." The Sherpa log ends in "integration time: ( 20h 54m 18s elapsed / 2316d 6h 59m 20s left )". That's more than 6 years remaining. Is there a way for me to tell whether it's going to be a fruitless effort? If so, can't they abort themselves? |
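For anyone who wants to turn such a log line into plain numbers, here is a minimal Python sketch. It assumes the line always has the "( ... elapsed / ... left )" shape quoted above and uses only the units d, h, m and s; the function names are hypothetical.

```python
import re

UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration such as '2316d 6h 59m 20s' to seconds."""
    pairs = re.findall(r"(\d+)\s*([dhms])", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in pairs)


def integration_times(line):
    """Return (elapsed_seconds, left_seconds) from a Sherpa progress line."""
    match = re.search(r"\(\s*(.*?)\s+elapsed\s*/\s*(.*?)\s+left\s*\)", line)
    if match is None:
        raise ValueError("no integration time found")
    return to_seconds(match.group(1)), to_seconds(match.group(2))


if __name__ == "__main__":
    line = "integration time: ( 20h 54m 18s elapsed / 2316d 6h 59m 20s left )"
    elapsed, left = integration_times(line)
    print(f"elapsed: {elapsed / 3600:.1f} hours")
    print(f"left: {left / 86400:.0f} days ({left / (365.25 * 86400):.1f} years)")
```

For the line quoted above, the remaining time is a little over 2316 days, which is a bit more than six years.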
Send message Joined: 14 Jan 10 Posts: 1286 Credit: 8,515,710 RAC: 2,852 |
Sherpa log ends in "integration time: ( 20h 54m 18s elapsed / 2316d 6h 59m 20s left )" The only way we volunteers have is to compare the job description (1st line of running.log - Show Graphics or VM Console ALT-F1) with the Failed jobs list. If it is in the list, it has not succeeded so far, which, however, does not mean that it will never succeed. It's up to you what to do with such a task: abort it or give it a try. |
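The comparison described above can be sketched in Python. This is a rough illustration that assumes the first line of running.log contains the job description in the same wording as the entries of the failed-jobs list; the function names and the sample log line are hypothetical.

```python
def normalise(text):
    """Collapse runs of whitespace so that spacing differences do not matter."""
    return " ".join(text.split())


def is_listed(first_log_line, listed_jobs):
    """Return True if any listed job description occurs in the log line."""
    haystack = normalise(first_log_line)
    for job in listed_jobs:
        needle = normalise(job)
        if needle and needle in haystack:
            return True
    return False


if __name__ == "__main__":
    failed = ["ee zhad 133 - - sherpa 1.4.1 default"]
    # Hypothetical first line of running.log; the real layout may differ.
    log_line = "run: ee  zhad 133 - - sherpa 1.4.1 default"
    print(is_listed(log_line, failed))
```

A match only says that the job has not succeeded so far, so the decision to abort or keep the task remains with the volunteer.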
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 1 |
"The only way we volunteers have is to compare the job description (1st line of running.log - Show Graphics or VM Console ALT-F1) with the Failed jobs list." I'll let it do what the project team wants it to do. If they think it should keep trying for 4 days, then that's what it gets. I assume VirtualBox is OK with it being suspended a lot? The machine I run them on does Rosetta as well, and BOINC tends to switch about a lot. I have BOINC set to switch between applications every 60 minutes. I'm not sure why it's switching so much, as I have 4 cores allocated to BOINC, with Rosetta set to 3 times the weight of LHC, so you'd think it would just leave 1 core permanently on LHC and 3 on Rosetta. |
Send message Joined: 28 Sep 04 Posts: 677 Credit: 43,746,821 RAC: 15,199 |
"I assume Virtualbox is ok with it being suspended a lot? As the machine I run them on does Rosetta aswell and Boinc tends to switch about a lot. I have Boinc set to switch between applications every 60 minutes. Not sure why it's switching so much, as I have 4 cores allocated to Boinc, with Rosetta set to 3 times the weight of LHC, so you'd think it would just leave 1 core permanently on LHC and 3 on Rosetta." Actually, some of the VirtualBox tasks do not allow suspension at all; they just start from the beginning again. I think that Atlas and Theory tasks are less sensitive to short suspensions. Remember to select 'Leave non-GPU tasks in memory while suspended' in your computing preferences. Suspended tasks will then have a better chance of finishing successfully. |
Send message Joined: 12 Aug 06 Posts: 418 Credit: 5,667,249 RAC: 1 |
I assume Virtualbox is ok with it being suspended a lot? As the machine I run them on does Rosetta aswell and Boinc tends to switch about a lot. I have Boinc set to switch between applications every 60 minutes. Not sure why it's switching so much, as I have 4 cores allocated to Boinc, with Rosetta set to 3 times the weight of LHC, so you'd think it would just leave 1 core permanently on LHC and 3 on Rosetta. I had that unticked so I didn't run out of RAM, but I guess Windows can always page it. Why does it only have the option for non-GPU tasks? Because Windows can't page those? I see the manual says "If checked, suspended tasks stay in memory, and resume with no work lost. If unchecked, suspended tasks are removed from memory, and resume from their last checkpoint." Is this the default? I can't remember. And it should mean if it's unchecked you just lose a small amount of work, but it shouldn't break the task. |
Send message Joined: 28 Sep 04 Posts: 677 Credit: 43,746,821 RAC: 15,199 |
I think GPU apps cannot keep in memory what was happening on the GPU when the calculation is interrupted, or at least they could not use it correctly and it caused problems. So the GPU's memory is released when computation is suspended. The VirtualBox applications have a double saving system: they checkpoint in BOINC and save the virtual machine state in VirtualBox Manager, but I don't know if or how the BOINC checkpoint is transferred to VirtualBox when the calculation resumes. I do know that if you have several VirtualBox machines running from a traditional hard drive and you start or stop BOINC, you can swamp the hard drive I/O, and some of the virtual machines can fail with an unrecoverable error. SSDs handle this disk I/O better. |
©2024 CERN