Message boards : ATLAS application : Atlas Simulation 1.01 (Vbox64) will not finish

greg_be

Joined: 28 Dec 08
Posts: 289
Credit: 2,059,226
RAC: 1,530
Message 38174 - Posted: 9 Mar 2019, 8:09:32 UTC - in response to Message 38172.  

To all the rest a general comment: I just got 7 new tasks. Run time is set for 3hrs between tasks as I see they are supposed to complete in 3:25 roughly.
If you got the 3:25 from the Remaining time in BOINC Manager then it's likely not very accurate. Yeah, I know, it shouldn't be that way but that's the way it is. The % complete figure is pretty much useless too. I strongly suggest boosting the switch between tasks time to 10 hours or more until you get a better idea of how long the tasks actually take, otherwise you're setting yourself up for more failed tasks.
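For anyone who prefers a local override file to the BOINC Manager dialog, here is a minimal sketch of what the advice above could look like. It assumes that `cpu_scheduling_period_minutes` (switch between tasks) and `disk_interval` (checkpoint interval) are the relevant preference tags and that the client reads them from `global_prefs_override.xml` in its data directory; check that against the BOINC documentation for your client version. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(switch_minutes, checkpoint_seconds):
    """Build the text of a local BOINC preferences override file.

    The tag names are assumptions based on BOINC's documented preferences,
    not something taken from this thread.
    """
    root = ET.Element("global_preferences")
    # "Switch between tasks every N minutes"
    ET.SubElement(root, "cpu_scheduling_period_minutes").text = f"{switch_minutes:.6f}"
    # "Request tasks to checkpoint at most every N seconds"
    ET.SubElement(root, "disk_interval").text = f"{checkpoint_seconds:.6f}"
    return ET.tostring(root, encoding="unicode")


# 10 hours between switches, checkpoints every 60 seconds
override_xml = build_prefs_override(switch_minutes=600, checkpoint_seconds=60)
print(override_xml)
```

The sketch only prints the XML. Saving it and telling the client to re-read its local preferences is left out on purpose.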



Changed to 4hrs between switching.
That should cover these tasks, will see what I get after I am done with the 3:25 stuff.
greg_be

Joined: 28 Dec 08
Posts: 289
Credit: 2,059,226
RAC: 1,530
Message 38175 - Posted: 9 Mar 2019, 8:14:34 UTC - in response to Message 38173.  
Last modified: 9 Mar 2019, 8:20:56 UTC

UAM, you are talking about stuff I have no idea how to do or where to find it.
VM and VM extensions are up to date. That's all I know about other than opening VM console on its own.
Never have seen anything about using BOINC to open VM.

You can access the Consoles and the graphics/ log-files with BOINC Manager.
Highlight a running task, then press "Show VM Console" or "Show graphics" in the "Commands" column on the left.


Nope.
Show VM takes me to a black screen to log into their system or something like that.
Need a username and password.

Graphics, that takes me to the homepage of CERN.


Now, when I open up VM via Windows, then I can see logs and so forth.
No graphics.
Since the task just started, there is nothing unusual to report.
Just the usual setup tasks and starting of the computing.
greg_be

Joined: 28 Dec 08
Posts: 289
Credit: 2,059,226
RAC: 1,530
Message 38178 - Posted: 9 Mar 2019, 12:00:32 UTC

4hrs not long enough. Looks more like 6. (68.874% done)
1:38 remaining still.
BOINC has put it into waiting to run status and moved on.
4hrs from now it will come back.
Checkpointing is set for every 60 seconds, by the way.
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38180 - Posted: 9 Mar 2019, 13:52:17 UTC - in response to Message 38178.  

4hrs not long enough. Looks more like 6. (68.874% done)
1:38 remaining still.
BOINC has put it into waiting to run status and moved on.
4hrs from now it will come back.
Checkpointing is set for every 60 seconds, by the way.


Ignore the percent done because it is BS. AGAIN... it should not be that way but it is.
Focus on this fact and this fact alone... ATLAS tasks run until they process 200 events. When the task resumes the event counter will go back to 0 and it will attempt to process 200 events again. I can almost guarantee you 6 hours will not be enough. In fact maybe even 10 hours won't. So then what's going to happen? Well, it will suspend again and when it resumes the event counter will reset to 0. Yep, you should have set it to 10 or maybe even more.
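To make the point concrete, here is a toy model of the behaviour described above. It is only a sketch, assuming that every switch-out throws away all progress and that something else is always waiting to run. The function name and the 8-hour task length are made up for illustration.

```python
def simulate_switching(task_hours, switch_hours, max_slices=10):
    """Return (hours_spent, finished) for a task that restarts from event 0
    every time BOINC switches away from it.

    task_hours:   uninterrupted run time the task needs (hypothetical value)
    switch_hours: the "switch between tasks" setting
    max_slices:   how many time slices to try before giving up
    """
    spent = 0.0
    for _ in range(max_slices):
        if switch_hours >= task_hours:
            # The slice is long enough: the task finishes within it.
            return spent + task_hours, True
        # The slice ends first: all progress made in it is lost.
        spent += switch_hours
    return spent, False


print(simulate_switching(task_hours=8, switch_hours=4))   # (40.0, False)
print(simulate_switching(task_hours=8, switch_hours=10))  # (8.0, True)
```

Under these assumptions a 4-hour slice burns 40 hours over ten attempts without finishing, while a 10-hour slice finishes on the first pass.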
Jim1348

Joined: 15 Nov 14
Posts: 576
Credit: 18,029,565
RAC: 23,105
Message 38181 - Posted: 9 Mar 2019, 15:15:43 UTC

It seems that native ATLAS does not have this problem. I run three WUs at a time, with two cores per WU on my i7-4770 (Ubuntu 16.04).
As a test, I suspended them for one minute, and then resumed them. They picked up with no problem where they left off.

I don't know why VBox has this problem, but it is not inherent to ATLAS (I have LAIM enabled, if that matters).
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38182 - Posted: 9 Mar 2019, 15:39:06 UTC - in response to Message 38181.  
Last modified: 9 Mar 2019, 16:28:55 UTC

It seems that native ATLAS does not have this problem. I run three WUs at a time, with two cores per WU on my i7-4770 (Ubuntu 16.04).
As a test, I suspended them for one minute, and then resumed them. They picked up with no problem where they left off.

I don't know why VBox has this problem, but it is not inherent to ATLAS (I have LAIM enabled, if that matters).
I've found that if suspended for just 1 minute, native ATLAS tasks will frequently, but not always, resume from where they left off. When suspended for more than a few minutes, the event counter resets to 0. You might not notice that reset in the console, but if you track the events by parsing them out of the log files buried deep within subfolders in the shared folder, you can see that the event counter definitely resets (if suspended long enough). I suspect it has little to do with LAIM and more to do with the CVMFS cache or something?
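A rough sketch of the log-parsing idea, for anyone who wants to try it. The wording "N events processed so far" is an assumption about what the log lines look like, and the sample lines are invented, so adjust the pattern to whatever your log files actually contain. No log path is given here because it differs per task.

```python
import re

# Assumed wording of the progress lines; not confirmed in this thread.
EVENT_LINE = re.compile(r"(\d+) events processed so far")


def extract_event_counts(lines):
    """Pull the running event counter out of each matching log line."""
    counts = []
    for line in lines:
        match = EVENT_LINE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def count_resets(counts):
    """Count how often the counter went backwards, i.e. the task restarted."""
    return sum(1 for prev, cur in zip(counts, counts[1:]) if cur < prev)


# Invented sample lines, for illustration only.
sample_log = [
    "INFO 1 events processed so far",
    "INFO 2 events processed so far",
    "INFO starting up",
    "INFO 1 events processed so far",
    "INFO 2 events processed so far",
    "INFO 3 events processed so far",
]
counts = extract_event_counts(sample_log)
print(counts, count_resets(counts))  # [1, 2, 1, 2, 3] 1
```

For a real log, pass an open file object instead of the sample list. A task running on several cores may need one counter per worker.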
greg_be

Joined: 28 Dec 08
Posts: 289
Credit: 2,059,226
RAC: 1,530
Message 38183 - Posted: 9 Mar 2019, 18:07:25 UTC - in response to Message 38180.  
Last modified: 9 Mar 2019, 18:21:50 UTC

4hrs not long enough. Looks more like 6. (68.874% done)
1:38 remaining still.
BOINC has put it into waiting to run status and moved on.
4hrs from now it will come back.
Checkpointing is set for every 60 seconds, by the way.


Ignore the percent done because it is BS. AGAIN... it should not be that way but it is.
Focus on this fact and this fact alone... ATLAS tasks run until they process 200 events. When the task resumes the event counter will go back to 0 and it will attempt to process 200 events again. I can almost guarantee you 6 hours will not be enough. In fact maybe even 10 hours won't. So then what's going to happen? Well, it will suspend again and when it resumes the event counter will reset to 0. Yep, you should have set it to 10 or maybe even more.



Interesting. Well I will have to take a look at that.
Maybe for now I suspend all non ATLAS cpu tasks and hack off what I have in queue. BOINC has not let it resume yet.

Is there anywhere to see how many events it has done?

Note: Now that ATLAS has resumed, the % done is climbing steadily like .002-.003% per second. But remaining time clicks off 1 second every 2-3 seconds.
greg_be

Joined: 28 Dec 08
Posts: 289
Credit: 2,059,226
RAC: 1,530
Message 38184 - Posted: 9 Mar 2019, 18:13:29 UTC
Last modified: 9 Mar 2019, 18:16:41 UTC

Put all other CPU projects in suspend mode. Going to let ATLAS grind everything on its own as it wants.

In the process of suspending everything I forgot to tell LHC host to not send any new work, so I picked up some theory and CMS stuff. But those should be the last things to process. Right now I have a total of 7 ATLAS tasks. 6 in waiting and one processing.
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38185 - Posted: 9 Mar 2019, 18:58:32 UTC - in response to Message 38183.  

Is there anywhere to see how many events it has done?
I don't run ATLAS VBox anymore just ATLAS native so I can't say for sure but other responders in this thread seem to suggest there is a way. Perhaps review their suggestions?

Note: Now that ATLAS has resumed, the % done is climbing steadily like .002-.003% per second. But remaining time clicks off 1 second every 2-3 seconds.
That's the way they work and it's because the % done and remaining time are not calculated from the number of events processed. The numbers would be more accurate if they were calculated that way but it seems BOINC has no facility for doing so and probably never will.
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 38186 - Posted: 9 Mar 2019, 19:23:18 UTC - in response to Message 38184.  

...so I picked up some theory and CMS stuff. But those should be the last things to process.
I believe that's the way it's supposed to work but don't count on it. The only way to guarantee it is to suspend them, AFAIK.
Anyway, step-by-step you are slowly discovering what others have learned already and have mentioned in this thread... running ATLAS tasks alongside other types of LHC tasks and/or tasks from other projects requires micro-managing. You'll get fairly decent success rate if you micro-manage well but the optimum configuration is to run ATLAS all by itself. Yeah, I know, then you need at least 2 hosts if you want to participate in other projects and it ain't supposed to work that way but...

Right now I have a total of 7 ATLAS tasks. 6 in waiting and one processing.
Remember each one of those ATLAS tasks requires a d/l of ~300 MB to ~400 MB. If things go wrong and your host can't crunch them before deadline then they get cancelled and you've wasted a good chunk of your monthly download limit (assuming your ISP has such a limit). You might want to adjust your LHC@home prefs such that you cache fewer ATLAS tasks.
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 256
Credit: 11,287,889
RAC: 2
Message 38187 - Posted: 9 Mar 2019, 19:43:12 UTC - in response to Message 38185.  
Last modified: 9 Mar 2019, 19:43:38 UTC

Is there anywhere to see how many events it has done?

Show Console then ALT-F2 will show events and average times per event. Don't know how many times it does each group of events before moving on but the percentage progress isn't far away from the events/200 value.
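If you read the event count and the average time per event off that console, a back-of-the-envelope estimate could look like the sketch below. It assumes the 200 events per task mentioned earlier in the thread. The `workers` parameter is a guess at how to handle tasks running on several cores and may not match how the console reports its averages.

```python
def estimate_progress(events_done, avg_seconds_per_event, total_events=200, workers=1):
    """Estimate fraction done and seconds remaining from the event counter.

    total_events=200 comes from this thread; workers=1 is an assumption
    (one event processed at a time).
    """
    fraction_done = events_done / total_events
    remaining_seconds = (total_events - events_done) * avg_seconds_per_event / workers
    return fraction_done, remaining_seconds


# Made-up readings: 150 events done at an average of 120 seconds each.
fraction, remaining = estimate_progress(events_done=150, avg_seconds_per_event=120)
print(f"{fraction:.1%} done, about {remaining / 3600:.1f} h left")
```

This estimate is tied to events actually processed, which is what the BOINC progress bar does not do.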
greg_be

Joined: 28 Dec 08
Posts: 289
Credit: 2,059,226
RAC: 1,530
Message 38188 - Posted: 9 Mar 2019, 20:10:11 UTC - in response to Message 38186.  
Last modified: 9 Mar 2019, 20:25:44 UTC

**Bronco**- No download limit. I am in Europe and have unlimited DSL for TV and Computer.
Only my 4G is limited by the type of contract I use, but there are plenty of free wifi hubs that I have access to.
greg_be

Joined: 28 Dec 08
Posts: 289
Credit: 2,059,226
RAC: 1,530
Message 38189 - Posted: 9 Mar 2019, 20:18:03 UTC - in response to Message 38187.  

Is there anywhere to see how many events it has done?

Show Console then ALT-F2 will show events and average times per event. Don't know how many times it does each group of events before moving on but the percentage progress isn't far away from the events/200 value.


Interesting, how do you scroll back up?
And yes it seems to repeat events.
Several numbers keep getting repeated.
greg_be

Joined: 28 Dec 08
Posts: 289
Credit: 2,059,226
RAC: 1,530
Message 38190 - Posted: 9 Mar 2019, 20:23:46 UTC - in response to Message 38186.  

...so I picked up some theory and CMS stuff. But those should be the last things to process.
I believe that's the way it's supposed to work but don't count on it. The only way to guarantee it is to suspend them, AFAIK.
Anyway, step-by-step you are slowly discovering what others have learned already and have mentioned in this thread... running ATLAS tasks alongside other types of LHC tasks and/or tasks from other projects requires micro-managing. You'll get fairly decent success rate if you micro-manage well but the optimum configuration is to run ATLAS all by itself. Yeah, I know, then you need at least 2 hosts if you want to participate in other projects and it ain't supposed to work that way but...

Right now I have a total of 7 ATLAS tasks. 6 in waiting and one processing.
Remember each one of those ATLAS tasks requires a d/l of ~300 MB to ~400 MB. If things go wrong and your host can't crunch them before deadline then they get cancelled and you've wasted a good chunk of your monthly download limit (assuming your ISP has such a limit). You might want to adjust your LHC@home prefs such that you cache fewer ATLAS tasks.



From what I am seeing ATLAS time remaining decreases by 1 second every 3 real time seconds.
I reduced the total amount of tasks allowed by LHC in general to 5 down from unlimited.
And there is no way I can afford the electric bill of two physical hosts, well at least not until we see if we have money for solar panels after the rest of the long term house renovation is done. Now if there were some micro solar panels with a battery pack that could power my computer that would be a big plus. Then maybe I could dedicate a second host to exclusive ATLAS tasks. For now, I just have to learn how it functions and micro-manage it day by day.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1826
Credit: 123,805,392
RAC: 86,977
Message 38191 - Posted: 9 Mar 2019, 20:25:42 UTC - in response to Message 38189.  

Interesting, how do you scroll back up?

You can't.


And yes it seems to repeat events.
Several numbers keep getting repeated.

You may read this explanation:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4965&postid=38135
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1826
Credit: 123,805,392
RAC: 86,977
Message 38192 - Posted: 9 Mar 2019, 20:31:27 UTC - in response to Message 38190.  

... Then maybe I could dedicate a second host to exclusive ATLAS tasks.

Why not just a second BOINC client on the same host?
There are lots of howtos around.
To find them you may ask your favorite search engine.
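As a starting point for such a setup, the sketch below just assembles the command line for a second client instance. It assumes the client options `--allow_multiple_clients`, `--dir` and `--gui_rpc_port` behave as described in the BOINC client documentation. The directory name and port number are placeholders, and the executable name will differ between platforms.

```python
def second_client_command(data_dir, rpc_port, executable="boinc"):
    """Assemble the argument list for starting an additional BOINC client.

    data_dir and rpc_port are placeholders chosen by the user; the option
    names are assumptions to verify against your client's documentation.
    """
    return [
        executable,
        "--allow_multiple_clients",        # permit more than one client on the host
        "--dir", data_dir,                 # separate data directory for this instance
        "--gui_rpc_port", str(rpc_port),   # separate port so a manager can connect
    ]


# Hypothetical directory and port, for illustration only.
cmd = second_client_command("boinc_atlas", 31418)
print(" ".join(cmd))
```

The list can be handed to `subprocess.Popen` once the data directory exists. The sketch deliberately stops short of launching anything.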
greg_be

Joined: 28 Dec 08
Posts: 289
Credit: 2,059,226
RAC: 1,530
Message 38193 - Posted: 9 Mar 2019, 21:22:04 UTC - in response to Message 38192.  
Last modified: 9 Mar 2019, 21:31:03 UTC

... Then maybe I could dedicate a second host to exclusive ATLAS tasks.

Why not just a second BOINC client on the same host?
There are lots of howtos around.
To find them you may ask your favorite search engine.


Interesting..but wouldn't that cause a conflict of resources (cpu) with ATLAS needing all the cores I allow BOINC to have and the other projects also wanting to use the total amount of cores I have allocated at the same time?
greg_be

Joined: 28 Dec 08
Posts: 289
Credit: 2,059,226
RAC: 1,530
Message 38194 - Posted: 9 Mar 2019, 21:26:13 UTC - in response to Message 38191.  

Interesting, how do you scroll back up?

You can't.


And yes it seems to repeat events.
Several numbers keep getting repeated.

You may read this explanation:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4965&postid=38135


This also explains why there is no way for BOINC to accurately calculate the time.
It looks like 3:25 is more like 8 to 9 hrs max on my machine.
Now at 88% with 7hrs 15 mins running and "about" an hour left in remaining time.
Jim1348

Joined: 15 Nov 14
Posts: 576
Credit: 18,029,565
RAC: 23,105
Message 38195 - Posted: 9 Mar 2019, 21:39:32 UTC - in response to Message 38182.  
Last modified: 9 Mar 2019, 22:24:50 UTC

It seems that native ATLAS does not have this problem. I run three WUs at a time, with two cores per WU on my i7-4770 (Ubuntu 16.04).
As a test, I suspended them for one minute, and then resumed them. They picked up with no problem where they left off.

I don't know why VBox has this problem, but it is not inherent to ATLAS (I have LAIM enabled, if that matters).
I've found that if suspended for just 1 minute, native ATLAS tasks will frequently, but not always, resume from where they left off. When suspended for more than a few minutes, the event counter resets to 0. You might not notice that reset in the console, but if you track the events by parsing them out of the log files buried deep within subfolders in the shared folder, you can see that the event counter definitely resets (if suspended long enough). I suspect it has little to do with LAIM and more to do with the CVMFS cache or something?

I was not sure what would happen with a longer pause, since I run that machine 24/7 and never see it suspend a work unit (native ATLAS is the only CPU job running).
So I suspended all three that were running. Each had about 2 hours left to go, out of a 6 1/2 hour run.
Then, I resumed them 5 hours later. Immediately, all three work units started uploading "results", so clearly something was amiss.

But when I look at the Stderr output, it shows everything as normal. You can probably interpret what is going on better than I can.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=218866963
https://lhcathome.cern.ch/lhcathome/result.php?resultid=218866915
https://lhcathome.cern.ch/lhcathome/result.php?resultid=218873625

Almost as curious is that whoever else tried to run these got invalids after a short time. "Anonymous" needs to find a better use for his machines.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=108987793
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=108987602
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=108989900

EDIT: I suspect what may have happened is that ATLAS continued to run after I had "suspended" it. It looks like it ran for the expected time of about 4 1/2 hours (except one short one). That could explain it.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1826
Credit: 123,805,392
RAC: 86,977
Message 38198 - Posted: 9 Mar 2019, 23:05:49 UTC - in response to Message 38195.  

... native ATLAS is the only CPU job running).
So I suspended all three that were running.

This is independent of BOINC.
ATLAS native must not be suspended and resumed, because it will always start from scratch.
David Cameron explained somewhere (I can't find it ATM) that this is by design of the scientific app.

If you run ATLAS native inside your own VM, then you may suspend the VM instead.