Processor Time Locks Up Elapsed Time Continues to Climb
Author | Message |
---|---|
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
In the last several weeks I've had a ton of WUs where the processor time freezes anywhere between 1 and 5 minutes while elapsed time continues to climb. This is confirmed by Resource Monitor. VBox monitor shows the job as "running". They will stay like this until I abort them. When I look at my Tasks List after aborting them, many show zero CPU or elapsed time. I assure you they all had at least half an hour of elapsed time and, as I said, 1 to 5 minutes of CPU time. I was using the current version of VBox (6.1.12), so I downgraded to the one on the project download page (6.0.14). Still happens. It doesn't happen every time, but often enough that I thought I'd mention it. Anyone else seeing something similar? |
Send message Joined: 2 May 07 Posts: 2090 Credit: 158,714,730 RAC: 128,190 |
Your ATLAS tasks always show a 4800 MB RAM setting: 2020-07-20 03:40:01 (13408): Setting Memory Size for VM. (4800MB) This may be too small; 6250 MB is better. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
Thanks for the response. Made the recommended change and even rebooted to make sure nothing hung around. First WU to run: 1 minute 54 seconds of CPU time over 3 hours of elapsed time. |
Send message Joined: 15 Jun 08 Posts: 2401 Credit: 225,331,420 RAC: 123,086 |
Your logfiles show that you configure a 2-core setup with 4800 MB RAM. This is the correct RAM value for a 2-core setup and doesn't require a change. These 2 cores should also be set in your web preferences: https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project If you set a higher CPU value there (or even unlimited), the server will send an <rsc_memory_bound> value of up to 10200 MB, which the BOINC client uses to calculate the total RAM requirement. That value can't be influenced locally. In addition, the number of concurrently running (ATLAS) tasks should be limited to ensure they all fit into the available 16 GB of total RAM. Unfortunately, a lower CPU value on the web preferences page will lead to a lower credit reward, a weakness caused by the BOINC client's handling of multicore apps. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
Again, thanks for the response. RAM set back to 4800; the rest of my projects thank you. I have ATLAS max_concurrent set to 1 in the app_config file. I'll make the project prefs change. I've occasionally gotten a "waiting for memory" in BOINC Manager, but only when I've been doing other stuff on the computer. Latest WU is clicking along nicely: 19:42 CPU time with 10:00 elapsed time. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
And the saga continues: 3 valid WUs and 18 that I had to abort in the last 5 days. Anybody else have any suggestions? |
Send message Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0 |
There are a few download errors on your tasks: WU download error: couldn't get input files. And a valid task shows: 2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed! Could be a network issue. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
If it is a network issue, it is ONLY on ATLAS. I can upload/download for all other projects I crunch for, and I can access all other web sites. I did have problems downloading three xxx.pool.1 files and ended up aborting them. But I really don't think my CPU time problem has anything to do with the network. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
Just a note for consideration: the last 6 WUs that I have aborted had one other aborted WU for the same task. Apparently I'm not the only one, eh? |
Send message Joined: 2 May 07 Posts: 2090 Credit: 158,714,730 RAC: 128,190 |
ATLAS is not an easy project under BOINC! You can reduce the work for ATLAS to ONE task with as many CPUs as you want. But remember the formula for the RAM you need, which is shown in this thread. What matters is a controlled interruption on your side when you save a running task. It is better to start tasks and let them finish. ATLAS needs some experience on your side. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
I've run ATLAS as a standalone project and under the LHC umbrella since Sept 2004. Had a hiccup several years ago when my logs were showing no connection to the server. Did some network setting tweaks and bumped my bandwidth with my ISP, and things settled down. Up until about 8 weeks ago I was chugging through WUs with no problems whatsoever. I have one concurrent task, two CPUs max, and the correct RAM settings in app_config. I had made no changes for over a year until this started happening. I don't stop and start BOINC randomly. And as I said above, the last 6 WUs that I have aborted had one other aborted WU for the same task. Last night I had one with 35 seconds of CPU and three and a half hours elapsed. Just for grins I left it running. Somewhere around 02:30 local (based on current CPU time), it took off and is still running. |
Send message Joined: 2 May 07 Posts: 2090 Credit: 158,714,730 RAC: 128,190 |
One idea: change from ATLAS work to Theory work (only) in your prefs. Theory is easier to control and needs only one CPU per task (about 630 MB RAM). You can check via RDP with Alt+F2 or Alt+F3 in BOINC. If Theory works well, then the problem with ATLAS tasks may be your network. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
OK, I've turned off ATLAS and selected Theory. It'll take a day or two to clear the two ATLAS WUs I've already got. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
OK, tried Theory. Same thing. Well, the first two ran OK; then with the rest, processor time gets to a random point and stops while elapsed time keeps going. I tried both single-thread and MT modes. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
Aaaaaaand the next Theory WU is at 3 hours CPU time and counting. Sigh. |
Send message Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0 |
They can run for days.... Just check that cputime is close to runtime. |
Send message Joined: 2 May 07 Posts: 2090 Credit: 158,714,730 RAC: 128,190 |
In your BOINC slots folder, under cernvm/shared, there is a file runRivet.log. The first line is the task info. At the end of the file, as it grows, you can see how far the task has run; normally it takes 100k events to finish. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
Gunde, that's the point: the CPU time stops increasing and never starts up again. |
Send message Joined: 27 Sep 04 Posts: 102 Credit: 7,224,652 RAC: 5,492 |
?? Is that file structure on Linux? Because it doesn't exist on my Win10 machine. I have a BOINC data folder; under that I have Projects and Slots directories, among others. The current Theory job is running at CPU 00:00:52, elapsed 00:34:07, in slot 8. Neither the “cernvm\shared” folder nor the runRivet.log file exists anywhere in the data directory. |
Send message Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0 |
Do you have the Extension Pack added to VirtualBox? If so, click on the task in BOINC Manager and look for "Show VM Console" in the left bar. It may be greyed out; if so, no session is open and the task is stuck. But if it is clickable, a terminal should open and a login prompt should show up. Any critical error is usually posted there; if not, hit Alt+F2 to view the job screen. This is the best view to see whether it is running any 'events', and any issues would appear there. You can also get top (system monitor) using Alt+F3. If the task failed, info will show in stderr, but in some cases you can get more info from the console. In this thread you can see how it looks inside the console: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&postid=29359#29359 (Yeti's checklist) For issues regarding Theory you will find common errors in the Theory section at https://lhcathome.cern.ch/lhcathome/forum_forum.php?id=89 We would need a screenshot of the console or the stderr log to find any issue. |
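
For anyone following along, here is a rough Python sketch of the two checks that come up in this thread: whether CPU time is keeping up with elapsed time, and how much RAM a multicore ATLAS VM is likely to want. It is only an illustration under stated assumptions. The function names and the 0.1 stall threshold are made up for this example and are not part of BOINC. The RAM rule (3000 MB plus 900 MB per core) is an assumption: it reproduces the 4800 MB quoted above for 2 cores and would give 10200 MB at 8 cores, but the project may change it.

```python
def parse_clock(text: str) -> int:
    """Convert 'HH:MM:SS' or 'MM:SS', as shown in BOINC Manager, to seconds."""
    total = 0
    for part in text.split(":"):
        total = total * 60 + int(part)
    return total


def cpu_efficiency(cpu_time: str, elapsed_time: str, cores: int = 1) -> float:
    """CPU time divided by (elapsed time * cores); 0.0 if nothing has elapsed."""
    elapsed = parse_clock(elapsed_time)
    if elapsed == 0:
        return 0.0
    return parse_clock(cpu_time) / (elapsed * cores)


def looks_stalled(cpu_time: str, elapsed_time: str, cores: int = 1,
                  threshold: float = 0.1) -> bool:
    """Flag a task whose CPU time lags far behind its elapsed time.

    The threshold is a hypothetical cut-off chosen for this sketch.
    """
    return cpu_efficiency(cpu_time, elapsed_time, cores) < threshold


def atlas_vm_ram_mb(cores: int) -> int:
    """ASSUMPTION: linear rule of 3000 MB base plus 900 MB per core."""
    return 3000 + 900 * cores


if __name__ == "__main__":
    # Theory task from the thread: CPU 00:00:52, elapsed 00:34:07, one core.
    print(looks_stalled("00:00:52", "00:34:07"))
    # Healthy 2-core ATLAS task from the thread: 19:42 CPU, 10:00 elapsed.
    print(looks_stalled("19:42", "10:00", cores=2))
    # RAM estimate for the 2-core setup discussed above.
    print(atlas_vm_ram_mb(2))
```

A task that is starting up will also show a low ratio for the first few minutes, so a check like this only means something once the task has been running for a while.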
©2024 CERN