Thread 'Processor Time Locks Up Elapsed Time Continues to Climb'

Author	Message
keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43093 - Posted: 23 Jul 2020, 17:36:11 UTC
In the last several weeks I've had a ton of WUs where the processor time freezes anywhere between one minute and 5 minutes while elapsed time continues to climb.
This is confirmed by Resource Monitor. VBox monitor shows the job as "running".
They will stay like this until I abort them.
When I look at my Tasks List after aborting them, many show zero CPU or Elapsed time. I assure you they all had at least a half hour of elapsed time and, as I said, 1 to 5 minutes of CPU time.
I was using the current version of VBox (6.1.12), so I down-levelled to the one on the project download page (6.0.14). Still happens.
Doesn't happen every time, but often enough I thought I'd mention it.
Anyone else seeing something similar?

maeax Joined: 2 May 07 Posts: 2301 Credit: 179,694,921 RAC: 30,876	Message 43094 - Posted: 23 Jul 2020, 19:31:33 UTC
Your Atlas tasks always show a 4800 MByte RAM setting:
2020-07-20 03:40:01 (13408): Setting Memory Size for VM. (4800MB)
This may be too small. 6250 MB would be better.

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43099 - Posted: 23 Jul 2020, 22:32:01 UTC - in response to Message 43094.
Thanks for the response.
Made the recommended change and even rebooted to make sure nothing hung around.
First WU to run: 1 minute 54 seconds CPU time over 3 hours of elapsed time.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,217,842 RAC: 94,966	Message 43101 - Posted: 24 Jul 2020, 15:58:55 UTC - in response to Message 43099.
Your logfiles show that you configure a 2-core setup with 4800 MB RAM.
This is the correct RAM value for a 2-core setup and doesn't require a change.
These 2 cores should also be set in your web preferences:
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
If you set a higher CPU value there (or even unlimited), the server will send an <rsc_memory_bound> value of up to 10200 MB, which the BOINC client uses to calculate the total RAM requirement. That value can't be influenced locally.
In addition, the number of concurrently running (ATLAS) tasks should be limited to ensure they all fit into the available 16 GB total RAM.
Unfortunately, a lower CPU value on the web preferences page will lead to a lower credit reward, a weakness caused by the BOINC client's handling of multicore apps.
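
The memory budgeting described in this post comes down to simple arithmetic, sketched below in Python. The function name `max_concurrent_tasks` and the `reserve_mb` parameter are hypothetical and only illustrate the idea; the 4800 MB, 10200 MB, and 16 GB figures are the ones quoted in the post. The BOINC client's real scheduling also depends on its own memory preferences, which this sketch does not model.

```python
def max_concurrent_tasks(total_ram_mb: int, per_task_mb: int, reserve_mb: int = 0) -> int:
    """Return how many tasks of per_task_mb fit into total_ram_mb.

    reserve_mb is a hypothetical allowance for the OS and other programs.
    """
    if per_task_mb <= 0:
        raise ValueError("per_task_mb must be positive")
    usable_mb = total_ram_mb - reserve_mb
    return max(usable_mb // per_task_mb, 0)


TOTAL_RAM_MB = 16 * 1024  # the 16 GB host mentioned in the post

# 2-core setup: 4800 MB per task
two_core_fit = max_concurrent_tasks(TOTAL_RAM_MB, 4800)

# Worst case the server may send as <rsc_memory_bound>: 10200 MB per task
worst_case_fit = max_concurrent_tasks(TOTAL_RAM_MB, 10200)

print(two_core_fit, worst_case_fit)
```

With no reserve, three 4800 MB tasks fit into 16 GB but only one 10200 MB task does, which is why the CPU value in the web preferences matters for how much RAM the client budgets.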

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43102 - Posted: 24 Jul 2020, 17:04:08 UTC - in response to Message 43101. Last modified: 24 Jul 2020, 17:05:36 UTC
Again, thanks for the response.
RAM set back to 4800. The rest of my projects thank you.
I have ATLAS max_concurrent set to 1 in the app_config file.
I'll make the project prefs change.
I've occasionally gotten a "waiting for memory" in BOINC Manager, but only when I've been doing other stuff on the computer.
Latest WU clicking along nicely: 19:42 CPU time with 10:00 elapsed time.

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43115 - Posted: 28 Jul 2020, 17:26:04 UTC Last modified: 28 Jul 2020, 17:28:27 UTC
And the saga continues.
3 valid WUs and 18 that I had to abort in the last 5 days.
Anybody else have any suggestions?

Greger Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0	Message 43123 - Posted: 29 Jul 2020, 11:13:40 UTC - in response to Message 43115. Last modified: 29 Jul 2020, 11:15:27 UTC
There are a few download errors on tasks:
WU download error: couldn't get input files
And a valid task shows:
2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!
.................
2020-07-28 08:57:01 (12684): Guest Log: No HITS file was produced
Could be a network issue.
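
Spotting these failed probes in a long stderr log by eye is tedious, so here is a minimal Python sketch that pulls them out. It assumes only the line format visible in the excerpt above; the function name `failed_cvmfs_probes` is hypothetical, and the sample text is copied from this post.

```python
import re
from typing import List

# Matches lines such as:
#   "... Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!"
# and captures the repository path without the trailing dots.
PROBE_FAILED = re.compile(r"Guest Log: Probing (/cvmfs/\S+?)\.\.\. Failed!")


def failed_cvmfs_probes(log_text: str) -> List[str]:
    """Return the CVMFS paths whose probe is reported as failed."""
    return PROBE_FAILED.findall(log_text)


SAMPLE = """\
2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!
2020-07-28 08:57:01 (12684): Guest Log: No HITS file was produced
"""

for path in failed_cvmfs_probes(SAMPLE):
    print("probe failed:", path)
```

Any path listed by this check means the VM could not reach that repository, which fits the network explanation suggested above.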

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43125 - Posted: 29 Jul 2020, 19:58:12 UTC - in response to Message 43123.
If it is a network issue, it is ONLY on ATLAS.
I can upload/download all other projects I crunch for, and I can access all other web sites.
I did have problems downloading three xxx.pool.1 files. Ended up aborting them.
But I really don't think my CPU time problem has anything to do with the network.

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43130 - Posted: 30 Jul 2020, 6:30:47 UTC
Just a note for consideration:
The last 6 WUs that I have aborted had one other aborted WU for the same task.
Apparently I'm not the only one, eh?

maeax Joined: 2 May 07 Posts: 2301 Credit: 179,694,921 RAC: 30,876	Message 43133 - Posted: 30 Jul 2020, 8:55:29 UTC - in response to Message 43130.
Atlas is not an easy project under Boinc!
You can reduce the work for Atlas to ONE task with as many CPUs as you want.
But remember the formula for the RAM you need for that; it is shown in this thread.
What matters is a controlled interruption on your side when you save a running task. It is better to start them and let them finish.
Atlas takes some experience on your side.

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43137 - Posted: 30 Jul 2020, 16:58:49 UTC - in response to Message 43133. Last modified: 30 Jul 2020, 17:00:07 UTC
I've run ATLAS as a standalone project and under the LHC umbrella since Sept of 2004.
Had a hiccup several years ago and my logs were showing no connection to server. Did some network setting tweaks and bumped my bandwidth with my ISP and things settled down.
Up until about 8 weeks ago I was chugging through WUs with no problems whatsoever.
I have one concurrent task, two CPUs max, and the correct RAM settings in app_config.
I have made no changes for over a year, until this started happening.
I don't stop and start BOINC randomly.
And as I said above, the last 6 WUs that I have aborted had one other aborted WU for the same task.
Last night I had one with 35 seconds of CPU and 3 and a half hours Elapsed. Just for grins I left it running. Somewhere around 02:30 local (based on current CPU time), it took off and is still running.

maeax Joined: 2 May 07 Posts: 2301 Credit: 179,694,921 RAC: 30,876	Message 43142 - Posted: 30 Jul 2020, 19:09:32 UTC
One idea: change from Atlas work to Theory work (only) in prefs.
Theory is easier to control and needs only one CPU per task (about 630 MByte RAM).
You can check via RDP with alt+F2 or alt+F3 in Boinc.
If Theory is working well, then the problem may be your network for Atlas tasks.

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43168 - Posted: 2 Aug 2020, 19:30:24 UTC - in response to Message 43142.
OK, I've turned off Atlas and selected Theory.
It'll take a day or two to clear the two Atlas WUs I've already got.

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43172 - Posted: 3 Aug 2020, 17:24:46 UTC Last modified: 3 Aug 2020, 17:25:11 UTC
OK, tried Theory. Same thing.
Well, the first two ran OK, then with the rest processor time gets to a random point, then stops while elapsed time keeps going.
I tried in both uni-thread and MT modes.

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43173 - Posted: 3 Aug 2020, 19:46:00 UTC - in response to Message 43172.
Aaaaaaand the next Theory WU is at 3 hours CPU time and counting.
sigh

Greger Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0	Message 43174 - Posted: 3 Aug 2020, 21:49:17 UTC
They can run for days.... Just check that cputime is close to runtime.
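
The "cputime close to runtime" check can be written down as a small Python sketch. The helper names and the thresholds (`min_ratio`, `grace_seconds`) are hypothetical choices for illustration, not project settings; the time strings in the example are the ones reported earlier in this thread.

```python
def to_seconds(clock: str) -> int:
    """Convert 'HH:MM:SS' or 'MM:SS' into seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def cpu_ratio(cpu: str, elapsed: str) -> float:
    """CPU time divided by elapsed time (0.0 if nothing has elapsed yet)."""
    elapsed_s = to_seconds(elapsed)
    if elapsed_s == 0:
        return 0.0
    return to_seconds(cpu) / elapsed_s


def looks_stalled(cpu: str, elapsed: str,
                  min_ratio: float = 0.5, grace_seconds: int = 600) -> bool:
    """Flag a task whose CPU time lags far behind its elapsed time.

    Tasks younger than grace_seconds are never flagged, since VM startup
    uses little CPU. Both thresholds are assumptions.
    """
    if to_seconds(elapsed) < grace_seconds:
        return False
    return cpu_ratio(cpu, elapsed) < min_ratio


# Values reported in this thread
print(looks_stalled("00:01:54", "03:00:00"))  # ATLAS task, almost no CPU time
print(looks_stalled("19:42", "10:00"))        # healthy 2-core ATLAS task
```

A multicore task can legitimately show a ratio above 1, as in the 19:42 CPU over 10:00 elapsed example, so only a ratio far below 1 is treated as suspicious here.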

maeax Joined: 2 May 07 Posts: 2301 Credit: 179,694,921 RAC: 30,876	Message 43175 - Posted: 3 Aug 2020, 22:36:47 UTC Last modified: 3 Aug 2020, 22:38:31 UTC
In your Boinc slots folder under cernvm/shared there is a file runRivet.log.
Its first line is the task info.
At the end of the file, as it grows, you can see how long it has been running; normally it is 100k events until it finishes.
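
To look for that file without browsing every slot by hand, a short Python sketch can search the BOINC data directory recursively. The function names are hypothetical, the data directory path has to be supplied by the user, and whether runRivet.log is visible on the host at all depends on the setup, so an empty result is a valid outcome.

```python
from pathlib import Path
from typing import List, Tuple


def find_rivet_logs(data_dir: str) -> List[Path]:
    """Return every runRivet.log found anywhere below data_dir."""
    return sorted(Path(data_dir).rglob("runRivet.log"))


def first_and_last_line(log_path: Path) -> Tuple[str, str]:
    """Return the first line (task info) and the last line (latest progress)."""
    lines = log_path.read_text(errors="replace").splitlines()
    if not lines:
        return ("", "")
    return (lines[0], lines[-1])


def report(data_dir: str) -> None:
    """Print task info and latest progress for each log that was found."""
    logs = find_rivet_logs(data_dir)
    if not logs:
        print("no runRivet.log found below", data_dir)
    for log in logs:
        head, tail = first_and_last_line(log)
        print(log)
        print("  task info:", head)
        print("  last line:", tail)
```

Calling `report()` with the path of the BOINC data folder lists each log with its first and last line; running it twice a few minutes apart shows whether the last line is still changing.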

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43176 - Posted: 3 Aug 2020, 23:09:41 UTC - in response to Message 43174. Last modified: 3 Aug 2020, 23:32:10 UTC
Gunde,
That's the point.
The CPU time stops increasing and never starts up again.

keputnam Joined: 27 Sep 04 Posts: 111 Credit: 8,604,494 RAC: 43	Message 43179 - Posted: 3 Aug 2020, 23:42:37 UTC - in response to Message 43175.
??
Is that file structure on Linux? Because it doesn't exist on my Win10 machine.
I have a Boinc Data Folder. Under that I have Projects and Slots directories, among others.
Current Theory job is running at CPU 00:00:52, Elapsed 00:34:07, in slot 8.
Neither the "cernvm\shared" folder nor the runRivet.log file exists anywhere in the DATA directory.

Greger Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0	Message 43181 - Posted: 4 Aug 2020, 1:16:34 UTC Last modified: 4 Aug 2020, 1:27:56 UTC
Do you have the Extension Pack added to VirtualBox?
If you have it, click on the task in BOINC Manager, then on the left bar look for "Show VM Console". It could be greyed out; if so, no session is open and it is stuck. But if it is clickable, a terminal should open and a login prompt screen should show up. If any critical error occurred, it is mostly posted there; if not, hit alt+F2 to view the job screen. This is the best view to see whether it is running any 'events', and issues would appear there if there are any. You could also get top (system monitor) using alt+F3.
If the task failed, info would show in stderr, but in some cases you could get more info from the console.
In this thread you can see how it looks inside the console: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&postid=29359#29359 (Yeti's checklist)
For issues regarding Theory you will find common errors in the Theory section at https://lhcathome.cern.ch/lhcathome/forum_forum.php?id=89
We would need a screenshot of the console or the stderr log to find any issue.