Message boards : ATLAS application : Processor Time Locks Up Elapsed Time Continues to Climb

keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43093 - Posted: 23 Jul 2020, 17:36:11 UTC

In the last several weeks I've had a ton of WUs where the processor time freezes somewhere between 1 and 5 minutes while elapsed time continues to climb. This is confirmed by Resource Monitor. The VBox monitor shows the job as "running".

They will stay like this until I abort them.

When I look at my Tasks List after aborting them, many show zero CPU or elapsed time. I assure you they all had at least half an hour of elapsed time and, as I said, 1 to 5 minutes of CPU time.

I was using the current version of VBox (6.1.12), so I downgraded to the one on the project download page (6.0.14). Still happens.

Doesn't happen every time, but often enough that I thought I'd mention it.

Anyone else seeing something similar?

ID: 43093 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,126,074
RAC: 105,437
Message 43094 - Posted: 23 Jul 2020, 19:31:33 UTC

Your ATLAS tasks always show a 4800 MB RAM setting:
2020-07-20 03:40:01 (13408): Setting Memory Size for VM. (4800MB)
This can be too small. 6250 MB is better.
ID: 43094 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43099 - Posted: 23 Jul 2020, 22:32:01 UTC - in response to Message 43094.  

Thanks for the response. Made the recommended change and even rebooted to make sure nothing hung around.

First WU to run: 1 minute 54 seconds of CPU time, over 3 hours of elapsed time.
ID: 43099 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,938,428
RAC: 137,508
Message 43101 - Posted: 24 Jul 2020, 15:58:55 UTC - in response to Message 43099.  

Your logfiles show that you configured a 2-core setup with 4800 MB RAM.
This is the correct RAM value for a 2-core setup and doesn't require a change.

These 2 cores should also be set in your web preferences:
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project

If you set a higher CPU value there (or even unlimited), the server will send an <rsc_memory_bound> value of up to 10200 MB, which the BOINC client uses to calculate the total RAM requirement. That value can't be influenced locally.

In addition, the number of concurrently running (ATLAS) tasks should be limited to ensure they all fit into the available 16 GB of total RAM.

Unfortunately, a lower CPU value on the web preferences page will lead to a lower credit reward.
This is a weakness caused by the BOINC client's handling of multicore apps.
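
To make the RAM arithmetic above concrete, here is a rough sketch. It assumes a linear rule of 3000 MB plus 900 MB per core. That rule is not stated in this post; it is only an assumption that happens to reproduce the 4800 MB figure for 2 cores, and under it the 10200 MB upper value would correspond to 8 cores. The 4000 MB reserve for the OS and other projects is a made-up placeholder.

```python
# Sketch only: the linear rule below is an ASSUMPTION chosen to match the
# 4800 MB (2 cores) and 10200 MB figures quoted in the post above.
BASE_MB = 3000       # assumed fixed overhead per ATLAS VM
PER_CORE_MB = 900    # assumed extra RAM per configured core


def atlas_ram_mb(cores: int) -> int:
    """Estimated RAM (MB) for one ATLAS task using `cores` cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return BASE_MB + PER_CORE_MB * cores


def max_concurrent_tasks(total_ram_mb: int, cores_per_task: int,
                         reserve_mb: int = 4000) -> int:
    """How many tasks fit once `reserve_mb` (hypothetical) is set aside."""
    usable = total_ram_mb - reserve_mb
    return max(0, usable // atlas_ram_mb(cores_per_task))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} core(s): {atlas_ram_mb(n)} MB")
    # 16 GB host, 2-core tasks, default reserve
    print("tasks that fit:", max_concurrent_tasks(16 * 1024, 2))
```

With these assumed numbers, a 16 GB host has room for two 2-core tasks but only one task at the 10200 MB size, which is why the web preference matters.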
ID: 43101 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43102 - Posted: 24 Jul 2020, 17:04:08 UTC - in response to Message 43101.  
Last modified: 24 Jul 2020, 17:05:36 UTC

Again, thanks for the response.


RAM set back to 4800. The rest of my projects thank you.

I have ATLAS max_concurrent set to 1 in the app_config file.

I'll make the project prefs change. I've occasionally gotten a "waiting for memory" status in BOINC Manager, but only when I've been doing other stuff on the computer.

Latest WU is clicking along nicely: 19:42 CPU time with 10:00 elapsed time.
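
For readers who have not set this up, the sketch below generates an app_config.xml with the settings mentioned in this post (one concurrent task, 2 CPUs, 4800 MB). The app name "ATLAS" and the plan class "vbox64_mt_mcore_atlas" are assumptions and should be checked against what your own client reports; the function name is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str = "ATLAS",                    # assumed app name
                     plan_class: str = "vbox64_mt_mcore_atlas",  # assumed plan class
                     max_concurrent: int = 1,
                     ncpus: int = 2,
                     memory_mb: int = 4800) -> str:
    """Return the text of a BOINC app_config.xml for one multicore VM app."""
    root = ET.Element("app_config")

    # <app>: limits how many tasks of this app run at the same time
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # <app_version>: core count and the memory size handed to the VM wrapper
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"

    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_app_config())
```

The output would go into app_config.xml in the project's folder under the BOINC data directory, and the client has to re-read its config files before the change takes effect.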
ID: 43102 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43115 - Posted: 28 Jul 2020, 17:26:04 UTC
Last modified: 28 Jul 2020, 17:28:27 UTC

And the saga continues.

3 valid WUs and 18 that I had to abort in the last 5 days.

Anybody else have any suggestions?
ID: 43115 · Report as offensive     Reply Quote
Greger

Send message
Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 43123 - Posted: 29 Jul 2020, 11:13:40 UTC - in response to Message 43115.  
Last modified: 29 Jul 2020, 11:15:27 UTC

There are a few download errors on your tasks:
WU download error: couldn't get input files


And a valid task shows:
2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!

2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!

2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!

.................

2020-07-28 08:57:01 (12684): Guest Log: No HITS file was produced


Could be a network issue.
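
If you want to check your own stderr output for the same pattern, a small sketch like the one below can pull out the failed probes. It only relies on the line format quoted above; the function names are made up for illustration, and it does not assume what a successful probe line looks like.

```python
import re

# Matches lines such as:
#   "... Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!"
PROBE_RE = re.compile(r"Guest Log: Probing (\S+)\.\.\. (\w+)")


def failed_probes(stderr_text: str) -> list[str]:
    """Return the repository paths whose probe line ends in 'Failed'."""
    failed = []
    for line in stderr_text.splitlines():
        match = PROBE_RE.search(line)
        if match and match.group(2) == "Failed":
            failed.append(match.group(1))
    return failed


def no_hits_produced(stderr_text: str) -> bool:
    """True if the log reports that no HITS file was produced."""
    return "No HITS file was produced" in stderr_text


if __name__ == "__main__":
    sample = (
        "2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!\n"
        "2020-07-28 08:47:28 (12684): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!\n"
        "2020-07-28 08:57:01 (12684): Guest Log: No HITS file was produced\n"
    )
    print(failed_probes(sample))
    print(no_hits_produced(sample))
```

A task that lists failed probes and no HITS file can still be marked valid, so the task list alone will not show this.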
ID: 43123 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43125 - Posted: 29 Jul 2020, 19:58:12 UTC - in response to Message 43123.  

If it is a network issue, it is ONLY on ATLAS. I can upload/download for all other projects I crunch for, and I can access all other web sites.

I did have problems downloading three xxx.pool.1 files. Ended up aborting them.

But I really don't think my CPU time problem has anything to do with the network.
ID: 43125 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43130 - Posted: 30 Jul 2020, 6:30:47 UTC

Just a note for consideration


The last 6 WUs that I have aborted had one other aborted WU for the same task.

Apparently I'm not the only one, eh?
ID: 43130 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,126,074
RAC: 105,437
Message 43133 - Posted: 30 Jul 2020, 8:55:29 UTC - in response to Message 43130.  

ATLAS is not an easy project under BOINC!
You can limit ATLAS to ONE task with as many CPUs as you want.
But remember the formula for the RAM you need for that; it is shown in this thread.
A controlled interruption on your side is important when you save a running task.
It is better to start them and let them finish.
ATLAS needs some experience on your side.
ID: 43133 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43137 - Posted: 30 Jul 2020, 16:58:49 UTC - in response to Message 43133.  
Last modified: 30 Jul 2020, 17:00:07 UTC

I've run ATLAS as a standalone project and under the LHC umbrella since September 2004.

Had a hiccup several years ago when my logs were showing no connection to the server. Did some network setting tweaks and bumped my bandwidth with my ISP, and things settled down.

Up until about 8 weeks ago I was chugging through WUs with no problems whatsoever.

I have one concurrent task, two CPUs max, and the correct RAM settings in app_config. I had made no changes for over a year, until this started happening.

I don't stop and start BOINC randomly.


And as I said above:
The last 6 WUs that I have aborted had one other aborted WU for the same task.

Last night I had one with 35 seconds of CPU and 3 and a half hours elapsed. Just for grins I left it running. Somewhere around 02:30 local (based on current CPU time), it took off and is still running.
ID: 43137 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,126,074
RAC: 105,437
Message 43142 - Posted: 30 Jul 2020, 19:09:32 UTC

One idea:
change from ATLAS work to Theory work (only) in your prefs.
Theory is easier to control and needs only one CPU per task (about 630 MB RAM).
You can check via RDP with Alt+F2 or Alt+F3 in BOINC.
If Theory works well, then the problem may be with your network for ATLAS tasks.
ID: 43142 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43168 - Posted: 2 Aug 2020, 19:30:24 UTC - in response to Message 43142.  

OK

I've turned off ATLAS and selected Theory.

It will take a day or two to clear the two ATLAS WUs I've already got.
ID: 43168 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43172 - Posted: 3 Aug 2020, 17:24:46 UTC
Last modified: 3 Aug 2020, 17:25:11 UTC

OK, tried Theory.

Same thing.

Well, the first two ran OK,

then with the rest, processor time gets to a random point and stops while elapsed time keeps going.

I tried in both uni-thread and MT modes.
ID: 43172 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43173 - Posted: 3 Aug 2020, 19:46:00 UTC - in response to Message 43172.  

Aaaaaaand the next Theory WU is at 3 hours CPU time and counting


sigh
ID: 43173 · Report as offensive     Reply Quote
Greger

Send message
Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 43174 - Posted: 3 Aug 2020, 21:49:17 UTC

They can run for days....

Just check that cputime is close to runtime.
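
As a rough illustration of that check, the sketch below compares CPU time with elapsed time, scaled by the number of cores. The threshold of 0.5 and the 10-minute warm-up period are arbitrary assumptions, not project recommendations, and the function names are made up. The sample figures are ones quoted elsewhere in this thread.

```python
def to_seconds(clock: str) -> int:
    """Convert 'SS', 'MM:SS' or 'HH:MM:SS' to seconds."""
    parts = [int(p) for p in clock.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unexpected time format: {clock!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


def cpu_ratio(cpu: str, elapsed: str, cores: int = 1) -> float:
    """CPU time divided by (elapsed time x cores); near 1.0 means busy."""
    elapsed_s = to_seconds(elapsed)
    if elapsed_s == 0:
        return 0.0
    return to_seconds(cpu) / (elapsed_s * cores)


def looks_stalled(cpu: str, elapsed: str, cores: int = 1,
                  threshold: float = 0.5, min_elapsed_s: int = 600) -> bool:
    """Flag a task as stalled once it has run a while with a low ratio.

    `threshold` and `min_elapsed_s` are assumed values for illustration.
    """
    if to_seconds(elapsed) < min_elapsed_s:
        return False  # too early to judge; VM start-up uses little CPU
    return cpu_ratio(cpu, elapsed, cores) < threshold


if __name__ == "__main__":
    # 2-core ATLAS task: 19:42 CPU after 10:00 elapsed
    print(cpu_ratio("19:42", "10:00", cores=2), looks_stalled("19:42", "10:00", cores=2))
    # Theory task: 00:00:52 CPU after 00:34:07 elapsed
    print(cpu_ratio("00:00:52", "00:34:07"), looks_stalled("00:00:52", "00:34:07"))
```

A healthy multicore task should give a ratio close to 1 once it is past start-up; the stuck tasks described in this thread sit at a few percent.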
ID: 43174 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,126,074
RAC: 105,437
Message 43175 - Posted: 3 Aug 2020, 22:36:47 UTC
Last modified: 3 Aug 2020, 22:38:31 UTC

In your BOINC slots folder, under cernvm/shared, there is a file called runRivet.log.
The first line is the task info.
At the end of the file, as it grows, you can see how far the task has run; normally it runs up to 100k events before finishing.
ID: 43175 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43176 - Posted: 3 Aug 2020, 23:09:41 UTC - in response to Message 43174.  
Last modified: 3 Aug 2020, 23:32:10 UTC

Gunde,


That's the point

The CPU time stops increasing and never starts up again
ID: 43176 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43179 - Posted: 3 Aug 2020, 23:42:37 UTC - in response to Message 43175.  

??

Is that file structure on Linux? Because it doesn't exist on my Win10 machine.


I have a Boinc Data Folder

Under that I have Projects and Slots directories among others

Current Theory job is running at CPU 00:00:52, Elapsed 00:34:07, in slot 8.

Neither the “cernvm\shared” folder nor the runRivet.log file exists anywhere in the DATA directory.
ID: 43179 · Report as offensive     Reply Quote
Greger

Send message
Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 43181 - Posted: 4 Aug 2020, 1:16:34 UTC
Last modified: 4 Aug 2020, 1:27:56 UTC

Do you have the Extension Pack added to VirtualBox?
If you have that, click on the task in BOINC Manager, then look for "Show VM Console" in the left bar. It could be greyed out; if so, no session is open and the task is stuck.

But if it is clickable, a terminal should open and a login prompt screen should show up. If any critical error occurred, it is mostly posted there; if not, hit Alt+F2 to view the job screen. This is the best view to see whether it is running any 'events', and issues would appear there if there are any. You could also get top (system monitor) using Alt+F3.

If a task failed, info would show in stderr, but in some cases you could get more info from the console.

In this thread you can see how it looks inside the console: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161&postid=29359#29359
(Yeti's checklist)

For issues regarding Theory, you will find common errors in the Theory section at https://lhcathome.cern.ch/lhcathome/forum_forum.php?id=89

We would need a screenshot from the console or the stderr log to find any issue.
ID: 43181 · Report as offensive     Reply Quote