Message boards : ATLAS application : ATLAS problem - long running but not using any CPU
hsdecalc
Joined: 26 Jan 15
Posts: 10
Credit: 6,517,210
RAC: 0
Message 41503 - Posted: 9 Feb 2020, 10:26:01 UTC

Hi, in my case the cause is insufficient memory. See the post "..crappy task".
Jobs from the project "Amicable Numbers" are also running, and they need a huge amount of memory.
So I have to switch between these projects manually.
Why can't ATLAS detect the problem (not enough memory) and stop execution?

greg_be
Joined: 28 Dec 08
Posts: 318
Credit: 4,148,677
RAC: 2,010
Message 41506 - Posted: 9 Feb 2020, 18:18:32 UTC - in response to Message 41452.  

You need to change concurrent to 1 and processor usage to 4.
That usually solves the problem.
ATLAS is very much a memory hog.
I allocate 10,200 MB and it grabs an additional 90 MB of virtual memory, and I get a 50% success rate right now.
Not sure what is going on.
But anyway, try the above changes and see if that helps.
Also set your run time to around 8 hrs in BOINC.
I can usually chew up a good ATLAS task in 4-6 hrs, but sometimes it needs more time.

Others may have other advice to offer you; just scroll through this thread. The others and I have hashed this out to the Nth degree.

TSUI.Kak-Hee
Joined: 4 Aug 05
Posts: 8
Credit: 100,466
RAC: 0
Message 41593 - Posted: 15 Feb 2020, 8:33:50 UTC - in response to Message 41506.  
Last modified: 15 Feb 2020, 8:40:27 UTC

Hello, thanks for your info. I have the same problem with very long-running WUs that last for 2 days and more. I still had 3 long-running WUs, and I aborted them. I have no successful WUs so far. Where can I find the app_config.xml file? Many thanks!

hsdecalc
Joined: 26 Jan 15
Posts: 10
Credit: 6,517,210
RAC: 0
Message 41595 - Posted: 15 Feb 2020, 9:16:44 UTC - in response to Message 41593.  

A lot of information is here: Checklist Version 3 for Atlas@Home.
The XML file must be created if you want to modify some preferences.
Create it as ..:\BOINC\projects\lhcathome.cern.ch_lhcathome\app_config.xml.
Example:

<app_config>
  <app>
    <name>ATLAS</name>
    <max_concurrent>1</max_concurrent>
  </app>
  <!-- 3000 + (900 * Cores) = 2c=4800, 3c=5700, 4c=6600 -->
  <app_version>
    <app_name>ATLAS</app_name>
    <avg_ncpus>4.000000</avg_ncpus>
    <max_ncpus>4.000000</max_ncpus>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <cmdline>--memory_size_mb 6600</cmdline>
  </app_version>
  <project_max_concurrent>2</project_max_concurrent>
</app_config>
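
The arithmetic in the XML comment above can be written as a small helper. This is only a sketch of the formula quoted in this post (3000 MB base plus 900 MB per core); the function name atlas_memory_mb is made up for illustration, and the project may change the formula at any time.

```python
def atlas_memory_mb(cores: int) -> int:
    """Return the VM memory in MB using the formula quoted above:
    3000 + (900 * cores)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 3000 + 900 * cores


if __name__ == "__main__":
    # Print the value to put after --memory_size_mb for a few core counts.
    for cores in (2, 3, 4):
        print(f"{cores} cores -> --memory_size_mb {atlas_memory_mb(cores)}")
```

The value printed for your core count is what would go into the <cmdline> element, and <avg_ncpus> should be set to the same core count.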

TSUI.Kak-Hee
Joined: 4 Aug 05
Posts: 8
Credit: 100,466
RAC: 0
Message 41596 - Posted: 15 Feb 2020, 9:20:01 UTC - in response to Message 41595.  

Many thanks! I'll try it out!

TSUI.Kak-Hee
Joined: 4 Aug 05
Posts: 8
Credit: 100,466
RAC: 0
Message 41598 - Posted: 15 Feb 2020, 16:57:12 UTC
Last modified: 15 Feb 2020, 17:01:02 UTC

I don't think this is normal at all. It's the only task running in my BOINC right now, and the hardware is OK. Why is it so weird? I think it will be another dead long-runner. Every WU ends up in the same situation: when it reaches 99.999%, it stops progressing. The CPU time vs. elapsed time of those WUs looks just like the one below.
Here are the properties of the task:
==================================================================
Application: ATLAS Simulation 2.00 (vbox64_mt_mcore_atlas)
Name: MaDNDmInANwn9Rq4apoT9bVoABFKDmABFKDmHosUDmABFKDmw5Webm
State: Running
Received: 2/15/2020 10:44:26 PM
Report deadline: 2/23/2020 9:44:49 PM
Resources: 3 CPUs
Estimated computation size: 43,200 GFLOPs
CPU time: 00:00:45
CPU time since checkpoint: 00:00:00
Elapsed time: 00:38:17
Estimated time remaining: 04:31:45
Fraction done: 12.348%
Virtual memory size: 105.20 MB
Working set size: 5.57 GB
Directory: slots/3
Process ID: 1700
Progress rate: 19.440% per hour
Executable: vboxwrapper_26198ab7_windows_x86_64.exe
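
The mismatch between CPU time and elapsed time in the properties above can be expressed as one number. The sketch below is only an illustration: the helper names are invented here, and the ratio is a rough indicator, since a healthy ATLAS task also spends its first minutes copying the image and fetching data with little CPU use.

```python
def hms_to_seconds(text: str) -> int:
    """Convert an HH:MM:SS string, as shown in the task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_time: str, elapsed: str, ncpus: int) -> float:
    """Fraction of the available CPU time (elapsed * ncpus) that was actually used."""
    elapsed_s = hms_to_seconds(elapsed)
    if elapsed_s == 0 or ncpus < 1:
        raise ValueError("elapsed time and ncpus must be positive")
    return hms_to_seconds(cpu_time) / (elapsed_s * ncpus)


if __name__ == "__main__":
    # Values taken from the task properties listed above.
    ratio = cpu_efficiency("00:00:45", "00:38:17", 3)
    print(f"CPU efficiency: {ratio:.2%}")
```

A value that stays close to zero long after the task has started suggests the VM is idle rather than crunching.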

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 41599 - Posted: 15 Feb 2020, 22:48:30 UTC

2020-02-16 03:04:18 (9184): Required extension pack not installed, remote desktop not enabled.
You need to install the Extension Pack in VirtualBox to see more.
Yeti's Checklist in the ATLAS folder is also very useful for your first experience with ATLAS.
When an ATLAS task runs for more than 4-6 hours, something is wrong with the installation.

broz69
Joined: 28 Nov 08
Posts: 30
Credit: 14,604,005
RAC: 17,594
Message 41605 - Posted: 16 Feb 2020, 20:03:57 UTC - in response to Message 41599.  

Hi again,

Last weekend I observed the following behaviour on computer ID 10570926. The computer was shut down, and the next morning I turned it on. The shutdown procedure was nothing special (I didn't do anything special to the running LHC jobs through VBox). Some Theory and CMS jobs were running when I initiated the shutdown. When the machine came up, all the VMs started at the same time. I checked the VM console and all of them were in the emergency shell. I aborted the jobs (all of them at the same time). That's when the ATLAS jobs started, all at the same time. After a while I checked the VM console in BOINC and all of them were in the emergency shell. I aborted the jobs. This was Feb 9.

This weekend I activated my testing machine, ID 10616627. I installed a new Win10 1903 build 18362.657, BOINC 7.14.2 (x64) and VBox 6.1.2 r135662 (Qt5.6.2). When I pressed "Allow new tasks", BOINC downloaded about 16 Theory jobs and 4 ATLAS jobs. It started 4 Theory jobs at the same time. I checked the VM console in BOINC and all 4 jobs were in the emergency shell:
* Welcome to micro-Cern-VM
* Release 2018.10-1.cernvm.x86_64

[INF] Loading predefined modules... check
[INF] Starting networking... check
[INF] Getting time from pool.ntp.org... check
[INF] Mounting root filesystem...mount: mounting /dev/disk/by-label/UROOT on /root.rw failed: Input/output error
[ERR] Unable to mount root device /dev/disk/by-label/UROOT!
[INF] Entering rescue console
etc...

And this was exactly the same behaviour on both machines! Both machines have a SATA disk for LHC. One has a 750GB WD Black and the other a 320GB WD Blue (on a Standard SATA AHCI Controller, driver from Microsoft ver. 10.0.18362.1). My guess is that starting many VMs at the same time produces some kind of I/O errors, and the VMs then simply stay in that state. BOINC doesn't know it and simply lets them run forever.

Is there any setting that I can use to delay starting the VMs? It would also be interesting to know whether the cause is VBox or whether the OS in the VM has some timeouts that are too low...

On my test machine a Theory VM needs around 60-80 sec to copy the VDI image and then another 10-20 sec to start running. So in my case a 120 sec delay between starting different VMs would be OK. ATLAS, on the other hand, has a bigger VM image, so it takes a bit more time. The only thing is that I don't know where to set this up, if it's even possible. I know it's possible in Hyper-V, but I don't know how to do it in the BOINC/VBox combo...

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,888,480
RAC: 138,335
Message 41606 - Posted: 16 Feb 2020, 21:44:36 UTC - in response to Message 41605.  

Starting and stopping a computer that runs a couple of vbox tasks should be done manually as BOINC has no setting to define a delay.

To plan a shutdown:
- select a single task from your task list and pause it manually
- wait at least 30 s, better 1 min.
- continue with the next task until all tasks are paused.

To continue your tasks:
- Restart your BOINC client
- select a single task from the task list and click the resume button.
- wait at least 30 s, better 1 min.
- continue with the next task until all tasks are resumed.


If too many vbox tasks are paused or stopped concurrently at BOINC shutdown most disks can't catch up with the amount of data that must be written.
After 1 min the BOINC client will consider all data is written and will kill all child processes regardless of their real state.
The problems become visible at next restart of a VM when it tries to access an incomplete disk image.
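
The manual procedure above could be scripted. The following is a rough sketch, not an official tool: it assumes boinccmd is on the PATH and accepts the form "boinccmd --task URL task_name suspend|resume", that the project URL is the one shown in the constant, and that you supply the task names yourself (for example from "boinccmd --get_tasks"). The task names in the example are placeholders.

```python
import subprocess
import time
from typing import List, Sequence

# Assumption: master URL of the project as registered in your BOINC client.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def build_commands(task_names: Sequence[str], operation: str,
                   project_url: str = PROJECT_URL) -> List[List[str]]:
    """Build one boinccmd call per task for 'suspend' or 'resume'."""
    if operation not in ("suspend", "resume"):
        raise ValueError("operation must be 'suspend' or 'resume'")
    return [["boinccmd", "--task", project_url, name, operation]
            for name in task_names]


def run_staggered(task_names: Sequence[str], operation: str,
                  delay_s: int = 60, dry_run: bool = True) -> None:
    """Apply the operation to one task at a time, waiting between tasks."""
    commands = build_commands(task_names, operation)
    for index, command in enumerate(commands):
        if dry_run:
            # Only show what would be executed; no waiting in a dry run.
            print(" ".join(command))
            continue
        subprocess.run(command, check=True)
        if index < len(commands) - 1:
            time.sleep(delay_s)  # give the disk time to finish writing


if __name__ == "__main__":
    # Placeholder task names; replace them with the names from your task list.
    run_staggered(["example_task_0", "example_task_1"], "suspend")
```

The default delay of 60 s follows the "better 1 min" advice; on a slow disk a longer delay may be safer.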

broz69
Joined: 28 Nov 08
Posts: 30
Credit: 14,604,005
RAC: 17,594
Message 41607 - Posted: 16 Feb 2020, 22:34:43 UTC - in response to Message 41606.  

Hello,

Thank you for your answer. This behaviour is exactly what I've seen in the last 4 hours with the test machine on and no shutdown. BOINC Manager was starting some VBox VMs, stopping others, and in the meantime made a bit of a mess. I have 2 Theory jobs with status "Postponed"; in the VBox manager they are defined/created, but only partially: both of them have no disk attached, and under "Storage" the disk part is empty.

Then at a certain point BOINC Manager decided to switch jobs from Theory to ATLAS. So it paused all Theory jobs and started an ATLAS job. I now have one ATLAS job (d4HODmjyBNwn9Rq4apoT9bVoABFKDmABFKDmbQmVDmABFKDmypIWIn_1) that is defined in VBox, and when it's running I can see three different screens in the VM console in BOINC Manager using alt+F1, alt+F2 and alt+F3. The only problem is that alt+F2 (ATLAS Event Progress Monitoring) shows a progress screen where all the numbers are N/A. It seems that the VM started but somehow failed to trigger the start of calculations. BOINC Manager shows the job as running.

I can't say that what you are saying about the behaviour of BOINC Manager is desirable. But at least I know I have to be careful when shutting down the computer.

Thank you for your effort and explaining this to me.

broz69
Joined: 28 Nov 08
Posts: 30
Credit: 14,604,005
RAC: 17,594
Message 41608 - Posted: 16 Feb 2020, 22:44:35 UTC - in response to Message 41607.  

Correction: it seems the ATLAS job needed almost 20 min to get the data, and while I was writing the answer above it started crunching numbers. So it's not stalled...

broz69
Joined: 28 Nov 08
Posts: 30
Credit: 14,604,005
RAC: 17,594
Message 41610 - Posted: 17 Feb 2020, 8:19:55 UTC - in response to Message 41608.  

Hi,

Another thing I noticed: an ATLAS job needs more than 160 sec from the moment I resume it in BOINC Manager until it starts running. During this time the disk is 100% active. BOINC is set up to read the image from BOINC\projects\lhcathome.cern.ch_lhcathome and copy it to BOINC\slots, so all the read and write operations hit one physical disk. The disk is a WDC WD3200BEVT-22ZCT0.
Another solution for me would be to tell VBox/BOINC that the repository of LHC images is on one disk and the working set of disks is somewhere else. I have three other hard disks that I could use to spread the disk load...

Best regards.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,888,480
RAC: 138,335
Message 41611 - Posted: 17 Feb 2020, 8:47:50 UTC - in response to Message 41610.  

If you have enough RAM the vdi file from projects\lhcathome.cern.ch_lhcathome will be read only once and then remains in the disk cache.
Otherwise you may check whether you can create \slots\ on a separate disk/volume and mount that volume as a directory.

Be aware that LHC@home can deal with data that are spread over multiple filesystems (at least on linux) but other BOINC projects, e.g. Primegrid, expect all data on the same filesystem.

TSUI.Kak-Hee
Joined: 4 Aug 05
Posts: 8
Credit: 100,466
RAC: 0
Message 41613 - Posted: 17 Feb 2020, 18:07:59 UTC
Last modified: 17 Feb 2020, 18:53:05 UTC

(task info:LHC@home 2.00 ATLAS Simulation (vbox64_mt_mcore_atlas) I6oMDmbUvNwn9Rq4apoT9bVoABFKDmABFKDmyAPODmABFKDmOJouBo_0 00:09:38 (00:00:37) 6.5 2.151 07:18:52 24/2/2020 23:57:40 2C Running DESKTOP-HI7SD4Q)

Hi everyone. I found these lines in VBox, in the task log. They may explain the low CPU time and the dead long-runner:

00:00:30.109769 VMMDev: Guest Log: Checking CVMFS...
00:00:32.145923 VMMDev: Guest Log: Failed to check CVMFS, check output from cvmfs_config probe:
00:00:32.283947 VMMDev: Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
00:00:32.347130 VMMDev: Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
00:00:32.448251 VMMDev: Guest Log: Probing /cvmfs/grid.cern.ch... Failed!
00:00:45.711948 VMMDev: Guest Log: VBoxService 5.2.32 r132073 (verbosity: 0) linux.amd64 (Jul 12 2019 10:32:28) release log
00:00:45.711967 VMMDev: Guest Log: 00:00:00.000157 main Log opened 2020-02-18T01:57:50.219551000Z
00:00:45.712037 VMMDev: Guest Log: 00:00:00.000264 main OS Product: Linux
00:00:45.712067 VMMDev: Guest Log: 00:00:00.000297 main OS Release: 3.10.0-957.27.2.el7.x86_64
00:00:45.712090 VMMDev: Guest Log: 00:00:00.000321 main OS Version: #1 SMP Mon Jul 29 17:46:05 UTC 2019
00:00:45.712116 VMMDev: Guest Log: 00:00:00.000345 main Executable: /opt/VBoxGuestAdditions-5.2.32/sbin/VBoxService
00:00:45.712121 VMMDev: Guest Log: 00:00:00.000346 main Process ID: 1911
00:00:45.712124 VMMDev: Guest Log: 00:00:00.000346 main Package type: LINUX_64BITS_GENERIC
00:00:45.714479 VMMDev: Guest Log: 00:00:00.002706 main 5.2.32 r132073 started. Verbose level = 0
00:00:45.715899 VMMDev: Guest Log: 00:00:00.004078 main Error: Service 'control' failed to initialize: VERR_INVALID_PARAMETER
00:00:45.716028 VMMDev: Guest Log: 00:00:00.004245 main Session 0 is about to close ...
00:00:45.716053 VMMDev: Guest Log: 00:00:00.004273 main Stopping all guest processes ...
00:00:45.716077 VMMDev: Guest Log: 00:00:00.004297 main Closing all guest files ...
00:00:45.719211 VMMDev: Guest Log: 00:00:00.007423 main Ended.
00:00:45.719410 VMMDev: Guest Additions capability report: (0x0 -> 0x0) seamless: no, hostWindowMapping: no, graphics: no




Still, I can't figure out how to fix this, and I'm pretty sure the needed ports are open.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,888,480
RAC: 138,335
Message 41614 - Posted: 17 Feb 2020, 19:34:49 UTC - in response to Message 41613.  

You may upgrade VirtualBox to a more recent version.
Since v6.1.2 causes problems on some hosts you may use v6.0.16 instead.

Your host https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10632069 is listed with 8GB RAM but it tries to start VMs with 14GB (!!)
https://lhcathome.cern.ch/lhcathome/result.php?resultid=263965750
2020-02-18 01:42:50 (10048): Setting Memory Size for VM. (14000MB)
2020-02-18 01:42:50 (10048): Setting CPU Count for VM. (2)



Your host https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10632657 is configured to start a 3-core-setup using 6600MB RAM which doesn't match the official RAM formula:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=263925839
2020-02-17 21:53:27 (6448): Setting Memory Size for VM. (6600MB)
2020-02-17 21:53:28 (6448): Setting CPU Count for VM. (3)



You may go through Yeti's checklist and adjust your preferences, and maybe the settings in an app_config.xml, according to the tips given there.
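
The two checks made by hand above (VM memory against host RAM, and VM memory against the core count) can be automated on the stderr lines of a result. This is a sketch under assumptions: it takes "3000 + 900 * cores" from the app_config.xml comment earlier in this thread as the RAM formula meant here, and the function name and the host RAM value passed in are illustrative.

```python
import re
from typing import List, Optional

MEMORY_RE = re.compile(r"Setting Memory Size for VM\. \((\d+)MB\)")
CPU_RE = re.compile(r"Setting CPU Count for VM\. \((\d+)\)")


def check_vm_settings(log_text: str, host_ram_mb: Optional[int] = None) -> List[str]:
    """Return a list of problems found in the VM settings logged by vboxwrapper."""
    memory = MEMORY_RE.search(log_text)
    cpus = CPU_RE.search(log_text)
    if memory is None or cpus is None:
        raise ValueError("memory or CPU setting line not found in log")
    memory_mb = int(memory.group(1))
    cores = int(cpus.group(1))
    # Assumption: formula quoted in the app_config.xml comment in this thread.
    expected_mb = 3000 + 900 * cores
    problems = []
    if memory_mb != expected_mb:
        problems.append(
            f"{cores} cores expect {expected_mb} MB, but the VM is set to {memory_mb} MB")
    if host_ram_mb is not None and memory_mb > host_ram_mb:
        problems.append(
            f"VM memory {memory_mb} MB exceeds host RAM {host_ram_mb} MB")
    return problems


if __name__ == "__main__":
    sample = (
        "2020-02-17 21:53:27 (6448): Setting Memory Size for VM. (6600MB)\n"
        "2020-02-17 21:53:28 (6448): Setting CPU Count for VM. (3)\n"
    )
    for problem in check_vm_settings(sample):
        print(problem)
```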

TSUI.Kak-Hee
Joined: 4 Aug 05
Posts: 8
Credit: 100,466
RAC: 0
Message 41620 - Posted: 18 Feb 2020, 11:04:36 UTC - in response to Message 41614.  

Thank you and greg_be so very much for your help! I have now finally figured out the basics of getting it to work. With your help, I finally got the first batch of tasks done. You are so helpful!
Some key points that mattered a lot from my point of view:
1. When using LeoMoon CPU-V, not only the 2 big green ticks have to appear; the 3 small marks below them also have to be ticks, not "X". One of my machines had just the 2 big green ticks but not the 3 small ones, so every task failed, like this: https://lhcathome.cern.ch/lhcathome/result.php?resultid=263633925
2. Memory, disk space and app_config.xml. Go through Yeti's checklist to make sure you have enough disk and memory for the task. ONE TASK AT A TIME is great for starters. The values in app_config.xml have to follow the official calculation, as user computezrmle said above. Otherwise, the task won't start properly.
3. Ports and internet environment. Make sure the needed ports are open (usually they are). I tested 2 different internet environments. One used a mobile hotspot over Wi-Fi (the phone is a Huawei P30 and the data plan is ChinaUnicom 4G), which gives the task a 192.169.XXX.XXX DNS address. In this situation, I found in the log that the task can't probe CVMFS properly, and it never really starts (progress advances in BOINC, but in eFMer's BoincTasks I saw nearly 0% CPU usage), and eventually the task becomes a 3/4/5-day dead long-runner.
The logs look like this:

2020-02-18 01:43:34 (10048): Guest Log: Checking CVMFS...
2020-02-18 01:43:36 (10048): Guest Log: Failed to check CVMFS, check output from cvmfs_config probe:
2020-02-18 01:43:36 (10048): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2020-02-18 01:43:36 (10048): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2020-02-18 01:43:36 (10048): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!


After changing to another internet environment (ChinaTelecom Macau, iPhone 6s wired hotspot, with a DNS address like 172.XXX.XXX.XXX), the tasks seemed to read CVMFS fine, and they start and finish successfully:

2020-02-18 13:37:10 (9824): Guest Log: Checking CVMFS...
......
2020-02-18 13:37:58 (9824): Guest Log: CVMFS is ok
2020-02-18 13:37:59 (9824): Guest Log: Mounting shared directory
2020-02-18 13:37:59 (9824): Guest Log: Copying input files
2020-02-18 13:38:02 (9824): Guest Log: Copied input files into RunAtlas.
2020-02-18 13:38:05 (9824): Guest Log: copied the webapp to /var/www
2020-02-18 13:38:05 (9824): Guest Log: This vm does not need to setup an http proxy
2020-02-18 13:38:05 (9824): Guest Log: ATHENA_PROC_NUMBER=3
2020-02-18 13:38:05 (9824): Guest Log: *** Starting ATLAS job. (PandaID=4647590640 taskID=20597032) ***
......
2020-02-18 17:25:23 (9824): Guest Log: *** Success! Shutting down the machine. ***


Make sure you see the "Starting ATLAS job" line; otherwise you should check all memory and port settings. It is best to go through Yeti's checklist in detail.
I'll keep testing to find out where the port/internet environment problems truly are. The things mentioned above might give a hint.

* Notice the line " *** Starting ATLAS job. ". If you don't see this line in the logs, your task has not really started.
* Another way to check whether the task has really started is to look at Windows Task Manager. When the task progress goes beyond 10~15% in BOINC Manager, the task should show very high power usage in Windows Task Manager.
- Where to find the logs? Open VBox, select the running VM, right-click it and click "Show Log". You will see the lines above if your task runs well. Logs can also be found in :/program data/boinc/slots/, in the slot folder your task is in.
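
The log checks described above can be done with a short script instead of reading the log by eye. This is a minimal sketch that only looks for the marker strings quoted in this post; the function name is invented here, and it works on any list of log lines, whichever log file you read them from.

```python
from typing import Dict, Iterable, List, Union


def summarize_guest_log(lines: Iterable[str]) -> Dict[str, Union[bool, List[str]]]:
    """Scan guest log lines for the markers quoted in this post."""
    lines = list(lines)
    # Repositories whose probe failed, e.g. /cvmfs/atlas.cern.ch
    failed = [
        line.split("Probing ", 1)[1].split("...", 1)[0]
        for line in lines
        if "Probing " in line and "Failed!" in line
    ]
    return {
        "cvmfs_failed": failed,
        "cvmfs_ok": any("CVMFS is ok" in line for line in lines),
        "job_started": any("Starting ATLAS job" in line for line in lines),
        "success": any("Success! Shutting down the machine" in line for line in lines),
    }


if __name__ == "__main__":
    # Sample lines copied from the failing log shown above.
    sample = [
        "2020-02-18 01:43:34 (10048): Guest Log: Checking CVMFS...",
        "2020-02-18 01:43:36 (10048): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!",
        "2020-02-18 01:43:36 (10048): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!",
    ]
    print(summarize_guest_log(sample))
```

If "job_started" is false and "cvmfs_failed" is not empty, the task matches the dead long-runner pattern described in point 3.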

Good luck to all of you,
and thanks again for your great help!