Message boards : Theory Application : Theory Tasks on various hosts failing since last night

Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,739,398
RAC: 81,620
Message 50705 - Posted: 6 Oct 2024, 5:58:10 UTC

This morning I noticed that on some hosts crunching Theory, tasks have failed after various runtimes, with different errors shown in stderr.
Examples are:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414779420
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414775111
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414774639
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414779729

What could be the reason for this problem?

FYI, on several other hosts CMS is running WITHOUT any problems.
ID: 50705
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2681
Credit: 286,839,635
RAC: 63,998
Message 50706 - Posted: 6 Oct 2024, 7:04:05 UTC - in response to Message 50705.  

Found this:
<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
couldn't start app: Task file Theory_2024_04_30_prod.xml: file has the wrong size</message>
]]>

Next task:
<core_client_version>7.24.1</core_client_version>
<![CDATA[
<message>
couldn't start app: Task file Theory_2024_04_30_prod.xml: file missing</message>
]]>

A few tasks later:
VBoxManage.exe: error: Failed to write screenshot to file 'C:\ProgramData\BOINC\slots\1/vbox_screenshot.png' (VERR_DISK_FULL).


You may need to
- reset the project
- clean the VirtualBox environment
- check the available disk space and BOINC's disk quota
- resume the project
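
A minimal sketch of the disk-space check from the list above. It assumes the default Windows data directory seen in the error message (C:\ProgramData\BOINC); the helper name `free_gib` and the 10 GiB threshold are illustrative only, not project values. BOINC's own disk quota is configured in the client's computing preferences and is not read here.

```python
import shutil
from pathlib import Path

def free_gib(path) -> float:
    """Free space, in GiB, on the volume that holds `path`."""
    return shutil.disk_usage(str(path)).free / 1024**3

# Assumption: the default Windows data directory seen in the error message.
# Fall back to the current directory so the sketch runs on other systems too.
boinc_dir = Path(r"C:\ProgramData\BOINC")
target = boinc_dir if boinc_dir.exists() else Path(".")

free = free_gib(target)
print(f"{target}: {free:.1f} GiB free")
if free < 10:  # hypothetical threshold, not a project value
    print("Low on space: VM images and screenshots may fail to write.")
```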
ID: 50706
PDW

Joined: 7 Aug 14
Posts: 27
Credit: 10,000,924
RAC: 0
Message 50707 - Posted: 6 Oct 2024, 7:50:38 UTC - in response to Message 50706.  

One for the programmer to fix...

VBoxManage -q controlvm "boinc_4db4d3640e56fc7f" screenshotpng "C:\ProgramData\BOINC\slots\1/vbox_screenshot.png"
ID: 50707
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2681
Credit: 286,839,635
RAC: 63,998
Message 50708 - Posted: 6 Oct 2024, 8:05:54 UTC - in response to Message 50707.  

Nope.
It is a valid command.
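
As a small illustration of why the mixed separators are harmless: Windows path handling generally treats both "\" and "/" as separators, and Python's pathlib models this in `PureWindowsPath`. The sketch below only shows how such a path is interpreted; it says nothing about how VBoxManage parses its arguments internally.

```python
from pathlib import PureWindowsPath

# Path copied from the VBoxManage command quoted above (note the final "/").
mixed = PureWindowsPath(r"C:\ProgramData\BOINC\slots\1/vbox_screenshot.png")

print(mixed)        # both separator styles are normalized to backslashes
print(mixed.name)   # the last component is the screenshot file name
print(mixed.parts)  # the "/" split the path just like a "\" would
```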
ID: 50708
PDW

Joined: 7 Aug 14
Posts: 27
Credit: 10,000,924
RAC: 0
Message 50709 - Posted: 6 Oct 2024, 8:25:07 UTC - in response to Message 50708.  

Indeed !
In VBoxManage, awful.
ID: 50709
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,739,398
RAC: 81,620
Message 50710 - Posted: 6 Oct 2024, 11:55:02 UTC - in response to Message 50706.  

Found this:
...
A few tasks later:
VBoxManage.exe: error: Failed to write screenshot to file 'C:\ProgramData\BOINC\slots\1/vbox_screenshot.png' (VERR_DISK_FULL).

You may need to
- reset the project
- clean the VirtualBox environment
- check the available disk space and BOINC's disk quota
- resume the project
Many thanks, computezrmle, for investigating. What I have meanwhile noticed is that some of the recent Theory tasks ("Herwig7") consume considerably more disk space than all the others so far.
In my specific case, the problem is that I had moved the slots folder to a 7GB ramdisk. With 2 tasks running simultaneously, 7GB is not enough. When these specific tasks start, a huge download begins and creates a 6GB vdi in the slots folder / boinc ... / Snapshots.
Further, console 2 does not show the usual progress as a number of events processed, but rather a kind of horizontal percentage bar starting with "Integrate 12 of 760" (for example).

So at the moment, for testing purposes, I run only 1 Theory task at a time on one of the affected hosts. In future I could increase the ramdisk to something like 12.5 or 13GB, but that might cause problems with the remaining system RAM, since total RAM is 16GB. I can only try and see what the outcome is.
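
The sizing can be sketched as simple arithmetic. This assumes each Herwig7 task needs roughly the 6GB vdi mentioned above and ignores everything else in the slot folder, so real usage will be somewhat higher; the function name `fits` is just for illustration.

```python
VDI_GB = 6.0         # approximate vdi size per Herwig7 task (from the post)
TOTAL_RAM_GB = 16.0  # total system RAM (from the post)

def fits(ramdisk_gb: float, tasks: int, per_task_gb: float = VDI_GB) -> bool:
    """True if `tasks` images of `per_task_gb` each fit on the ramdisk."""
    return tasks * per_task_gb <= ramdisk_gb

# Current 7GB ramdisk: one task fits, two do not.
for tasks in (1, 2):
    print(f"7 GB ramdisk, {tasks} task(s): fits = {fits(7.0, tasks)}")

# Candidate sizes: two tasks fit, but little RAM is left for the system.
for size in (12.5, 13.0):
    left = TOTAL_RAM_GB - size
    print(f"{size} GB ramdisk, 2 tasks: fits = {fits(size, 2)}, RAM left = {left} GB")
```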
ID: 50710
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,739,398
RAC: 81,620
Message 50711 - Posted: 6 Oct 2024, 13:05:12 UTC - in response to Message 50710.  

I have just realized that such a Herwig7 task will not finish within the 10-day deadline set by the project (in fact, it would take between 15 and 18 days).
My hosts on which I run Theory are pretty old and slow, with 2-core (+2 HT) CPUs - that's why so far I have used them for no sub-project other than Theory.

Running the current 4-core CMS on an old 2-core + 2 HT CPU might not be the best idea; I think I'd rather try ATLAS with a 2- or 3-core setting.
Any comments or other ideas?
ID: 50711
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1461
Credit: 9,855,658
RAC: 2,747
Message 50713 - Posted: 6 Oct 2024, 14:22:14 UTC - in response to Message 50711.  

Any comments or other ideas ?
@Erich: I looked at some of my results:
a pythia8 result reported job diskusage=3380, but the result shows a peak disk usage of 1.11 GB.
A herwig7 result reported job diskusage=7280, and the result shows a peak disk usage of 1.52 GB.

Here are 2 ATLAS tasks from my laptop, which has a CPU similar to yours, running on three threads:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=414728551
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414707625
ID: 50713
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,739,398
RAC: 81,620
Message 50715 - Posted: 6 Oct 2024, 15:23:48 UTC - in response to Message 50713.  
Last modified: 6 Oct 2024, 15:30:31 UTC

Crystal Pellet, many thanks for the information.

Since no ATLAS tasks are available at the moment, I started a CMS task on one of these 2-core / 2-HT hosts. So I'll see how long it takes to finish. I am afraid it will take a very long time.
BTW, for some reason, Theory tasks are also not available right now.
ID: 50715
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1461
Credit: 9,855,658
RAC: 2,747
Message 50718 - Posted: 6 Oct 2024, 16:59:17 UTC - in response to Message 50715.  

.... I started a CMS task on one of these 2-core / 2-HT hosts. So I'll see how long it takes it go get finished. I am afraid very long time.
Be aware that CMS has a job duration of about 12 hours up to a maximum of 18 hours.
After 18 hours your task will be killed, even if it's the first job running inside the VM.
That would be a pity, because no valid scientific work would be returned (and probably no credit granted).
ID: 50718
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,739,398
RAC: 81,620
Message 50719 - Posted: 6 Oct 2024, 17:23:17 UTC - in response to Message 50718.  
Last modified: 6 Oct 2024, 18:05:50 UTC

thank you for the information, I was not aware of this detail.
So I'll know tomorrow morning :-)
Too bad that no ATLAS tasks are around :-(
ID: 50719
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,739,398
RAC: 81,620
Message 50722 - Posted: 7 Oct 2024, 3:53:20 UTC - in response to Message 50719.  

thank you for the information, I was not aware of this detail.
So I'll know tomorrow morning :-)
...
the CMS task was over in 12 hours - not so bad for this old CPU :-)
ID: 50722
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1461
Credit: 9,855,658
RAC: 2,747
Message 50748 - Posted: 8 Oct 2024, 17:08:37 UTC - in response to Message 50710.  
Last modified: 9 Oct 2024, 5:14:46 UTC

What I meanwhile noticed is that some of the recent Theory tasks ("Herwig7") are consuming considerably more disk space than all the other ones so far.
........
When these specific tasks start, a huge download begins to create a 6GB vdi in the slots file / boinc ... / Snapshots.
Further, console2 does not show the usual progress with number of events processed, but rather kind of horizontal percentage bars starting with "Integrate 12 of 760", (for example)
You're right. Herwig7 shows behaviour I had not noticed before.
These new 760 integration steps, each with 4 iterations going to 100%, take a very long time.
After more than 150 minutes of runtime, 8 out of 760 integration steps are done. In the end that should result in the processing of 6000 events. I'll see how long that takes. https://lhcathome.cern.ch/lhcathome/result.php?resultid=414815923
===> [runRivet] Tue Oct 8 14:12:41 UTC 2024 [boinc pp z1j 8000 30 - herwig7 7.2.1 nlo 6000 142]
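
A rough linear extrapolation from those numbers, assuming every integration step takes about as long as the first eight (which may well not hold). Since the runtime was "over 150 minutes", the result is more of a lower bound. The function name `projected_hours` is made up for this sketch.

```python
def projected_hours(elapsed_min: float, done: int, total: int) -> float:
    """Linear estimate of the total time for all steps, in hours."""
    if done <= 0:
        raise ValueError("need at least one finished step")
    return elapsed_min / done * total / 60

# 8 of 760 integration steps finished after about 150 minutes.
hours = projected_hours(150, 8, 760)
print(f"integration phase: about {hours:.1f} h ({hours / 24:.1f} days)")
```

On that estimate the integration phase alone would come close to a 10-day deadline on this host, before any events are processed.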

Btw: The slot folder for this task contains 5.74GB of data, of which the differencing vdi image in the snapshot folder accounts for 6019072 kB and is increasing very slowly.

EDIT: 09-Oct-2024 00:17:08 [LHC@home] Aborting task Theory_2794-3267759-142_1: exceeded disk limit: 9321.23MB > 7629.39MB

For Theory a total of 8,000,000,000 bytes in the slot-folder is allowed.
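
The two limits are the same number in different units, as this small conversion sketch shows. It assumes the "MB" in the client message means MiB (1024 * 1024 bytes) and the "kB" of the vdi size means KiB.

```python
MIB = 1024**2  # bytes per MiB, and also KiB per GiB

# Allowed slot-folder size for Theory, in bytes (from the post).
limit_bytes = 8_000_000_000
limit_mib = limit_bytes / MIB
print(f"limit: {limit_mib:.2f} MiB")

# Usage reported by the client when the task was aborted.
used_mib = 9321.23
print(f"over the limit by {used_mib - limit_mib:.2f} MiB")

# Size of the differencing vdi image, given in KiB.
vdi_kib = 6_019_072
print(f"vdi image: {vdi_kib / MIB:.2f} GiB")
```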
ID: 50748
Harri Liljeroos
Joined: 28 Sep 04
Posts: 780
Credit: 59,888,811
RAC: 47,724
Message 50751 - Posted: 9 Oct 2024, 8:09:47 UTC - in response to Message 50748.  
Last modified: 9 Oct 2024, 8:19:16 UTC

The 760 integration steps take about 60-70 hours to run on my new Ryzen 9 7950X. For my i7-7820X I estimated that this would take about 140 hours. In addition to the integration phase, the tasks then run the normal event-crunching phase as previous Theory tasks do.

Unfortunately these new tasks are quite prone to 'Missing heartbeat' errors. I've already lost at least 5 tasks to that.

[edit] I've finished one task successfully so far: https://lhcathome.cern.ch/lhcathome/result.php?resultid=414777076
ID: 50751
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,739,398
RAC: 81,620
Message 50752 - Posted: 9 Oct 2024, 8:17:39 UTC - in response to Message 50748.  

I don't think it was a good idea to send out these Herwig7 tasks.

I stopped downloading Theory tasks and switched to CMS, which runs okay even on the old and weak CPUs. On one of them I am trying ATLAS - the task is still in progress, so I cannot tell yet how well (or not) it finally works.
ID: 50752
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2681
Credit: 286,839,635
RAC: 63,998
Message 50753 - Posted: 9 Oct 2024, 8:25:09 UTC - in response to Message 50751.  

Please check at console 2 if the VM uses lots of RAM/swap.
If so, you may try a higher RAM value via app_config.xml.
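
A hedged sketch of what such an app_config.xml could look like, generated with Python so the structure is easy to check. The app name "Theory", the plan class "vbox64_theory" and the value 768 are assumptions for illustration; please verify the real names in client_state.xml before using anything like this. It also assumes the VM wrapper honours `--memory_size_mb` passed via `<cmdline>`.

```python
import xml.etree.ElementTree as ET

def build_app_config(app_name: str, plan_class: str, ncpus: int, memory_mb: int) -> str:
    """Build a minimal app_config.xml with one <app_version> entry."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # Assumption: the wrapper reads the VM memory size from this option.
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")

# All four values below are examples, not confirmed project settings.
xml_text = build_app_config("Theory", "vbox64_theory", 1, 768)
print(xml_text)
```

The file would go into the project's folder inside the BOINC data directory, and the client has to re-read its config files (or be restarted) before the change applies to newly started tasks.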
ID: 50753
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,739,398
RAC: 81,620
Message 50754 - Posted: 9 Oct 2024, 8:29:33 UTC - in response to Message 50753.  

Please check at console 2 if the VM uses lots of RAM/swap.
If so, you may try a higher RAM value via app_config.xml.
hm, how can I see this in console 2?
FYI, the current app_config.xml setting is: 3 cores, 5700MB RAM.
ID: 50754
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2681
Credit: 286,839,635
RAC: 63,998
Message 50756 - Posted: 9 Oct 2024, 8:42:31 UTC - in response to Message 50754.  

hm, how can I see this in console 2?
FYI, the current app_config.xml setting is: 3 cores, 5700MB RAM.

Sorry, typo.
I meant the top output on Console 3.

FYI, the current app_config.xml setting is: 3 cores, 5700MB RAM.

I guess that's for your CMS VMs.
Your Theory VMs are configured to use 630 MB and 1 core.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=414769302
2024-10-06 01:19:13 (5924): Setting Memory Size for VM. (630MB)
2024-10-06 01:19:13 (5924): Setting CPU Count for VM. (1)
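
To check many results without reading each stderr by hand, the two values can be pulled out with a small regex sketch like this. It assumes the stderr lines always have exactly the wording shown above; the variable names are arbitrary.

```python
import re

# The two stderr lines quoted above.
STDERR = """\
2024-10-06 01:19:13 (5924): Setting Memory Size for VM. (630MB)
2024-10-06 01:19:13 (5924): Setting CPU Count for VM. (1)
"""

mem = re.search(r"Setting Memory Size for VM\. \((\d+)MB\)", STDERR)
cpu = re.search(r"Setting CPU Count for VM\. \((\d+)\)", STDERR)

# Fall back to None if a line is missing from the log.
vm = {
    "memory_mb": int(mem.group(1)) if mem else None,
    "cpus": int(cpu.group(1)) if cpu else None,
}
print(vm)
```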
ID: 50756
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1461
Credit: 9,855,658
RAC: 2,747
Message 50757 - Posted: 9 Oct 2024, 8:48:59 UTC - in response to Message 50753.  

Please check at console 2 if the VM uses lots of RAM/swap.
If so, you may try a higher RAM value via app_config.xml.
(Console 3:) I had already set 768 MB for the Theory VMs. A running Herwig7 shows within the VM:
KiB Mem: 744976 total, 65940 free, 387080 used 291020 buff/cache
KiB Swap: 1048572 total, 871920 free. 176652 used. 203840 avail Mem
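
For anyone who wants to turn those two summary lines into numbers, here is a small parsing sketch. It assumes the lines are copied exactly as shown above; the regex ignores the punctuation between fields, and the helper name `parse_line` is made up.

```python
import re

# The two summary lines from top, as quoted above.
TOP = """\
KiB Mem: 744976 total, 65940 free, 387080 used 291020 buff/cache
KiB Swap: 1048572 total, 871920 free. 176652 used. 203840 avail Mem
"""

def parse_line(line: str) -> dict:
    """Map each label (total, free, used, ...) to its value in KiB."""
    pairs = re.findall(r"(\d+)\s+(total|free|used|buff/cache|avail Mem)", line)
    return {label: int(value) for value, label in pairs}

mem_line, swap_line = TOP.splitlines()
mem, swap = parse_line(mem_line), parse_line(swap_line)

swap_pct = 100 * swap["used"] / swap["total"]
print(f"RAM total: {mem['total'] / 1024:.0f} MiB, available: {swap['avail Mem'] / 1024:.0f} MiB")
print(f"swap in use: {swap['used'] / 1024:.0f} MiB ({swap_pct:.1f}% of swap)")
```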
ID: 50757
Erich56

Joined: 18 Dec 15
Posts: 1908
Credit: 144,739,398
RAC: 81,620
Message 50758 - Posted: 9 Oct 2024, 8:52:40 UTC - in response to Message 50756.  

...
I guess that's for your CMS VMs.
No, for ATLAS VMs. I (wrongly) thought you were referring to my message saying that I am trying an ATLAS task right now on one of these old machines.
In fact, I was not even aware that Theory can now run on more than 1 core again; I think this was the case a long time ago (if I remember right, but I might be mistaken), and then Theory was switched back to 1 core only.
Also, I have now checked the app_config.xml files on various hosts - none of them has a setting for RAM in MB. So obviously I have never set this before.
ID: 50758


©2025 CERN