Theory Tasks on various hosts failing since last night
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1908 Credit: 144,739,398 RAC: 81,620 |
This morning I noticed that on some hosts crunching Theory, tasks have failed after various timespans, with different errors shown in stderr. Examples are: https://lhcathome.cern.ch/lhcathome/result.php?resultid=414779420 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414775111 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414774639 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414779729 What could be the reason for this problem? FYI, on several other hosts CMS is running WITHOUT any problems. |
Joined: 15 Jun 08 Posts: 2681 Credit: 286,839,635 RAC: 63,998 |
Found this: <core_client_version>7.24.1</core_client_version> <![CDATA[ <message> couldn't start app: Task file Theory_2024_04_30_prod.xml: file has the wrong size</message> ]]> Next task: <core_client_version>7.24.1</core_client_version> <![CDATA[ <message> couldn't start app: Task file Theory_2024_04_30_prod.xml: file missing</message> ]]> A few tasks later: VBoxManage.exe: error: Failed to write screenshot to file 'C:\ProgramData\BOINC\slots\1/vbox_screenshot.png' (VERR_DISK_FULL). You may need to - reset the project - clean the VirtualBox environment - check the available disk space and BOINC's disk quota - resume the project |
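The `VERR_DISK_FULL` error and the advice to check the available disk space can be followed up with a quick free-space check. This is only a minimal sketch: the function name `free_space_report` is hypothetical, the path must be adjusted to your own BOINC data directory, and the 8,000,000,000 byte figure is the Theory slot limit quoted later in this thread. It looks at the filesystem only and does not read BOINC's own disk quota settings.

```python
import shutil


def free_space_report(path, required_bytes):
    """Return (free_bytes, enough) for the filesystem that holds `path`.

    `required_bytes` is whatever amount you want to reserve per task;
    it is an input chosen by the caller, not something BOINC reports here.
    """
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return usage.free, usage.free >= required_bytes


# Illustrative value: the Theory slot limit mentioned later in this thread.
THEORY_SLOT_LIMIT_BYTES = 8_000_000_000

# "." is a placeholder; point this at the drive holding the BOINC slots folder.
free, enough = free_space_report(".", THEORY_SLOT_LIMIT_BYTES)
print(f"free: {free / 1024**3:.2f} GiB, room for one full Theory slot: {enough}")
```

With two tasks running at once, the same check could be repeated with twice the limit.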
Joined: 7 Aug 14 Posts: 27 Credit: 10,000,924 RAC: 0 |
One for the programmer to fix... VBoxManage -q controlvm "boinc_4db4d3640e56fc7f" screenshotpng "C:\ProgramData\BOINC\slots\1/vbox_screenshot.png" |
Joined: 15 Jun 08 Posts: 2681 Credit: 286,839,635 RAC: 63,998 |
Nope. It is a valid command. |
Joined: 7 Aug 14 Posts: 27 Credit: 10,000,924 RAC: 0 |
Indeed! In VBoxManage, awful. |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,739,398 RAC: 81,620 |
"Found this:" Many thanks, computezrmle, for investigating. Meanwhile I have noticed that some of the recent Theory tasks ("Herwig7") consume considerably more disk space than any of the earlier ones. In my specific case, the problem is that I had moved the slots folder to a 7 GB ramdisk. With 2 tasks running simultaneously, 7 GB is not enough. When one of these tasks starts, a huge download begins, creating a 6 GB vdi in the slots folder / boinc ... / Snapshots. Further, console 2 does not show the usual progress with the number of events processed, but rather a kind of horizontal percentage bar starting with "Integrate 12 of 760" (for example). So at the moment, for testing purposes, I run only 1 Theory task at a time on one of the affected hosts. In the future I could increase the ramdisk to something like 12.5 or 13 GB, but that might cause problems with the remaining system RAM, since total RAM is 16 GB. I can only try and see what the outcome is. |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,739,398 RAC: 81,620 |
I just realized that such a Herwig7 task will not finish within the 10-day deadline set by the project (in fact, it would take between 15 and 18 days). The hosts on which I run Theory are pretty old and slow, with 2-core (+2 HT) CPUs; that's why I have so far used them for no sub-project other than Theory. Running the current 4-core CMS on an old 2-core + 2 HT CPU might not be the best idea; I think I'd rather try ATLAS with a 2- or 3-core setting. Any comments or other ideas? |
Joined: 14 Jan 10 Posts: 1461 Credit: 9,855,658 RAC: 2,747 |
"Any comments or other ideas?" @Erich: I looked at some of my results. A pythia8 result reported job diskusage=3380, but the result page shows a peak disk usage of 1.11 GB. A herwig7 result reported job diskusage=7280, and the result page shows a peak disk usage of 1.52 GB. Here are 2 ATLAS tasks from my laptop, which has a CPU similar to yours, running on three threads: https://lhcathome.cern.ch/lhcathome/result.php?resultid=414728551 https://lhcathome.cern.ch/lhcathome/result.php?resultid=414707625 |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,739,398 RAC: 81,620 |
Crystal Pellet, many thanks for the information. Since no ATLAS tasks are available at the moment, I started a CMS task on one of these 2-core / 2-HT hosts, so I'll see how long it takes to finish. I am afraid it will take a very long time. BTW, for some reason, Theory tasks are also not available right now. |
Joined: 14 Jan 10 Posts: 1461 Credit: 9,855,658 RAC: 2,747 |
".... I started a CMS task on one of these 2-core / 2-HT hosts. So I'll see how long it takes to finish. I am afraid very long time." Be aware that a CMS job has a duration of about 12 hours, up to a maximum of 18 hours. After 18 hours your task will be killed, even when it's the first job running inside the VM. That would be a pity, because no valid scientific work would be returned (and probably no credits). |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,739,398 RAC: 81,620 |
Thank you for the information, I was not aware of this detail. So I'll know tomorrow morning :-) Too bad that no ATLAS tasks are around :-( |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,739,398 RAC: 81,620 |
"thank you for the information, I was not aware of this detail." The CMS task was over in 12 hours - not so bad for this old CPU :-) |
Joined: 14 Jan 10 Posts: 1461 Credit: 9,855,658 RAC: 2,747 |
"What I meanwhile noticed is that some of the recent Theory tasks ("Herwig7") are consuming considerably more disk space than all the other ones so far." You're right. Herwig7 shows behaviour I had not noticed before. These new 760 integration steps, each with 4 iterations going to 100%, take a very long time. After over 150 minutes of runtime, 8 out of 760 integration steps are done. Finally that should result in the processing of 6000 events. I'll see how long that takes. https://lhcathome.cern.ch/lhcathome/result.php?resultid=414815923 ===> [runRivet] Tue Oct 8 14:12:41 UTC 2024 [boinc pp z1j 8000 30 - herwig7 7.2.1 nlo 6000 142] Btw: The slot folder for this task contains 5.74 GB of data, of which the differencing vdi image in the snapshot folder is 6019072 kB and very slowly increasing. EDIT: 09-Oct-2024 00:17:08 [LHC@home] Aborting task Theory_2794-3267759-142_1: exceeded disk limit: 9321.23MB > 7629.39MB For Theory a total of 8,000,000,000 bytes in the slot folder is allowed. |
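The numbers in the abort message can be cross-checked with a little unit arithmetic. The sketch below assumes that "MB" in the BOINC log means 1024 × 1024 bytes and that "kB" for the vdi image means 1024 bytes; that reading is inferred from the figures matching, not from any documentation. The helper names are made up for illustration.

```python
MIB = 1024 * 1024  # assumed meaning of "MB" in the BOINC log line


def bytes_to_mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / MIB


def kib_to_gib(n_kib):
    """Convert a KiB count to GiB."""
    return n_kib / (1024 * 1024)


limit_mib = bytes_to_mib(8_000_000_000)  # allowed slot folder size for Theory
usage_mib = 9321.23                      # usage reported in the abort message
vdi_gib = kib_to_gib(6_019_072)          # differencing vdi image in the snapshot folder

print(f"limit:   {limit_mib:.2f} MiB")             # 7629.39
print(f"over by: {usage_mib - limit_mib:.2f} MiB")  # 1691.84
print(f"vdi:     {vdi_gib:.2f} GiB")               # 5.74
```

Under these assumptions the 8,000,000,000 byte limit reproduces the 7629.39MB in the log, and the vdi image alone accounts for the 5.74 GB seen in the slot folder at that point.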
Joined: 28 Sep 04 Posts: 780 Credit: 59,888,811 RAC: 47,724 |
The 760 integration steps take about 60-70 hours to run on my new Ryzen 9 7950X. For my i7-7820X I estimated this to take about 140 hours. In addition to the integration phase, the tasks run the normal event-crunching phase, as previous Theory tasks do. Unfortunately these new tasks are quite prone to 'Missing heartbeat' errors. I've already lost at least 5 tasks to that. [edit] I've finished one task successfully so far: https://lhcathome.cern.ch/lhcathome/result.php?resultid=414777076 |
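For a rough feel of how the integration phase compares with the 10-day deadline, the progress figure reported earlier in the thread (8 of 760 steps after 150 minutes) can be extrapolated linearly. This is only a sketch: it assumes every step takes the same time, which the thread does not confirm, and the function name is hypothetical.

```python
def extrapolate_hours(steps_done, steps_total, minutes_elapsed):
    """Linear estimate of total hours, assuming all steps take equal time."""
    return steps_total / steps_done * minutes_elapsed / 60


DEADLINE_HOURS = 10 * 24  # 10-day deadline mentioned in the thread

estimate = extrapolate_hours(8, 760, 150)
print(f"estimated integration phase: {estimate} h")           # 237.5 h
print(f"margin before deadline: {DEADLINE_HOURS - estimate} h")  # 2.5 h
```

On that slower host the integration phase alone would use nearly the whole deadline, before any events are processed, whereas the 60-70 hours and 140 hours quoted above leave some margin.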
Joined: 18 Dec 15 Posts: 1908 Credit: 144,739,398 RAC: 81,620 |
I don't think it was a good idea to send out these Herwig7 tasks. I stopped downloading Theory tasks and switched to CMS, which runs okay even on the old and weak CPUs. On one of them, I am trying ATLAS - the task is still in progress, so I cannot tell yet how well (or not) it works. |
Joined: 15 Jun 08 Posts: 2681 Credit: 286,839,635 RAC: 63,998 |
Please check at console 2 if the VM uses lots of RAM/swap. If so, you may try a higher RAM value via app_config.xml. |
Joined: 18 Dec 15 Posts: 1908 Credit: 144,739,398 RAC: 81,620 |
"Please check at console 2 if the VM uses lots of RAM/swap." Hm, how can I see this in console 2? FYI, the current app_config.xml setting is: 3 cores, 5700MB RAM. |
Joined: 15 Jun 08 Posts: 2681 Credit: 286,839,635 RAC: 63,998 |
"hm, how can I see this in console 2?" Sorry, typo. Check the top output on console 3. "FYI, the current app_config.xml setting is: 3 cores, 5700MB RAM." I guess that's for your CMS VMs. Your Theory VMs are configured to use 630 MB and 1 core. https://lhcathome.cern.ch/lhcathome/result.php?resultid=414769302 2024-10-06 01:19:13 (5924): Setting Memory Size for VM. (630MB) 2024-10-06 01:19:13 (5924): Setting CPU Count for VM. (1) |
Joined: 14 Jan 10 Posts: 1461 Credit: 9,855,658 RAC: 2,747 |
"Please check at console 2 if the VM uses lots of RAM/swap." (Console 3:) I had already set 768 MB for the Theory VMs. A running Herwig7 shows within the VM: KiB Mem: 744976 total, 65940 free, 387080 used, 291020 buff/cache KiB Swap: 1048572 total, 871920 free, 176652 used. 203840 avail Mem |
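To judge whether a VM "uses lots of RAM/swap", the two summary lines from top can be reduced to percentages. The following is a minimal sketch that parses the snippet exactly as posted; the function name and the regular expression are illustrative and only tested against this one snippet, not against every top layout.

```python
import re

# The top summary lines as posted in this thread (values in KiB).
TOP_SNIPPET = (
    "KiB Mem: 744976 total, 65940 free, 387080 used 291020 buff/cache "
    "KiB Swap: 1048572 total, 871920 free. 176652 used. 203840 avail Mem"
)


def parse_top_memory(text):
    """Extract total/free/used KiB for the Mem and Swap summary lines."""
    result = {}
    for section in ("Mem", "Swap"):
        # \W+ tolerates either ", " or ". " between the fields.
        match = re.search(
            rf"KiB {section}\s*:\s*(\d+) total\W+(\d+) free\W+(\d+) used", text
        )
        if match:
            total, free, used = map(int, match.groups())
            result[section] = {"total": total, "free": free, "used": used}
    return result


stats = parse_top_memory(TOP_SNIPPET)
swap_pct = 100 * stats["Swap"]["used"] / stats["Swap"]["total"]
mem_pct = 100 * stats["Mem"]["used"] / stats["Mem"]["total"]
print(f"RAM used: {mem_pct:.1f} %, swap used: {swap_pct:.1f} %")
```

For the values above, about half of the RAM and roughly 17 % of the swap are in use.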
Joined: 18 Dec 15 Posts: 1908 Credit: 144,739,398 RAC: 81,620 |
... no, for ATLAS VMs. I (wrongly) thought you were referring to my message saying that I am trying an ATLAS task right now on one of these old machines. In fact, I was not even aware that Theory can now run on more than 1 core again; I think this was the case a long time ago (if I remember right, but I might be mistaken), and then Theory was switched back to 1 core only. Also, I have now checked the app_config.xml files on various hosts - none of them has a setting for RAM in MB. So obviously, I had never set this before. |
©2025 CERN