Message boards : ATLAS application : All tasks failing

hadron

Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 5,452
Message 50637 - Posted: 26 Sep 2024, 2:47:41 UTC

Since about 23:30 25 Sept, I have had only one successful task run to completion. All the others have been failing with this in the stderr_txt:
2024-09-25 20:01:36 (15434): 
Command: VBoxManage -q storageattach "boinc_674f437b0a9c5e28" --storagectl "Hard Disk Controller" --port 0 --device 0 --type hdd --mtype multiattach --medium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_3.01_image.vdi" 
Exit Code: -2135228409
Output:
VBoxManage: error: Cannot attach medium '/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_3.01_image.vdi': the media type 'MultiAttach' can only be attached to machines that were created with VirtualBox 4.0 or later
VBoxManage: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component SessionMachine, interface IMachine, callee nsISupports
VBoxManage: error: Context: "AttachDevice(Bstr(pszCtl).raw(), port, device, DeviceType_HardDisk, pMedium2Mount)" at line 785 of file VBoxManageStorageController.cpp

2024-09-25 20:01:36 (15434): 
Command: VBoxManage -q closemedium "/var/lib/boinc/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_3.01_image.vdi" 
Exit Code: 0
Output:
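
A side note on reading the log above: the negative exit code and the hex code in the error details appear to be the same 32-bit value, once printed as a signed integer and once as unsigned hex. A minimal sketch to check that, where the helper name `to_unsigned_hex` is made up for illustration:

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    # Masking with 0xFFFFFFFF maps negative values to their two's-complement form.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# Exit code taken from the storageattach failure in the log above.
print(to_unsigned_hex(-2135228409))  # 0x80bb0007
```

The result matches the VBOX_E_INVALID_OBJECT_STATE (0x80bb0007) code reported in the details line, so the two numbers describe one error rather than two.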

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 573
Message 50641 - Posted: 26 Sep 2024, 11:05:40 UTC - in response to Message 50637.  
Last modified: 26 Sep 2024, 11:06:01 UTC

Did you check in VirtualBox Manager (Tools > Media) whether you have any child media flagged with exclamation marks?

hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 5,452
Message 50651 - Posted: 26 Sep 2024, 18:31:11 UTC - in response to Message 50641.  

Did you check in VirtualBox Manager (Tools > Media) whether you have any child media flagged with exclamation marks?

None

Toggleton
Joined: 4 Mar 17
Posts: 25
Credit: 10,262,043
RAC: 574
Message 50652 - Posted: 26 Sep 2024, 18:48:03 UTC

Looking at your tasks, they all fail with exit status 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED and show 12 GB peak disk usage. So it is not your fault; you have just received a lot of the 6.09 GB task files that fail for everyone. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6214
Not sure whether those 6 GB tasks are still being sent. I have not received a big one in the last few hours.

hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 5,452
Message 50653 - Posted: 26 Sep 2024, 20:52:28 UTC - in response to Message 50652.  

Not sure whether those 6 GB tasks are still being sent. I have not received a big one in the last few hours.

That's probably because the queue is empty right now.

Erich56
Joined: 18 Dec 15
Posts: 1823
Credit: 119,020,715
RAC: 16,704
Message 51113 - Posted: 23 Nov 2024, 6:12:49 UTC
Last modified: 23 Nov 2024, 7:08:57 UTC

What's happening with ATLAS?
Tons of tasks have been erroring out after a few minutes since last night, see:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=416867873

BTW: the download files per task are 1.53 GB in size (!!!)

P.S.: I just looked up the tasks list of a few other volunteers - same problem there. So at least there's nothing wrong with my hosts.

Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 573
Message 51114 - Posted: 23 Nov 2024, 12:48:30 UTC

From the Guest Log: "pilotErrorDiag": "Failed to execute payload:/bin/bash: Sim_tf.py: command not found\n

No idea whether someone at ATLAS can fix that.

PDW
Joined: 7 Aug 14
Posts: 27
Credit: 10,000,233
RAC: 131
Message 51115 - Posted: 23 Nov 2024, 13:02:41 UTC - in response to Message 51114.  

From the Guest Log: "pilotErrorDiag": "Failed to execute payload:/bin/bash: Sim_tf.py: command not found\n

No idea whether someone at ATLAS can fix that.

Same error as https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6248
Who knows, it might even be the same work sent out again!

Lem Novantotto
Joined: 24 May 23
Posts: 43
Credit: 2,624,143
RAC: 3,726
Message 51118 - Posted: 23 Nov 2024, 17:33:24 UTC - in response to Message 51115.  

Same error as https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6248
Who knows, it might even be the same work sent out again!


Very disappointing. And a bit surprising to me, since we're talking about CERN.
However, I don't know how useful the work processed on our PCs actually is, and so how much effort it deserves alongside CERN's "regular" work.

It's a fact that, among the few distributed computing projects still alive (a small fraction of those running not so many years ago), this one seems more prone to periodic issues (in my limited experience, at least), which I find... unexpected.
--
Bye, Lem

rbpeake
Joined: 17 Sep 04
Posts: 105
Credit: 32,824,862
RAC: 40
Message 51119 - Posted: 23 Nov 2024, 17:50:47 UTC - in response to Message 51118.  

Same error as https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6248

Very disappointing. And a bit surprising to me, since we're talking about CERN.
However, I don't know how useful the work processed on our PCs actually is, and so how much effort it deserves alongside CERN's "regular" work.

It's a fact that, among the few distributed computing projects still alive (a small fraction of those running not so many years ago), this one seems more prone to periodic issues (in my limited experience, at least), which I find... unexpected.
--
Bye, Lem


I certainly get the feeling that this project is not a particularly high priority.
Regards,
Bob P.

Erich56
Joined: 18 Dec 15
Posts: 1823
Credit: 119,020,715
RAC: 16,704
Message 51120 - Posted: 23 Nov 2024, 18:56:40 UTC - in response to Message 51119.  

I certainly get the feeling that this project is not a particularly high priority.
That's exactly what I think, unfortunately :-(

Dark Angel
Joined: 7 Aug 11
Posts: 105
Credit: 25,221,969
RAC: 7,833
Message 51122 - Posted: 23 Nov 2024, 19:24:51 UTC
Last modified: 23 Nov 2024, 19:29:53 UTC

Come on folks, really??
cvmfs_config probe returns normally so it's not my local machine or network, the data simply isn't there.

[2024-11-23 19:24:26] "exeErrorCode": 0,
[2024-11-23 19:24:26] "exeErrorDiag": "",
[2024-11-23 19:24:26] "pilotErrorCode": 1305,
[2024-11-23 19:24:26] "pilotErrorDiag": "Failed to execute payload:/bin/bash: Sim_tf.py: command not found\n",
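
For anyone who wants to pull these fields out of a long stderr log instead of scrolling, here is a rough sketch. It assumes the log keeps the `"nameErrorCode": value` and `"nameErrorDiag": "text"` layout shown above; the function name `extract_errors` and the regular expressions are illustrative and not part of any ATLAS or BOINC tool.

```python
import re

# Patterns for the two kinds of fields seen in the excerpt above.
CODE_RE = re.compile(r'"(\w+ErrorCode)":\s*(-?\d+)')
DIAG_RE = re.compile(r'"(\w+ErrorDiag)":\s*"(.*)"')


def extract_errors(lines):
    """Collect error codes (as int) and diagnostics (as str) from log lines."""
    found = {}
    for line in lines:
        code = CODE_RE.search(line)
        if code:
            found[code.group(1)] = int(code.group(2))
            continue
        diag = DIAG_RE.search(line)
        if diag:
            found[diag.group(1)] = diag.group(2)
    return found


# Sample lines copied from the log above (raw strings keep the literal \n).
sample = [
    r'[2024-11-23 19:24:26] "exeErrorCode": 0,',
    r'[2024-11-23 19:24:26] "exeErrorDiag": "",',
    r'[2024-11-23 19:24:26] "pilotErrorCode": 1305,',
    r'[2024-11-23 19:24:26] "pilotErrorDiag": "Failed to execute payload:/bin/bash: Sim_tf.py: command not found\n",',
]
errors = extract_errors(sample)
print(errors["pilotErrorCode"], errors["pilotErrorDiag"])
```

A non-zero `pilotErrorCode` together with an `exeErrorCode` of 0 suggests the payload never started, which fits the "command not found" message.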

So what's new up there since the last time the ATLAS project had viable work? New work experience kids? New hardware that hasn't been configured? Someone got switched to decaf as a prank?

rob
Joined: 4 Mar 11
Posts: 29
Credit: 3,848,900
RAC: 7
Message 51131 - Posted: 24 Nov 2024, 16:38:57 UTC - in response to Message 51122.  

I had four "good" tasks on 22 Nov; since then all 16 have failed with "validate error" as the headline. Lots of strange messages:
2024-11-24 13:37:11 (7136): Guest Log: *** Starting ATLAS job. (PandaID=6416690328 taskID=42161013) ***
2024-11-24 13:39:31 (7136): Guest Log: *** Job finished ***
2024-11-24 13:39:31 (7136): Guest Log: *** The last 20 lines of the pilot log: ***
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:22,732 | INFO | generated guid for lfn=HITS.42161013._131760.pool.root.1: 45DB498D-73E2-4806-8741-CB186C50CDEB
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:22,732 | WARNING | aborting payload error diagnosis since an error has already been set: [127, 1187]
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:23,775 | INFO | [payload] execute_payloads thread has finished
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:24,235 | INFO | only monitor.control thread still running - safe to abort: ['<_MainThread(MainThread, started 140077397043008)>', '<ExcThread(monitor, started 140077103560448)>']
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:24,490 | WARNING | job_aborted has been set - aborting pilot monitoring
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:24,490 | INFO | [monitor] control thread has ended
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,240 | INFO | all workflow threads have been joined
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,240 | INFO | end of generic workflow (traces error code: 0)
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,241 | INFO | traces error code: 0
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,241 | INFO | pilot has finished (exit code=0, shell exit code=0)
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,299 [wrapper] ==== pilot stdout END ====
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,303 [wrapper] ==== wrapper stdout RESUME ====
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,306 [wrapper] pilotpid: 5928
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,309 [wrapper] Pilot exit status: 0
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,417 [wrapper] pandaids: 6416690328
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,442 [wrapper] cleanup supervisor_pilot 5934 5929
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,445 [wrapper] Test setup, not cleaning
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,450 [wrapper] apfmon messages muted
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,453 [wrapper] ==== wrapper stdout END ====
2024-11-24 13:39:31 (7136): Guest Log: 2024-11-24 13:39:29,456 [wrapper] ==== wrapper stderr END ====


Then after a few more lines I get:


2024-11-24 13:39:31 (7136): Guest Log: -rw-r--r--. 1 atlas atlas 10776 Nov 24 13:39 runtime_log.err
2024-11-24 13:39:31 (7136): Guest Log: -rw-------. 1 atlas atlas 636 Nov 24 13:39 4ILLDma7QZ6nsSi4ap6QjLDmwznN0nGgGQJmq4hLDmSMhKDm50VHnm.diag
2024-11-24 13:39:31 (7136): Guest Log: Looking for outputfile HITS.42161013._131760.pool.root.1
2024-11-24 13:39:31 (7136): Guest Log: No HITS file was produced
2024-11-24 13:39:31 (7136): Guest Log: Successfully finished the ATLAS job!
2024-11-24 13:39:31 (7136): Guest Log: Copying the results back to the shared directory!
2024-11-24 13:39:31 (7136): Guest Log: *** Contents of shared directory: ***
2024-11-24 13:39:32 (7136): Guest Log: total 269908
2024-11-24 13:39:32 (7136): Guest Log: -rwxrwxrwx. 1 root root 275766805 Nov 24 13:36 ATLAS.root_0
2024-11-24 13:39:32 (7136): Guest Log: -rwxrwxrwx. 1 root root 9433 Nov 24 13:36 init_data.xml
2024-11-24 13:39:32 (7136): Guest Log: -rwxrwxrwx. 1 root root 499895 Nov 24 12:42 input.tar.gz
2024-11-24 13:39:32 (7136): Guest Log: -rwxrwxrwx. 1 root root 81920 Nov 24 2024 result.tar.gz
2024-11-24 13:39:32 (7136): Guest Log: -rwxrwxrwx. 1 root root 17569 Nov 24 12:42 start_atlas.sh
2024-11-24 13:39:32 (7136): Guest Log: *** Success! Shutting down the machine. ***
2024-11-24 13:39:32 (7136): VM Completion File Detected.
2024-11-24 13:39:32 (7136): Powering off VM.
2024-11-24 13:39:32 (7136): Successfully stopped VM.


and the VM stops in an orderly manner.
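
Since the wrapper prints "Successfully finished the ATLAS job!" even when no output was written, the success line alone is not a reliable signal. A small sketch of a check that looks for the "No HITS file was produced" marker instead; it assumes that exact wording stays the same, and the function name `hits_missing` is made up for illustration:

```python
NO_HITS_MARKER = "No HITS file was produced"


def hits_missing(guest_log_lines):
    """Return True if any log line reports that no HITS file was produced."""
    return any(NO_HITS_MARKER in line for line in guest_log_lines)


# Sample lines copied from the log above.
sample_log = [
    "2024-11-24 13:39:31 (7136): Guest Log: Looking for outputfile HITS.42161013._131760.pool.root.1",
    "2024-11-24 13:39:31 (7136): Guest Log: No HITS file was produced",
    "2024-11-24 13:39:31 (7136): Guest Log: Successfully finished the ATLAS job!",
]
print(hits_missing(sample_log))  # True
```

That would explain the "validate error" headline: the VM shuts down cleanly, but there is no result file for the validator to accept.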

(Meanwhile CMS tasks are happily running on the same computer)

Erich56
Joined: 18 Dec 15
Posts: 1823
Credit: 119,020,715
RAC: 16,704
Message 51134 - Posted: 25 Nov 2024, 4:41:09 UTC - in response to Message 51131.  

(Meanwhile CMS tasks are happily running on the same computer)
CMS tasks are running, but certainly not "happily". They do NOT download jobs at startup, which is why there is no CPU usage and the task finishes after about half an hour. You even get a few credit points, but the result of the task is of NO VALUE to the science.
Normally, there is a mechanism that stops the distribution of tasks as soon as no jobs are available. However, it does not seem to be working now, and for 5 days already many volunteers have been downloading and processing these CMS tasks - for nothing, unfortunately.
And, even worse, apparently no one at the receiving end of these useless tasks has noticed this so far and stopped this nonsense :-(

maeax
Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 206
Message 51136 - Posted: 25 Nov 2024, 9:03:02 UTC

ATLAS tasks are getting credit points, but there is no running job inside, on an Intel board:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10593998

Dark Angel
Joined: 7 Aug 11
Posts: 105
Credit: 25,221,969
RAC: 7,833
Message 51141 - Posted: 26 Nov 2024, 6:35:03 UTC

Have we kicked the Trisolarians out of the system yet? Or is it the work experience kid still configuring things?

PekkaH
Joined: 23 Dec 19
Posts: 18
Credit: 43,744,045
RAC: 11,832
Message 51218 - Posted: 28 Nov 2024, 16:48:33 UTC

All ATLAS and CMS jobs are still failing.

ATLAS error log:
2024-11-28 16:37:20 (938304): Guest Log: *** Error codes and diagnostics ***
2024-11-28 16:37:20 (938304): Guest Log: "exeErrorCode": 0,
2024-11-28 16:37:20 (938304): Guest Log: "exeErrorDiag": "",
2024-11-28 16:37:20 (938304): Guest Log: "pilotErrorCode": 1305,
2024-11-28 16:37:20 (938304): Guest Log: "pilotErrorDiag": "Failed to execute payload:/bin/bash: Sim_tf.py: command not found\n",

And on CMS, as others have reported, the VM starts but nothing gets executed. This is frustrating.

greg_be
Joined: 28 Dec 08
Posts: 341
Credit: 4,865,275
RAC: 58
Message 51229 - Posted: 30 Nov 2024, 19:00:21 UTC



©2025 CERN