1) Message boards : ATLAS application : All tasks failing (Message 51218)
Posted 9 days ago by PekkaH
Post:
All ATLAS and CMS jobs are still failing.

ATLAS error log:
2024-11-28 16:37:20 (938304): Guest Log: *** Error codes and diagnostics ***
2024-11-28 16:37:20 (938304): Guest Log: "exeErrorCode": 0,
2024-11-28 16:37:20 (938304): Guest Log: "exeErrorDiag": "",
2024-11-28 16:37:20 (938304): Guest Log: "pilotErrorCode": 1305,
2024-11-28 16:37:20 (938304): Guest Log: "pilotErrorDiag": "Failed to execute payload:/bin/bash: Sim_tf.py: command not found\n",

And on CMS, as others have reported, the VM starts but nothing gets executed. This is frustrating.
2) Questions and Answers : Getting started : Cron of CERNVM sends lots of e-mail messages to root@localhost on failure/error (Message 48979)
Posted 5 Dec 2023 by PekkaH
Post:
Hi Again,

This problem is still active; it seems my mail server's root mailbox is full of these messages (172937 msgs in 2 months).

The latest look like the one below (IP addresses & domain names obscured):
=========
Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (unknown [k.l.m.n])
by mail.x.y.z (Postfix) with SMTP id 229D911768
for <postmaster@localhost>; Tue, 5 Dec 2023 10:35:14 +0000 (UTC)
Received: by localhost (sSMTP sendmail emulation); Tue, 05 Dec 2023 11:35:12 +0100
From: "root" <root@localhost>
Date: Tue, 05 Dec 2023 11:35:12 +0100
To: root
Content-Type: text/plain; charset="ANSI_X3.4-1968"
Subject: Anacron job 'cron.daily' on localhost
X-UID: 172936
Status: O

/etc/cron.daily/cernvm-update-notification:

Failed to initialize root file catalog (16 - file catalog failure)
========
and like this (obscured):
=============
Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (k.l.m.n)
by mail.x.y.z (Postfix) with SMTP id 009522487E
for <postmaster@localhost>; Sat, 2 Dec 2023 14:36:02 +0000 (UTC)
Received: by localhost (sSMTP sendmail emulation); Sat, 02 Dec 2023 15:36:01 +0100
From: "root" <root@localhost>
Date: Sat, 02 Dec 2023 15:36:01 +0100
To: root
Subject: Cron <root@localhost> rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/
Content-Type: text/plain; charset=ANSI_X3.4-1968
Auto-Submitted: auto-generated
Precedence: bulk
X-Cron-Env: <XDG_SESSION_ID=2133>
X-Cron-Env: <XDG_RUNTIME_DIR=/run/user/0>
X-Cron-Env: <LANG=C>
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/root>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>
X-UID: 170001
Status: O

rsync: change_dir "/home/boinc/cernvm/shared/html/job" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]
=================

I can dig for more; it would be nice to get rid of these. It seems this problem happens at least when you have a vanilla Ubuntu 22.04 server and a host named "mail" configured in the network - that host then gets flooded by the LHC jobs.
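Until the VM image stops sending these in the first place, one workaround on the receiving side is to refuse SMTP from the crunching hosts at the mail server. A sketch for Postfix, assuming it is the MTA on the "mail" host; the map path is a placeholder, and k.l.m.n stands for your BOINC hosts' addresses:

```
# /etc/postfix/main.cf -- reject SMTP from the BOINC hosts
smtpd_client_restrictions = check_client_access hash:/etc/postfix/boinc_hosts

# /etc/postfix/boinc_hosts -- after editing, run:
#   postmap /etc/postfix/boinc_hosts && postfix reload
k.l.m.n    REJECT BOINC VM cron noise
```

This only silences the flood at the receiver, of course; the cron jobs inside the task VMs keep firing.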

Br Pekka
3) Message boards : ATLAS application : All Atlas jobs failing on Win11 host (Message 48941)
Posted 19 Nov 2023 by PekkaH
Post:
And the resolution to the problem was to completely reinstall VirtualBox (and BOINC); reinstalling BOINC alone didn't help.
I never found the actual cause of the issue within VirtualBox, though.

Pekka
4) Message boards : ATLAS application : All Atlas jobs failing on Win11 host (Message 48930)
Posted 17 Nov 2023 by PekkaH
Post:
All ATLAS jobs failing.

And I can't figure out why. I have already reinstalled BOINC a few times (7.24.1), VirtualBox is 7.0.12 (and happily runs other virtual machines), and my host is Win11 with the latest patches. ATLAS jobs have run successfully on this hardware earlier, so I don't suspect that either. But the BOINC/VirtualBox combo is new, and I don't remember whether it has ever run an ATLAS job successfully.

An example of a failing job is here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=401842192.
The error text in question:

2023-11-17 21:38:57 (9532): Error in deregister parent vdi for VM: -2135228404
Command:
VBoxManage -q closemedium "C:\ProgramData\BOINC/projects/lhcathome.cern.ch_lhcathome/ATLAS_vbox_3.01_image.vdi"
Output:
VBoxManage.exe: error: Cannot close medium 'C:\ProgramData\BOINC\projects\lhcathome.cern.ch_lhcathome\ATLAS_vbox_3.01_image.vdi' because it has 1 child media
VBoxManage.exe: error: Details: code VBOX_E_OBJECT_IN_USE (0x80bb000c), component MediumWrap, interface IMedium, callee IUnknown
VBoxManage.exe: error: Context: "Close()" at line 1875 of file VBoxManageDisk.cpp

2023-11-17 21:38:57 (9532): Could not create VM
2023-11-17 21:38:57 (9532): ERROR: VM failed to start
2023-11-17 21:38:57 (9532): Powering off VM.
2023-11-17 21:38:57 (9532): Deregistering VM. (boinc_98d04b0a3a9f3e97, slot#0)
2023-11-17 21:38:57 (9532): Removing network bandwidth throttle group from VM.
2023-11-17 21:38:57 (9532): Removing VM from VirtualBox.

Has anyone else seen anything similar, or am I alone with these issues? Any suggestions on how to triage the problem further are appreciated.
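For anyone scripting triage across many hosts, the key failure above can be picked out of the wrapper output mechanically. A minimal sketch; the regex and function name are my own, derived from this one log line, not part of any tool:

```python
import re

# Pull the medium path and child count out of the VBoxManage
# "closemedium" failure seen in the log above. The pattern is an
# assumption based on this single observed error line.
ERR = re.compile(r"Cannot close medium '([^']+)' because it has (\d+) child media")

def parse_close_error(output):
    """Return (medium_path, child_count) if the error is present, else None."""
    m = ERR.search(output)
    return (m.group(1), int(m.group(2))) if m else None

line = ("VBoxManage.exe: error: Cannot close medium "
        "'C:\\ProgramData\\BOINC\\projects\\lhcathome.cern.ch_lhcathome\\ATLAS_vbox_3.01_image.vdi' "
        "because it has 1 child media")
print(parse_close_error(line))
```

A non-zero child count suggests a leftover differencing image in the VirtualBox media registry, which is consistent with a full VirtualBox reinstall having cleared it.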

Pekka
5) Questions and Answers : Getting started : Cron of CERNVM sends lots of e-mail messages to root@localhost on failure/error (Message 48755)
Posted 5 Oct 2023 by PekkaH
Post:
Hi,

sorry for being away for a few days.

- my records start from 16 Sept, and I could see the mails coming in constantly. It is of course possible that they were flowing earlier, but I have no records of that

I can configure my system back to the same setup so that I can see the mails & the problem again. Hopefully I'll have fresh data for you tomorrow. BTW, my setup has Ubuntu 22.04 servers and Win10/11 desktops.

Br Pekka
6) Questions and Answers : Getting started : Cron of CERNVM sends lots of e-mail messages to root@localhost on failure/error (Message 48717)
Posted 1 Oct 2023 by PekkaH
Post:
I can see the same. Over a two-week period my mail server has received ~30k messages originating from the CERN VMs. I can of course configure all hosts so the VMs are not allowed to send mail, but I'd prefer they didn't send it in the first place. Triage so far:
- all cluster hosts' IP addresses show up as mail origins. These are Linux and Win10/Win11 boxes.
- the Windows boxes have no mail system, so the only possible source is the task VM itself (to which I do not have access)
- in addition to the message shown in the thread opener, I also see Anacron messages (which I also think originate from inside the task VM)

Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (unknown [x.y.t.z])
by mail.dii.daa (Postfix) with SMTP id 2596D117F5
for <postmaster@localhost>; Sun, 1 Oct 2023 20:46:44 +0300 (EEST)
Received: by localhost (sSMTP sendmail emulation); Sun, 01 Oct 2023 19:46:42 +0200
From: "root" <root@localhost>
Date: Sun, 01 Oct 2023 19:46:42 +0200
To: root
Content-Type: text/plain; charset="ANSI_X3.4-1968"
Subject: Anacron job 'cron.daily' on localhost
Content-Length: 112
Lines: 3
X-UID: 27
Status: OR

/etc/cron.daily/cernvm-update-notification:

Failed to initialize root file catalog (16 - file catalog failure)
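To back the triage above with numbers, the Received: headers in the spool can be tallied per originating host. A minimal sketch, assuming the headers keep the shape shown in the samples above (the regex and sample text are mine):

```python
import re
from collections import Counter

# Count flood mails per originating IP, based on the Received: header
# shapes seen above -- both "(unknown [a.b.c.d])" and "(a.b.c.d)" forms.
# Treating these two shapes as exhaustive is an assumption.
RECEIVED = re.compile(r'Received: from localhost \((?:unknown )?\[?([0-9.]+)\]?\)')

def count_origins(mbox_text):
    """Return a Counter mapping originating IP -> message count."""
    return Counter(RECEIVED.findall(mbox_text))

sample = """Received: from localhost (unknown [10.0.0.5])
Received: from localhost (unknown [10.0.0.5])
Received: from localhost (10.0.0.7)
"""
print(count_origins(sample).most_common())
```

Feeding it the real mbox (e.g. the contents of root's mailbox) would show which cluster hosts generate most of the noise.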

Br Pekka
7) Message boards : ATLAS application : Uploading stuck (Message 48187)
Posted 3 Jun 2023 by PekkaH
Post:
Hi,
thanks for the support, everyone. It seems the problem was on the CERN server side, as all hosts have now managed to upload their results. The only change I made this time was adding the squid "client_req..." conf option, which had been at its default previously. But that change didn't alter the behavior at my end; the queues started to clear by themselves yesterday afternoon, and now all is fine again.
No further actions needed.
/pekka
8) Message boards : ATLAS application : Uploading stuck (Message 48185)
Posted 2 Jun 2023 by PekkaH
Post:
Hi,

as some ATLAS jobs got their uploads through, and SixTrack, CMS and Theory work as expected, I don't suspect squid anymore.
Instead, I think something is causing a project backoff on the CERN server side. One of my hosts managed to upload all jobs, whereas there are still 3 more with hanging uploads ...

/pekka
9) Message boards : ATLAS application : Uploading stuck (Message 48183)
Posted 2 Jun 2023 by PekkaH
Post:
Hi,
on the client side, in the Transfers tab, I can see lots of project backoffs ...
/pekka
10) Message boards : ATLAS application : Uploading stuck (Message 48179)
Posted 2 Jun 2023 by PekkaH
Post:
Hi,

I don't see timestamp issues in the logs of the jobs that went through (on the same host). A few ATLAS jobs have uploaded successfully, but many are hanging. My setup has its own NTP server, to which the servers sync constantly.
And btw - the setup has been running for months without major issues; the stuck ATLAS uploads started to manifest around 30.5, 15:40 EET. No problems before that for many months.

/pekka
11) Message boards : ATLAS application : Uploading stuck (Message 48175)
Posted 2 Jun 2023 by PekkaH
Post:
Thanx,

Yes, I have Ubuntu 22.04 & squid 5.2.
I added the said conf option, but squid -k reconfigure has had no effect (at least yet).
I will restart the squid VM ....

/pekka
12) Message boards : ATLAS application : Uploading stuck (Message 48173)
Posted 2 Jun 2023 by PekkaH
Post:
Hi,
I have a number of hosts experiencing stuck ATLAS uploads (~20 of them).
I have checked my proxy, but other workloads like SixTrack & Theory load correctly, so I suspect an issue in ATLAS itself.
Is anyone else experiencing a similar situation?

/Pekka
13) Message boards : ATLAS application : Constant GuruMeditation (Message 46325)
Posted 23 Feb 2022 by PekkaH
Post:
And the resolution, in case someone else stumbles into the same issue:
apparently my PC had too many CPUs per unit of memory for the ATLAS job, i.e. 8 vCPUs vs 8 GB RAM. When I limited the ATLAS job to use only 4 vCPUs, the problem disappeared.
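As a rough sanity check for others sizing their VMs, the vCPU-vs-RAM trade-off can be sketched like this. The base and per-core memory figures below are assumptions for illustration, not official ATLAS requirements:

```python
def max_vcpus(ram_gb, base_gb=3.0, per_core_gb=0.9):
    """Largest vCPU count whose assumed VM memory need (a fixed base
    plus a per-core increment) still fits in the available RAM.
    base_gb and per_core_gb are illustrative assumptions."""
    n = 0
    while base_gb + (n + 1) * per_core_gb <= ram_gb:
        n += 1
    return n

# Under these assumed figures, an 8 GB host supports fewer than 8 vCPUs.
print(max_vcpus(8))
```

With numbers in this ballpark, an 8 vCPU / 8 GB configuration is over-committed on memory, which fits the observed fix of dropping to 4 vCPUs.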
Case closed.
Br Pekka
14) Message boards : ATLAS application : Constant GuruMeditation (Message 46243)
Posted 13 Feb 2022 by PekkaH
Post:
OK - what is weird here is that some ATLAS jobs go through but some do not. Anyway, I will limit the CPU count and reinstall VirtualBox once again ...
15) Message boards : ATLAS application : Constant GuruMeditation (Message 46241)
Posted 12 Feb 2022 by PekkaH
Post:
Hi,
I could not find a topic about GuruMeditation, so let's start one here.
One of my PCs constantly errors out on ATLAS jobs with GuruMeditation. Have you seen anything similar, and how should I troubleshoot this further? PS: I have 2 other PCs with the same SW config and they are fine; the HW is of course different. Below is one such error log:

2022-02-12 09:49:43 (6672): Status Report: Elapsed Time: '6000.000000'
2022-02-12 09:49:43 (6672): Status Report: CPU Time: '22504.296875'
2022-02-12 11:29:49 (6672): Status Report: Elapsed Time: '12000.000000'
2022-02-12 11:29:49 (6672): Status Report: CPU Time: '49497.281250'
2022-02-12 13:09:55 (6672): Status Report: Elapsed Time: '18000.000000'
2022-02-12 13:09:55 (6672): Status Report: CPU Time: '76492.125000'
2022-02-12 14:50:03 (6672): Status Report: Elapsed Time: '24000.000000'
2022-02-12 14:50:03 (6672): Status Report: CPU Time: '103467.453125'
2022-02-12 16:30:10 (6672): Status Report: Elapsed Time: '30000.000000'
2022-02-12 16:30:10 (6672): Status Report: CPU Time: '130308.265625'
2022-02-12 18:10:20 (6672): Status Report: Elapsed Time: '36000.946377'
2022-02-12 18:10:20 (6672): Status Report: CPU Time: '157015.593750'
2022-02-12 19:28:03 (6672): VM is no longer is a running state. It is in 'GuruMeditation'.
2022-02-12 19:28:03 (6672): VM state change detected. (old = 'Running', new = 'GuruMeditation')
2022-02-12 19:28:03 (6672): Powering off VM.
2022-02-12 19:28:03 (6672): Deregistering VM. (boinc_400ffc08596d1a2b, slot#0)
2022-02-12 19:28:31 (6672): CreateProcess failed! (299).
2022-02-12 19:28:31 (6672): Removing network bandwidth throttle group from VM.
2022-02-12 19:28:36 (6672): CreateProcess failed! (299).
2022-02-12 19:28:37 (6672): Removing VM from VirtualBox.
2022-02-12 19:28:42 (6672): CreateProcess failed! (299).
2022-02-12 19:28:47 (6672): Virtual machine exited.
19:28:57 (6672): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>9J8MDmFIhc0nsSi4apGgGQJmABFKDmABFKDmNv0VDmABFKDmshBo1m_1_r1819161101_ATLAS_result</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
]]>
16) Message boards : Theory Application : 196 (0x000000C4) EXIT_DISK_LIMIT_EXCEEDED - how come? (Message 41116)
Posted 30 Dec 2019 by PekkaH
Post:
The same here: a big number of Theory app instances failing.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=256633820
Annoying ....



©2024 CERN