Questions and Answers : Getting started : Cron of CERNVM sends lots of e-mail messages to root@localhost on failure/error
alverb

Joined: 4 Mar 20
Posts: 5
Credit: 2,984,039
RAC: 1,578
Message 48688 - Posted: 29 Sep 2023, 12:10:41 UTC

When a scheduled task fails, the crond of the CernVM sends lots of e-mail messages (about 150 per hour) to root@localhost.
Here are some sample messages:


From: "root" <root@localhost>
To: root
Subject: Cron <root@localhost> rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/

rsync: change_dir "/home/boinc/cernvm/shared/html/job" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]



From: "root" <root@localhost>
To: root
Subject: Anacron job 'cron.daily' on localhost

/etc/cron.daily/cernvm-update-notification:

Failed to initialize root file catalog (16 - file catalog failure)


Is it possible to stop these messages from being sent?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48689 - Posted: 29 Sep 2023, 12:44:59 UTC - in response to Message 48688.  

For a closer look you should
- make your computers visible for other volunteers (https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project)
- post a link to the computer you got those logs from
- post a link to an example task that computer has already reported
- describe how/where you got the snippets from
alverb

Joined: 4 Mar 20
Posts: 5
Credit: 2,984,039
RAC: 1,578
Message 48693 - Posted: 29 Sep 2023, 14:02:23 UTC - in response to Message 48689.  

For a closer look you should
- make your computers visible for other volunteers (https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project)
- post a link to the computer you got those logs from
- post a link to an example task that computer has already reported
- describe how/where you got the snippets from


- Computer: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10641093
- Task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=399846006

I have a local mail server which receives the messages in its root mailbox.

Here is a sample message source:

Return-Path: <root@localhost>
Delivered-To: admin@example.com
Received: from localhost (localhost [127.0.0.1])
by mail.example.com (mail) with ESMTP id ID
for <postmaster@localhost>; Fri, 29 Sep 2023 13:33:43 +0300 (EEST)
X-Virus-Scanned: amavis
Received: from mail.example.com ([127.0.0.1])
by localhost (mail.example.com [127.0.0.1]) (amavis, port 10024)
with ESMTP id ID for <postmaster@localhost>;
Fri, 29 Sep 2023 13:33:43 +0300 (EEST)
Received: from localhost (unknown [1.2.3.4])
by mail.example.com (mail) with SMTP id ID
for <postmaster@localhost>; Fri, 29 Sep 2023 13:33:42 +0300 (EEST)
Received: by localhost (sSMTP sendmail emulation); Fri, 29 Sep 2023 13:33:01 +0300
From: "root" <root@localhost>
Date: Fri, 29 Sep 2023 13:33:01 +0300
To: root
Subject: Cron <root@localhost> rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/
Content-Type: text/plain; charset=ANSI_X3.4-1968
Auto-Submitted: auto-generated
Precedence: bulk
X-Cron-Env: <XDG_SESSION_ID=108>
X-Cron-Env: <XDG_RUNTIME_DIR=/run/user/0>
X-Cron-Env: <LANG=C>
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/root>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>

rsync: change_dir "/home/boinc/cernvm/shared/html/job" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]


In the .VDI file of the task I've found the cron tab "/persistent/etc/crond/sync-plots" that causes this behavior:
* * * * * root rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/
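
For illustration only, here is how a cron entry like that could be silenced - cron only sends mail when a job produces output and MAILTO is not empty. This is just a sketch; changes inside the project's VDI are not persistent, because the image is replaced on a project reset or app update:

MAILTO=""
# alternatively, keep MAILTO and drop the job's output instead:
* * * * * root rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/ >/dev/null 2>&1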
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48694 - Posted: 29 Sep 2023, 16:13:52 UTC - in response to Message 48693.  

Looks like you use the reserved domain "example.com" in your local environment.
According to RFC 6761, "example.", "example.com.", "example.net." and "example.org." should never be used for that, since they are reserved for documentation purposes only.
See:
https://www.rfc-editor.org/rfc/rfc6761.html


If you need a local (sub-)domain without official delegation, use the reserved domain name "home.arpa" as defined in RFC 8375:
https://www.rfc-editor.org/rfc/rfc8375.html
alverb

Joined: 4 Mar 20
Posts: 5
Credit: 2,984,039
RAC: 1,578
Message 48701 - Posted: 30 Sep 2023, 2:01:01 UTC - in response to Message 48694.  

Sorry, I didn't mention that I've replaced all sensitive data with placeholders to protect the real values, e.g. "mydomain.tld" with "example.com".
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 48702 - Posted: 30 Sep 2023, 2:43:55 UTC - in response to Message 48701.  

Do you have Acronis Cyber Protect Home?
If so, you have to disable its protection or allow the BOINC folder in Acronis.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48706 - Posted: 30 Sep 2023, 7:45:42 UTC - in response to Message 48701.  

Since you obfuscated relevant data, it is not possible to give a qualified answer regarding your mail issue.
In case the vdi file is broken (a boinccmd sketch follows this list):
- set LHC@home to NNT (No New Tasks)
- report all tasks
- do a project reset
- resume work fetch
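
A minimal sketch of the same sequence using boinccmd (assuming the default project URL; run the reset only after all tasks have been reported):

boinccmd --project https://lhcathome.cern.ch/lhcathome/ nomorework      # NNT
boinccmd --project https://lhcathome.cern.ch/lhcathome/ update          # contact the scheduler and report finished tasks
boinccmd --project https://lhcathome.cern.ch/lhcathome/ reset           # project reset; a fresh vdi comes with the next task
boinccmd --project https://lhcathome.cern.ch/lhcathome/ allowmorework   # resume work fetch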


Nonetheless, your logs show a couple of other weird entries.

1.
Computer 10641093 reports less than 8 GB RAM.
LHC@home expects at least 16 GB.

2.
Your tasks suffer from a series of suspend/resume cycles.
This can break network transfers and puts a huge load on the I/O system.

3.
The computer reports 4 cores and your ATLAS tasks are configured to use all of them.
Then you throttle BOINC to 90% CPU usage.
VirtualBox recommends not running VMs with more than 50% of the available cores per VM.
=> 2 cores would be the limit on your computer for ATLAS VMs

4.
You defined an HTTP proxy at 192.168.1.1:8080 and the log shows that this socket can be contacted.
Later the log shows CVMFS making DIRECT connections.
This indicates that either the proxy rejects connections from the client(s) or the proxy can't reach any internet servers.
=> check your proxy setup and your firewall.
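
A quick manual check from one of the clients could look like this - a request for a CVMFS manifest sent once through the proxy and once directly (the stratum-one URL and the .cvmfspublished path are assumptions for illustration):

curl -I -x http://192.168.1.1:8080 http://s1cern-cvmfs.openhtc.io/cvmfs/atlas-nightlies.cern.ch/.cvmfspublished
curl -I http://s1cern-cvmfs.openhtc.io/cvmfs/atlas-nightlies.cern.ch/.cvmfspublished

If the first request fails or returns an error status while the second one works, the proxy (or a firewall rule in front of it) is the culprit.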
PekkaH

Joined: 23 Dec 19
Posts: 18
Credit: 43,743,882
RAC: 20,445
Message 48717 - Posted: 1 Oct 2023, 18:53:58 UTC - in response to Message 48694.  

I can see the same. Over a two-week period my mail server has received ~30k messages which originate from the CERN VMs. I can of course block outgoing mail from the VMs on all hosts, but I'd prefer that they don't send the mails in the first place. Triage so far:
- all cluster hosts' IP addresses are listed as mail origins. These are Linux and Win10/Win11 boxes.
- the Windows boxes do not have mail systems, so the only source can be the task VM itself (to which I do not have access)
- in addition to the messages shown in the OP, I can also see Anacron messages (which I also think originate from inside the task VM)

Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (unknown [x.y.t.z])
by mail.dii.daa (Postfix) with SMTP id 2596D117F5
for <postmaster@localhost>; Sun, 1 Oct 2023 20:46:44 +0300 (EEST)
Received: by localhost (sSMTP sendmail emulation); Sun, 01 Oct 2023 19:46:42 +0200
From: "root" <root@localhost>
Date: Sun, 01 Oct 2023 19:46:42 +0200
To: root
Content-Type: text/plain; charset="ANSI_X3.4-1968"
Subject: Anacron job 'cron.daily' on localhost
Content-Length: 112
Lines: 3
X-UID: 27
Status: OR

/etc/cron.daily/cernvm-update-notification:

Failed to initialize root file catalog (16 - file catalog failure)

Br Pekka
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48731 - Posted: 3 Oct 2023, 8:50:34 UTC

Let's go back to alverb's OP:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6049&postid=48688

Inside a VM cron indeed calls "rsync" once every minute.
Cron also calls "cernvm-update-notification" once every minute.

In both cases there's no "MAILTO=" definition included in the cron files.
/etc/crontab defines "MAILTO=root".
The latter sends a mail to the local root account inside the VM under certain circumstances.

Those mails must not appear outside the VM since sender and recipient are both located inside the VM:
From: "root" <root@localhost>
To: root

It's still unclear where those mails appear since alverb's computer list doesn't show any Linux hosts.
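
One plausible path for how they escape: the quoted headers show "Received: by localhost (sSMTP sendmail emulation)", i.e. the VM hands local mail to sSMTP, which relays everything to a configured mailhub. A minimal /etc/ssmtp/ssmtp.conf producing that behavior would look like this (the actual file inside the VDI has not been checked, so treat it as an assumption):

mailhub=mail        # relay all local mail to a host named "mail"
root=postmaster     # mail for root is readdressed to "postmaster"

If a host named "mail" resolves on the volunteer's LAN, mail addressed to the VM-local root would leave the VM and arrive there as postmaster@localhost - which matches the headers posted above.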



@Pekka
You mentioned those mails have been appearing for about 2 weeks.
The last Theory app update was on 2022-11-07.

All mail addresses shown in your example look like "xyz@localhost" or plain "root" which are both valid either on the host or inside each VM.


@both
You may consider a project reset to get a fresh vdi file just in case the one you are using got damaged.
alverb

Joined: 4 Mar 20
Posts: 5
Credit: 2,984,039
RAC: 1,578
Message 48735 - Posted: 3 Oct 2023, 13:17:13 UTC - in response to Message 48731.  

I confirm that I've found cron tasks inside the VDI files of the LHC@home CernVMs that are causing this behavior.
In another Linux VM I examined two copies of CernVM .VDI files from different applications (Theory Simulation and ATLAS Simulation), which led me to these conclusions.

The mails were escaping from the Windows-based PCs running LHC@home and were received by our Linux-based mail server (which by default has a local "root" account, hence <root@localhost>, and a "postmaster" alias pointing to "root", hence <postmaster@localhost>). I don't have Linux hosts running LHC@home, so I can't confirm that they behave the same way. I think they will, as @PekkaH confirmed, because the applications are based on the same VDI images.

All this was before doing the steps suggested by @computezrmle.
So I've done the following on all machines running LHC@home (one with 8 GB and one with 16 GB of RAM):
- set LHC@home to "No New Tasks";
- waited for all tasks to be reported;
- did a project reset;
- resumed the work fetch;

- set LHC@home to use no more than 50% of the available cores.

Since then both hosts have completed several ATLAS Simulation tasks without sending bulk e-mail messages. So far there have been no new tasks from the other LHC@home applications, so I can't confirm whether they look good too.

Just as a test, today I set the CPU core usage back to 100%.

Concerning connectivity, I have an HTTP proxy server in the network, and although both PCs are allowed to connect directly to the Internet, the BOINC client somehow doesn't communicate correctly without explicit proxy settings.
To exclude any rejects, on the proxy server I've allowed direct connections to hosts and URL patterns containing the following CERN hosts (a minimal squid.conf sketch follows the list):
alice.cern.ch
atlas.cern.ch
atlas-condb.cern.ch
atlas-nightlies.cern.ch
cernvm-prod.cern.ch
cvmfs-config.cern.ch
grid.cern.ch
lhcathome.cern.ch
lhcathome-upload.cern.ch
sft.cern.ch
sft-nightlies.cern.ch
unpacked.cern.ch
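
A minimal squid.conf sketch for that (ACL names are made up; adapt the rules to your existing configuration, and use always_direct as well if your Squid normally forwards to a parent proxy):

acl lhc_cvmfs url_regex -i (alice|atlas|atlas-condb|atlas-nightlies|cernvm-prod|cvmfs-config|grid|sft|sft-nightlies|unpacked)\.cern\.ch
acl lhc_project dstdomain lhcathome.cern.ch lhcathome-upload.cern.ch
http_access allow lhc_cvmfs
http_access allow lhc_project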

I know it would be easier to allow the whole domain "cern.ch", but I have my reasons not to do so.

I'll try to keep you informed whether or not there are any issues.

@computezrmle thank you for your help!
@PekkaH thank you for confirming that I'm not the only one with such issues!
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48742 - Posted: 3 Oct 2023, 14:47:38 UTC - in response to Message 48735.  

Do you use Squid and the squid.conf suggested here?
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5473
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5474


Are you aware that none of the addresses below are host or domain names?
Instead they are CVMFS repository names that just look like FQDNs.
To exclude any rejects, on the proxy server I've allowed direct connections to hosts and URL patterns containing the following CERN hosts:
alice.cern.ch
atlas.cern.ch
atlas-condb.cern.ch
atlas-nightlies.cern.ch
cernvm-prod.cern.ch
cvmfs-config.cern.ch
grid.cern.ch
sft.cern.ch
sft-nightlies.cern.ch
unpacked.cern.ch


The only real FQDNs from your list are these:
lhcathome.cern.ch
lhcathome-upload.cern.ch
alverb

Joined: 4 Mar 20
Posts: 5
Credit: 2,984,039
RAC: 1,578
Message 48747 - Posted: 4 Oct 2023, 6:59:42 UTC - in response to Message 48742.  

I use Squid, but not in exactly that manner; my setup is more complex. That's why I'm just fine-tuning the running configuration by adding ACLs to make it work better with LHC@home and other projects.

Are you aware that none of the addresses below are host or domain names?
Instead they are CVMFS repository names that just look like FQDNs.

Yes, I found that the "addresses" are neither host nor domain names but part of the URL (e.g. "http://s1cern-cvmfs.openhtc.io/cvmfs/atlas-nightlies.cern.ch/"). That's why I'm using "url_regex" instead of "dstdomain".
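
For illustration with the URL above (hypothetical ACL names):

acl cvmfs_host dstdomain .openhtc.io                     # matches only the host part (s1cern-cvmfs.openhtc.io)
acl cvmfs_repo url_regex -i /cvmfs/[a-z.-]+\.cern\.ch/   # matches the repository name inside the URL path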

The only ACLs I've left out so far are the ones in the no-caching part. Even without that part, all requests were going direct.

Regarding "mail issue" - till now there are not bulk messages but still all hosts are receiving only "ATLAS Simulation" tasks.

Best Regards!
PekkaH

Joined: 23 Dec 19
Posts: 18
Credit: 43,743,882
RAC: 20,445
Message 48755 - Posted: 5 Oct 2023, 15:19:05 UTC - in response to Message 48731.  

Hi,

Sorry for being away for a few days.

- my records start from 16 September and I could see the mails coming in constantly. It is of course possible that they were flowing earlier, but I do not have records of those.

I can configure my system back to the same setup so that I can see the mails and the problem again. Hopefully I'll have fresh data for you tomorrow. BTW, my setup has Ubuntu 22.04 servers and Win10/Win11 desktops.

Br Pekka
PekkaH

Joined: 23 Dec 19
Posts: 18
Credit: 43,743,882
RAC: 20,445
Message 48979 - Posted: 5 Dec 2023, 11:32:26 UTC - in response to Message 48755.  

Hi Again,

This problem is still active; it seems that my mail server's root mailbox is full of these messages (172,937 messages in 2 months).

The latest look like the examples below (IP addresses & domain names obscured):
=========
Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (unknown [k.l.m.n])
by mail.x.y.z (Postfix) with SMTP id 229D911768
for <postmaster@localhost>; Tue, 5 Dec 2023 10:35:14 +0000 (UTC)
Received: by localhost (sSMTP sendmail emulation); Tue, 05 Dec 2023 11:35:12 +0100
From: "root" <root@localhost>
Date: Tue, 05 Dec 2023 11:35:12 +0100
To: root
Content-Type: text/plain; charset="ANSI_X3.4-1968"
Subject: Anacron job 'cron.daily' on localhost
X-UID: 172936
Status: O

/etc/cron.daily/cernvm-update-notification:

Failed to initialize root file catalog (16 - file catalog failure)
========
and like this (obscured):
=============
Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (k.l.m.n)
by mail.x.y.z (Postfix) with SMTP id 009522487E
for <postmaster@localhost>; Sat, 2 Dec 2023 14:36:02 +0000 (UTC)
Received: by localhost (sSMTP sendmail emulation); Sat, 02 Dec 2023 15:36:01 +0100
From: "root" <root@localhost>
Date: Sat, 02 Dec 2023 15:36:01 +0100
To: root
Subject: Cron <root@localhost> rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/
Content-Type: text/plain; charset=ANSI_X3.4-1968
Auto-Submitted: auto-generated
Precedence: bulk
X-Cron-Env: <XDG_SESSION_ID=2133>
X-Cron-Env: <XDG_RUNTIME_DIR=/run/user/0>
X-Cron-Env: <LANG=C>
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/root>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>
X-UID: 170001
Status: O

rsync: change_dir "/home/boinc/cernvm/shared/html/job" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]
=================

I can dig for more; it would be nice to get rid of these. It seems that this problem happens at least when you have a vanilla Ubuntu 22.04 server and a host named "mail" configured in the network - then that host gets flooded by the LHC jobs.
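
Until the root cause is fixed, a stop-gap on the receiving Postfix server could be to discard exactly these subjects via header_checks (a sketch, untested; file locations follow the usual Debian/Ubuntu layout):

# /etc/postfix/main.cf
header_checks = regexp:/etc/postfix/header_checks

# /etc/postfix/header_checks
/^Subject: Cron <root@localhost> rsync -au --delete \/home\/boinc\/cernvm/  DISCARD cernvm cron noise
/^Subject: Anacron job 'cron\.daily' on localhost/                          DISCARD cernvm anacron noise

Reload Postfix afterwards; regexp tables do not need postmap.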

Br Pekka
