Questions and Answers : Getting started : Cron of CERNVM sends lots of e-mail messages to root@localhost on failure/error


alverb

Joined: 4 Mar 20
Posts: 5
Credit: 2,979,499
RAC: 1,294
Message 48688 - Posted: 29 Sep 2023, 12:10:41 UTC

On failure/error of a scheduled task, the crond of CernVM sends lots of e-mail messages (about 150 per hour) to root@localhost.
Here are some sample messages:


From: "root" <root@localhost>
To: root
Subject: Cron <root@localhost> rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/

rsync: change_dir "/home/boinc/cernvm/shared/html/job" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]



From: "root" <root@localhost>
To: root
Subject: Anacron job 'cron.daily' on localhost

/etc/cron.daily/cernvm-update-notification:

Failed to initialize root file catalog (16 - file catalog failure)


Is it possible to stop these messages from being sent?
ID: 48688
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48689 - Posted: 29 Sep 2023, 12:44:59 UTC - in response to Message 48688.  

For a closer look you should
- make your computers visible for other volunteers (https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project)
- post a link to the computer you got those logs from
- post a link to an example task that computer has already reported
- describe how/where you got the snippets from
ID: 48689
alverb

Joined: 4 Mar 20
Posts: 5
Credit: 2,979,499
RAC: 1,294
Message 48693 - Posted: 29 Sep 2023, 14:02:23 UTC - in response to Message 48689.  

For a closer look you should
- make your computers visible for other volunteers (https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project)
- post a link to the computer you got those logs from
- post a link to an example task that computer has already reported
- describe how/where you got the snippets from


- Computer: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10641093
- Task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=399846006

I have a local mail server which receives the messages in its root mailbox.

Here is a sample message source:

Return-Path: <root@localhost>
Delivered-To: admin@example.com
Received: from localhost (localhost [127.0.0.1])
by mail.example.com (mail) with ESMTP id ID
for <postmaster@localhost>; Fri, 29 Sep 2023 13:33:43 +0300 (EEST)
X-Virus-Scanned: amavis
Received: from mail.example.com ([127.0.0.1])
by localhost (mail.example.com [127.0.0.1]) (amavis, port 10024)
with ESMTP id ID for <postmaster@localhost>;
Fri, 29 Sep 2023 13:33:43 +0300 (EEST)
Received: from localhost (unknown [1.2.3.4])
by mail.example.com (mail) with SMTP id ID
for <postmaster@localhost>; Fri, 29 Sep 2023 13:33:42 +0300 (EEST)
Received: by localhost (sSMTP sendmail emulation); Fri, 29 Sep 2023 13:33:01 +0300
From: "root" <root@localhost>
Date: Fri, 29 Sep 2023 13:33:01 +0300
To: root
Subject: Cron <root@localhost> rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/
Content-Type: text/plain; charset=ANSI_X3.4-1968
Auto-Submitted: auto-generated
Precedence: bulk
X-Cron-Env: <XDG_SESSION_ID=108>
X-Cron-Env: <XDG_RUNTIME_DIR=/run/user/0>
X-Cron-Env: <LANG=C>
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/root>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>

rsync: change_dir "/home/boinc/cernvm/shared/html/job" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]


In the .VDI file of the task I've found the crontab "/persistent/etc/crond/sync-plots" causing this behavior:
* * * * * root rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/
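In principle, cron mail for that job could be silenced by giving the crontab an empty MAILTO or by discarding the job's output. A sketch (untested on my side, and any change inside the VM would be lost with a fresh vdi):

```
# /persistent/etc/crond/sync-plots (hypothetical edit, not verified against the image)
MAILTO=""
* * * * * root rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/ >/dev/null 2>&1
```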
ID: 48693
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48694 - Posted: 29 Sep 2023, 16:13:52 UTC - in response to Message 48693.  

Looks like you use the reserved domain "example.com" in your local environment.
According to RFC6761, "example.", "example.com.", "example.net." and "example.org." should never be used for that, since they are reserved for documentation purposes only.
See:
https://www.rfc-editor.org/rfc/rfc6761.html


If you need a local (sub-)domain without official delegation, use the reserved domain name "home.arpa" as defined in RFC8375:
https://www.rfc-editor.org/rfc/rfc8375.html
ID: 48694
alverb

Joined: 4 Mar 20
Posts: 5
Credit: 2,979,499
RAC: 1,294
Message 48701 - Posted: 30 Sep 2023, 2:01:01 UTC - in response to Message 48694.  

Sorry, I didn't mention that I replaced all sensitive data with generic placeholders to protect the real values, like "mydomain.tld" with "example.com", etc.
ID: 48701
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 48702 - Posted: 30 Sep 2023, 2:43:55 UTC - in response to Message 48701.  

Do you have Acronis Cyber Protect Home?
You have to disable Secure, or allow the BOINC folder in Acronis.
ID: 48702
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48706 - Posted: 30 Sep 2023, 7:45:42 UTC - in response to Message 48701.  

Since you obfuscated relevant data it is not possible to give a qualified answer regarding your mail issue.
In case the vdi file is broken
- set LHC@home to NNT
- report all tasks
- do a project reset
- resume work fetch
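For reference, the steps above can also be done with BOINC's command line tool (a sketch; requires a running BOINC client, project URL as registered):

```shell
PROJ=https://lhcathome.cern.ch/lhcathome/
boinccmd --project $PROJ nomorework     # NNT: stop fetching new tasks
# wait until all remaining tasks are finished and reported, then:
boinccmd --project $PROJ reset          # project reset (fetches a fresh vdi)
boinccmd --project $PROJ allowmorework  # resume work fetch
```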


Nonetheless, your logs show a couple of other weird entries.

1.
Computer 10641093 reports less than 8 GB RAM.
LHC@home expects at least 16 GB.

2.
Your tasks suffer from a series of suspend/resume cycles.
This can break network transfers and puts a huge load on the I/O system.

3.
The computer reports 4 cores and your ATLAS tasks are configured to use all of them.
On top of that you throttle BOINC to 90% CPU usage.
VirtualBox recommends not running VMs with more than 50% of the available cores per VM.
=> 2 cores would be the limit on your computer for ATLAS VMs

4.
You defined an HTTP proxy at 192.168.1.1:8080 and the log shows that this socket can be contacted.
Later the log shows CVMFS making DIRECT connections.
This suggests that either the proxy rejects connections from the client(s) or the proxy can't reach any internet servers.
=> check your proxy setup and your firewall.
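Both paths can be probed from the client. A sketch (the proxy address is taken from your log; the stratum-1 test URL is an assumption on my side):

```shell
# direct connection (bypassing the proxy); expect an "HTTP/1.1 200 OK" status line
curl -sI http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished | head -n1
# same request forced through the proxy; a non-200 reply points at the proxy setup
curl -sI -x http://192.168.1.1:8080 http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished | head -n1
```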
ID: 48706
PekkaH

Joined: 23 Dec 19
Posts: 18
Credit: 43,700,541
RAC: 17,479
Message 48717 - Posted: 1 Oct 2023, 18:53:58 UTC - in response to Message 48694.  

I can see the same. Over the past 2 weeks my mail server has received ~30k messages which originate from the CERN VMs. I can of course block all hosts so that the VMs are not allowed to send mails, but I'd prefer that they don't send them in the first place. Triage so far:
- all cluster hosts' IP addresses are listed as mail origins. These are Linux and Win10/Win11 boxes.
- the Win boxes do not have mail systems, so the only source can be the task VM itself (into which I do not have access)
- in addition to the messages shown in the opening post I can also see anacron messages (which I also think originate from inside the task VM)

Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (unknown [x.y.t.z])
by mail.dii.daa (Postfix) with SMTP id 2596D117F5
for <postmaster@localhost>; Sun, 1 Oct 2023 20:46:44 +0300 (EEST)
Received: by localhost (sSMTP sendmail emulation); Sun, 01 Oct 2023 19:46:42 +0200
From: "root" <root@localhost>
Date: Sun, 01 Oct 2023 19:46:42 +0200
To: root
Content-Type: text/plain; charset="ANSI_X3.4-1968"
Subject: Anacron job 'cron.daily' on localhost
Content-Length: 112
Lines: 3
X-UID: 27
Status: OR

/etc/cron.daily/cernvm-update-notification:

Failed to initialize root file catalog (16 - file catalog failure)

Br Pekka
ID: 48717
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48731 - Posted: 3 Oct 2023, 8:50:34 UTC

Let's go back to alverb's OP:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6049&postid=48688

Inside a VM cron indeed calls "rsync" once every minute.
Cron also calls "cernvm-update-notification" once every minute.

In both cases there's no "MAILTO=" definition included in the cron files.
/etc/crontab defines "MAILTO=root".
The latter sends a mail to the local root account inside the VM under certain circumstances.

Those mails must not appear outside the VM since sender and recipient are both located inside the VM:
From: "root" <root@localhost>
To: root

It's still unclear where those mails appear since alverb's computer list doesn't show any Linux hosts.



@Pekka
You mentioned those mails have been appearing for about 2 weeks.
The last Theory app update was on 2022-11-07.

All mail addresses shown in your example look like "xyz@localhost" or plain "root", both of which are valid either on the host or inside each VM.


@both
You may consider a project reset to get a fresh vdi file just in case the one you are using got damaged.
ID: 48731
alverb

Joined: 4 Mar 20
Posts: 5
Credit: 2,979,499
RAC: 1,294
Message 48735 - Posted: 3 Oct 2023, 13:17:13 UTC - in response to Message 48731.  

I confirm that I've found cron jobs inside the .VDI files of the LHC@home CernVMs that are causing this behavior.
In another Linux VM I've examined two copies of CernVM .VDI files from different projects (Theory Simulation and ATLAS Simulation), which led me to these conclusions.

The mails were escaping from the Windows-based PCs running LHC@home and were received by our Linux-based mail server (which by default has a local "root" account, hence <root@localhost>, and a "postmaster" alias pointing to "root", hence <postmaster@localhost>). I don't have any Linux hosts running LHC@home, so I can't confirm that they behave the same way. I think they do, as @PekkaH confirmed, because the applications are based on the same VDI images.

All this was before doing the steps suggested by @computezrmle.
So I've done the following on all machines running LHC@home (one with 8 GB and one with 16 GB of RAM):
- set LHC@home to "No New Tasks";
- waited for all tasks to be reported;
- did a project reset;
- resumed work fetch;

- set LHC@home to use no more than 50% of the available cores.

Since then both hosts have completed several ATLAS Simulation tasks without sending bulk e-mail messages. So far there are no new tasks from the other LHC@home applications, so I can't confirm whether they behave as well.

Just as a test, today I set the CPU core usage back to 100%.

Concerning connectivity, I have an HTTP proxy server in the network, and although both PCs are allowed to connect directly to the Internet, the BOINC client somehow doesn't communicate correctly without explicit proxy settings.
To exclude any rejects, on the proxy server I've allowed direct connections to hosts and URL patterns containing the following CERN names:
alice.cern.ch
atlas.cern.ch
atlas-condb.cern.ch
atlas-nightlies.cern.ch
cernvm-prod.cern.ch
cvmfs-config.cern.ch
grid.cern.ch
lhcathome.cern.ch
lhcathome-upload.cern.ch
sft.cern.ch
sft-nightlies.cern.ch
unpacked.cern.ch

I know it would be easier to allow the whole "cern.ch" domain, but I have my reasons not to do so.

I'll keep you informed whether or not there are any issues.

@computezrmle thank you for your help!
@PekkaH thank you for confirming that I'm not the only one with such issues!
ID: 48735
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 48742 - Posted: 3 Oct 2023, 14:47:38 UTC - in response to Message 48735.  

Do you use Squid and the squid.conf suggested here?
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5473
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5474


Are you aware that none of the addresses below are host or domain names?
Instead they are CVMFS repository names that just look like FQDNs.
To exclude any rejects, on the proxy server I've set direct connections to hosts and URL patterns containing following Cern hosts:
alice.cern.ch
atlas.cern.ch
atlas-condb.cern.ch
atlas-nightlies.cern.ch
cernvm-prod.cern.ch
cvmfs-config.cern.ch
grid.cern.ch
sft.cern.ch
sft-nightlies.cern.ch
unpacked.cern.ch


The only real FQDNs from your list are these:
lhcathome.cern.ch
lhcathome-upload.cern.ch
ID: 48742
alverb

Joined: 4 Mar 20
Posts: 5
Credit: 2,979,499
RAC: 1,294
Message 48747 - Posted: 4 Oct 2023, 6:59:42 UTC - in response to Message 48742.  

I use Squid, but not in exactly that manner; my setup is more complex. That's why I'm just fine-tuning the running configuration by adding ACLs to make it work better with LHC@home and others.

Are you aware that none of the addresses below are host or domain names?
Instead they are CVMFS repository names that just look like FQDNs.

Yes, I found that the "addresses" are neither host nor domain names but parts of URLs (e.g. "http://s1cern-cvmfs.openhtc.io/cvmfs/atlas-nightlies.cern.ch/"). That's why I'm using "url_regex" and not "dstdomain".
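A minimal sketch of that style of ACL (ACL name and pattern are made up for illustration, not my real config; with a parent proxy, always_direct is the relevant directive, otherwise a plain http_access allow would do):

```
# squid.conf fragment (hypothetical names)
acl cvmfs_repos url_regex -i /cvmfs/[a-z0-9-]+\.cern\.ch/
always_direct allow cvmfs_repos
```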

The only ACLs I had left out are in the no-caching part. Even without that part, all requests were direct.

Regarding the mail issue: so far there are no bulk messages, but all hosts are still receiving only "ATLAS Simulation" tasks.

Best Regards!
ID: 48747
PekkaH

Joined: 23 Dec 19
Posts: 18
Credit: 43,700,541
RAC: 17,479
Message 48755 - Posted: 5 Oct 2023, 15:19:05 UTC - in response to Message 48731.  

Hi,

Sorry for being away for a few days.

- My records start from 16th Sept, and I could see the mails coming in constantly. It is of course possible that they were flowing in earlier, but I have no records of those.

I can configure my system back to the same setup so that I can see the mails & the problem again. Hopefully I'll have fresh data for you tomorrow. BTW my setup has Ubuntu 22.04 servers and Win10/Win11 desktops.

Br Pekka
ID: 48755
PekkaH

Joined: 23 Dec 19
Posts: 18
Credit: 43,700,541
RAC: 17,479
Message 48979 - Posted: 5 Dec 2023, 11:32:26 UTC - in response to Message 48755.  

Hi Again,

This problem is still active; it seems that my mail server's root mailbox is full of these messages (172,937 messages in 2 months).

The latest look like the one below (IP addresses & domain names obscured):
=========
Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (unknown [k.l.m.n])
by mail.x.y.z (Postfix) with SMTP id 229D911768
for <postmaster@localhost>; Tue, 5 Dec 2023 10:35:14 +0000 (UTC)
Received: by localhost (sSMTP sendmail emulation); Tue, 05 Dec 2023 11:35:12 +0100
From: "root" <root@localhost>
Date: Tue, 05 Dec 2023 11:35:12 +0100
To: root
Content-Type: text/plain; charset="ANSI_X3.4-1968"
Subject: Anacron job 'cron.daily' on localhost
X-UID: 172936
Status: O

/etc/cron.daily/cernvm-update-notification:

Failed to initialize root file catalog (16 - file catalog failure)
========
and like this (obscured):
=============
Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (k.l.m.n)
by mail.x.y.z (Postfix) with SMTP id 009522487E
for <postmaster@localhost>; Sat, 2 Dec 2023 14:36:02 +0000 (UTC)
Received: by localhost (sSMTP sendmail emulation); Sat, 02 Dec 2023 15:36:01 +0100
From: "root" <root@localhost>
Date: Sat, 02 Dec 2023 15:36:01 +0100
To: root
Subject: Cron <root@localhost> rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/
Content-Type: text/plain; charset=ANSI_X3.4-1968
Auto-Submitted: auto-generated
Precedence: bulk
X-Cron-Env: <XDG_SESSION_ID=2133>
X-Cron-Env: <XDG_RUNTIME_DIR=/run/user/0>
X-Cron-Env: <LANG=C>
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/root>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>
X-UID: 170001
Status: O

rsync: change_dir "/home/boinc/cernvm/shared/html/job" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]
=================

I can dig up more; it would be nice to get rid of these. It seems that this problem happens at least when you have a vanilla Ubuntu 22.04 server and a host named "mail" configured in the network - that host then gets flooded by the LHC jobs.
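Until the images are fixed, a possible server-side workaround is to discard these messages at the MTA. A sketch for Postfix (assuming the Subject headers shown above are stable; enable with "header_checks = regexp:/etc/postfix/header_checks" in main.cf):

```
# /etc/postfix/header_checks
/^Subject: Cron <root@localhost> rsync -au --delete/      DISCARD cernvm cron mail
/^Subject: Anacron job 'cron\.daily' on localhost/        DISCARD cernvm anacron mail
```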

Br Pekka
ID: 48979



©2024 CERN