Cron of CERNVM sends lots of e-mail messages to root@localhost on failure/error
Joined: 4 Mar 20 Posts: 5 Credit: 2,979,499 RAC: 1,294

On failure or error of a scheduled task, the crond of the CernVM sends lots of e-mail messages (about 150 per hour) to root@localhost. Here are some sample messages:

Is it possible to stop sending such messages?
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609

For a closer look you should:
- make your computers visible to other volunteers (https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project)
- post a link to the computer you got those logs from
- post a link to an example task that computer has already reported
- describe how/where you got the snippets
Joined: 4 Mar 20 Posts: 5 Credit: 2,979,499 RAC: 1,294

For a closer look you should:
- Computer: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10641093
- Task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=399846006

I have a local mail server which receives the messages in its root mailbox. Here is a sample message source:

Return-Path: <root@localhost>

In the .VDI file of the task I've found the "/persistent/etc/crond/sync-plots" cron tab causing this behavior:

* * * * * root rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/
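For reference, such a file inside a .VDI can be read from a Linux host without booting the VM. A minimal sketch, assuming the libguestfs tools (virt-cat/guestfish) are installed; the image path is only a placeholder:

    # read-only inspection of the VDI image (path is a placeholder)
    virt-cat -a /path/to/cernvm_image.vdi /persistent/etc/crond/sync-plots
    # or with guestfish, read-only, auto-mounting the guest filesystems
    guestfish --ro -a /path/to/cernvm_image.vdi -i cat /persistent/etc/crond/sync-plots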
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609

Looks like you use the reserved domain "example.com" in your local environment. According to RFC 6761, "example.", "example.com.", "example.net." and "example.org." should never be used for that, since they are reserved for documentation purposes only.
See: https://www.rfc-editor.org/rfc/rfc6761.html

If you need a local (sub-)domain without official delegation, use the reserved domain name "home.arpa" as defined in RFC 8375:
https://www.rfc-editor.org/rfc/rfc8375.html
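As a small illustration (not part of the original posts), a local resolver such as dnsmasq could serve a "home.arpa" zone like this; the host name and address below are made up:

    # /etc/dnsmasq.conf (sketch)
    domain=home.arpa                       # hand out home.arpa as the local domain via DHCP
    local=/home.arpa/                      # never forward home.arpa queries to upstream resolvers
    address=/mail.home.arpa/192.168.1.10   # example record for a local mail host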
Joined: 4 Mar 20 Posts: 5 Credit: 2,979,499 RAC: 1,294

Sorry, I didn't mention that I've replaced all sensitive data with common placeholders to protect the real values, e.g. "mydomain.tld" with "example.com".
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456

Do you have Acronis Cyber Protect Home? If so, you have to disable Secure, or allow the BOINC folder in Acronis.
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609

Since you obfuscated relevant data it is not possible to give a qualified answer regarding your mail issue. In case the vdi file is broken:
- set LHC@home to NNT (no new tasks)
- report all tasks
- do a project reset
- resume work fetch

Nonetheless, your logs show a couple of other weird entries.
1. Computer 10641093 reports less than 8 GB RAM. LHC@home expects at least 16 GB.
2. Your tasks suffer from series of suspend/resume. This can break network transfers and puts a huge load on the IO system.
3. The computer reports 4 cores and your ATLAS tasks are configured to use all of them. Then you throttle BOINC to 90% CPU usage. VirtualBox recommends not to run VMs with more than 50% of the available cores per VM. => 2 cores would be the limit on your computer for ATLAS VMs.
4. You defined an HTTP proxy at 192.168.1.1:8080 and the log shows that socket can be contacted. Later the log shows CVMFS making DIRECT connections. This points to either the proxy rejecting connections from the client(s) or the proxy being unable to contact any internet servers. => check your proxy setup and your firewall.
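Regarding point 3 above, one way to cap the number of cores per ATLAS VM is an app_config.xml in the LHC@home project directory of the BOINC data directory. This is only a sketch; the plan_class value is an assumption and should be verified against the plan class shown on the task's details page:

    <app_config>
      <app_version>
        <app_name>ATLAS</app_name>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <avg_ncpus>2</avg_ncpus>
      </app_version>
    </app_config>

After saving the file (typically under projects/lhcathome.cern.ch_lhcathome/), re-read the config files from the BOINC Manager or restart the client; the change applies to newly started tasks.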
Joined: 23 Dec 19 Posts: 18 Credit: 43,700,541 RAC: 17,479

I can see the same. Over the last two weeks my mail server has received ~30k messages which originate from the CernVMs. I can of course configure all hosts so that the VMs are not allowed to send mail, but I would prefer that they don't send it in the first place.

Triage so far:
- all cluster host IP addresses are listed as mail origins. These are Linux and Win10/Win11 boxes.
- the Windows boxes do not have mail systems, so the only source can be the task VM itself (into which I do not have access)
- in addition to the messages shown in the opening post I can also see Anacron messages (which I also think originate from inside the task VM)

Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (unknown [x.y.t.z]) by mail.dii.daa (Postfix) with SMTP id 2596D117F5 for <postmaster@localhost>; Sun, 1 Oct 2023 20:46:44 +0300 (EEST)
Received: by localhost (sSMTP sendmail emulation); Sun, 01 Oct 2023 19:46:42 +0200
From: "root" <root@localhost>
Date: Sun, 01 Oct 2023 19:46:42 +0200
To: root
Content-Type: text/plain; charset="ANSI_X3.4-1968"
Subject: Anacron job 'cron.daily' on localhost
Content-Length: 112
Lines: 3
X-UID: 27
Status: OR

/etc/cron.daily/cernvm-update-notification: Failed to initialize root file catalog (16 - file catalog failure)

Br Pekka
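As a stopgap on the receiving side (blocking the mails rather than preventing them, as mentioned above), a Postfix server could reject SMTP connections from the BOINC hosts with a client access map. This is a sketch; the IP addresses and file path are placeholders:

    # /etc/postfix/client_access  (placeholder addresses)
    192.168.1.50    REJECT cron mail from BOINC VM host
    192.168.1.51    REJECT cron mail from BOINC VM host

    # main.cf
    smtpd_client_restrictions = check_client_access hash:/etc/postfix/client_access, permit

    # activate the map and reload Postfix
    postmap /etc/postfix/client_access
    postfix reload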
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609

Let's go back to alverb's OP: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=6049&postid=48688

Inside a VM cron indeed calls "rsync" once every minute. Cron also calls "cernvm-update-notification" once every minute. In both cases there's no "MAILTO=" definition included in the cron files. /etc/crontab defines "MAILTO=root". The latter sends a mail to the local root account inside the VM under certain circumstances. Those mails must not appear outside the VM since sender and recipient are both located inside the VM:

From: "root" <root@localhost>
To: root

It's still unclear where those mails appear, since alverb's computer list doesn't show any Linux hosts.

@Pekka
You mentioned those mails have been appearing for about 2 weeks. The last Theory app update was 2022-11-07. All mail addresses shown in your example look like "xyz@localhost" or plain "root", which are both valid either on the host or inside each VM.

@both
You may consider a project reset to get a fresh vdi file, just in case the one you are using got damaged.
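For reference, the usual way to silence mail for a single cron job is an empty MAILTO and/or redirecting its output. A sketch of what the sync-plots entry could look like with that change (the path follows the earlier post; this is not something a volunteer can edit inside the distributed vdi):

    # /persistent/etc/crond/sync-plots (sketch)
    MAILTO=""
    * * * * * root rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/ >/dev/null 2>&1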
Joined: 4 Mar 20 Posts: 5 Credit: 2,979,499 RAC: 1,294

I confirm that I've found cron tasks inside the VDI files of LHC@home CernVMs that are causing this behavior. In another Linux VM I've examined two copies of CernVM .VDI files from different projects (Theory Simulation and ATLAS Simulation), which led me to these conclusions. The mails were slipping away from the Windows based PCs running LHC@home and were received by our Linux based mail server (where, by default, there is a local "root" account, hence <root@localhost>, and a "postmaster" alias pointing to "root", hence <postmaster@localhost>). I don't have Linux hosts running LHC@home so I can't confirm that they behave the same way. I think they do, as @PekkaH confirmed, because the applications are based on the same VDI images.

All this was before doing the steps suggested by @computezrmle. So I've done the following on all machines running LHC@home (one with 8 GB and one with 16 GB of RAM):
- set LHC@home to "No New Tasks";
- waited for all tasks to be reported;
- did a project reset;
- resumed the work fetch;
- set LHC@home to use no more than 50% of the available cores.

Since then both hosts have completed several ATLAS Simulation tasks without sending bulk e-mail messages. Until now there are no new tasks from the other LHC@home applications, so I can't yet confirm whether those look good too. Just as a test, today I've set CPU core usage back to 100%.

Concerning connectivity, I have an HTTP proxy server in the network and, although both PCs are allowed to connect directly to the Internet, the BOINC client somehow doesn't communicate correctly without explicit proxy settings. To exclude any rejects, on the proxy server I've set direct connections to hosts and URL patterns containing the following CERN hosts:
alice.cern.ch
atlas.cern.ch
atlas-condb.cern.ch
atlas-nightlies.cern.ch
cernvm-prod.cern.ch
cvmfs-config.cern.ch
grid.cern.ch
lhcathome.cern.ch
lhcathome-upload.cern.ch
sft.cern.ch
sft-nightlies.cern.ch
unpacked.cern.ch

I know it would be easier with the whole "cern.ch" domain, but I have my reasons not to do so. I'll try to keep you informed whether or not there are any issues.

@computezrmle thank you for your help!
@PekkaH thank you for confirming that I'm not the only one with such issues!
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609

Do you use Squid and the squid.conf suggested here?
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5473
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5474

Are you aware that none of the addresses in your list are host or domain names? Instead they are CVMFS repository names that just look like FQDNs.

"To exclude any rejects, on the proxy server I've set direct connections to hosts and URL patterns containing the following CERN hosts:"

The only real FQDNs from your list are these:
lhcathome.cern.ch
lhcathome-upload.cern.ch
Joined: 4 Mar 20 Posts: 5 Credit: 2,979,499 RAC: 1,294

I use Squid, but not in exactly that manner, rather in a more complex way. That's why I'm just fine-tuning the running configuration by adding ACLs to make it work better with LHC@home and others.

"Are you aware that none of the addresses in your list are host or domain names?"

Yes, I found that the "addresses" are not host or domain names but part of the URL (e.g. "http://s1cern-cvmfs.openhtc.io/cvmfs/atlas-nightlies.cern.ch/"). That's why I'm using "url_regex" and not "dstdomain". The only ACLs I've missed are in the non-caching part. Even without that part all requests were direct.

Regarding the mail issue: until now there are no bulk messages, but all hosts are still receiving only "ATLAS Simulation" tasks.

Best Regards!
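As an illustration of the url_regex approach described above (a sketch only; the ACL names and the always_direct policy are assumptions, not the poster's actual configuration):

    # squid.conf (sketch)
    # CVMFS repository names are not real FQDNs, so match them in the URL path
    acl cvmfs_repos url_regex -i /cvmfs/[a-z0-9.-]+\.cern\.ch/
    # the real FQDNs can be matched by destination domain
    acl cern_fqdns dstdomain lhcathome.cern.ch lhcathome-upload.cern.ch
    # let matching requests go direct instead of through a cache peer
    always_direct allow cvmfs_repos
    always_direct allow cern_fqdns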
Joined: 23 Dec 19 Posts: 18 Credit: 43,700,541 RAC: 17,479

Hi, sorry for being away for a few days.
- My records start from 16th Sept and I could see the mails coming in constantly. It is of course possible that they had been flowing in earlier, but I do not have records of those.
I can configure my system back to the same setup so that I can see the mails & the problem again. Hopefully I will have fresh data for you tomorrow.
BTW my setup has Ubuntu 22.04 servers and Win10 and Win11 desktops.
Br Pekka
Joined: 23 Dec 19 Posts: 18 Credit: 43,700,541 RAC: 17,479

Hi again,

This problem is still active; it seems that my mail server's root mailbox is full of these messages (172,937 messages in 2 months). The latest look like the ones below (IP addresses & domain names obscured):

=========
Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (unknown [k.l.m.n]) by mail.x.y.z (Postfix) with SMTP id 229D911768 for <postmaster@localhost>; Tue, 5 Dec 2023 10:35:14 +0000 (UTC)
Received: by localhost (sSMTP sendmail emulation); Tue, 05 Dec 2023 11:35:12 +0100
From: "root" <root@localhost>
Date: Tue, 05 Dec 2023 11:35:12 +0100
To: root
Content-Type: text/plain; charset="ANSI_X3.4-1968"
Subject: Anacron job 'cron.daily' on localhost
X-UID: 172936
Status: O

/etc/cron.daily/cernvm-update-notification: Failed to initialize root file catalog (16 - file catalog failure)
========

and like this (obscured):

=============
Return-Path: <root@localhost>
X-Original-To: postmaster@localhost
Delivered-To: postmaster@localhost
Received: from localhost (k.l.m.n) by mail.x.y.z (Postfix) with SMTP id 009522487E for <postmaster@localhost>; Sat, 2 Dec 2023 14:36:02 +0000 (UTC)
Received: by localhost (sSMTP sendmail emulation); Sat, 02 Dec 2023 15:36:01 +0100
From: "root" <root@localhost>
Date: Sat, 02 Dec 2023 15:36:01 +0100
To: root
Subject: Cron <root@localhost> rsync -au --delete /home/boinc/cernvm/shared/html/job/ /var/www/html/job/
Content-Type: text/plain; charset=ANSI_X3.4-1968
Auto-Submitted: auto-generated
Precedence: bulk
X-Cron-Env: <XDG_SESSION_ID=2133>
X-Cron-Env: <XDG_RUNTIME_DIR=/run/user/0>
X-Cron-Env: <LANG=C>
X-Cron-Env: <SHELL=/bin/sh>
X-Cron-Env: <HOME=/root>
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=root>
X-Cron-Env: <USER=root>
X-UID: 170001
Status: O

rsync: change_dir "/home/boinc/cernvm/shared/html/job" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1178) [sender=3.1.2]
=================

I can dig for more; it would be nice to get rid of these. It seems this problem happens at least when you have a vanilla Ubuntu 22.04 server and a host named "mail" configured in the network; that host then gets flooded by the LHC jobs.

Br Pekka
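If silently dropping exactly these messages on the Postfix server is acceptable until the image is fixed, a header check is one possible mitigation. This is a sketch; it assumes the pcre map type is available (package postfix-pcre on Ubuntu) and matches the two Subject lines shown above:

    # /etc/postfix/header_checks.pcre
    /^Subject: Anacron job 'cron\.daily' on localhost/        DISCARD cernvm cron noise
    /^Subject: Cron <root@localhost> rsync -au --delete/      DISCARD cernvm cron noise

    # main.cf
    header_checks = pcre:/etc/postfix/header_checks.pcre

    # activate
    postfix reload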