Message boards : CMS Application : CMS Tasks Failing
Joined: 24 Oct 04 Posts: 1173 Credit: 54,858,272 RAC: 16,639
Yes, that is hard to believe. What usually happens is that members see that number, download a batch on several computers, try a couple, and find they fail (most of the time after running for a few hours). Then they suspend the rest and wonder what happened. But 7,931 running is hard to believe even when they are working.
Joined: 18 Dec 15 Posts: 1815 Credit: 118,586,926 RAC: 35,258
From what I see on the Server Status page, the number of "users in the past 24 hours" has dropped to 79 (a week ago it was, if I remember correctly, about 150). Could this mean that the CMS tasks are still failing? Has really nobody taken care of the problem so far?
Joined: 15 Jun 08 Posts: 2534 Credit: 253,985,651 RAC: 44,497
A CMS task is nothing but an envelope that runs CMS subtasks (= jobs). The BOINC server page shows statistics about those envelopes. Subtask statistics can be seen on the Grafana pages, e.g.:
https://monit-grafana.cern.ch/d/000000628/cms-job-monitoring?orgId=11&from=now-3d&to=now-12m&refresh=15m&var-group_by=CMS_JobType&var-Tier=All&var-Site=T3_CH_Volunteer&var-Type=All&var-CMS_JobType=All&var-CMSPrimaryDataTier=All&var-binning=1h&var-measurement=condor_1h&var-retention_policy=long
If you don't have a regular CERN account, you may use your public Facebook or Google account to log in.
Grafana shows that at the moment 160 subtasks are running. From experience, this is a normal value. The numbers (tasks as well as subtasks and users) are usually much higher when other LHC subprojects dry out and volunteers who allow more than one subproject get CMS instead.
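To make the envelope/subtask distinction concrete, here is a toy model (names and numbers invented for illustration; not LHC@home code): one BOINC task can account for many CMS jobs, which is why the BOINC server's task counts and Grafana's job counts need not match.

    from dataclasses import dataclass

    @dataclass
    class Envelope:
        """A BOINC task: a VM that pulls condor jobs while it is alive."""
        volunteer: str
        jobs_completed: int = 0

        def run(self, condor_queue: list) -> None:
            # Grafana counts these jobs; the BOINC server only sees the one envelope.
            while condor_queue:
                condor_queue.pop()
                self.jobs_completed += 1

    queue = list(range(5))          # five condor jobs waiting
    env = Envelope("volunteer-1")   # one BOINC task
    env.run(queue)
    print(env.jobs_completed)       # -> 5 jobs done inside a single task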
Joined: 18 Dec 15 Posts: 1815 Credit: 118,586,926 RAC: 35,258
After some waiting time, I tried CMS again, just to find out that the tasks are still failing:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=259203295
This task failed after 5 hours 11 minutes:
1 (0x00000001) Unknown error code
2020-01-14 13:51:13 (19508): Guest Log: [ERROR] Condor ended after 18039 seconds.
Since this problem has now existed for quite some time, I am wondering why no one is fixing it. BTW, I haven't heard from Ivan for a long time - is he no longer on board?
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
Thanks. I have a machine that will be available in a few days. It looks like it should go on ATLAS.
Joined: 15 Jun 08 Posts: 2534 Credit: 253,985,651 RAC: 44,497
Grafana shows a failure peak between 13:00 and 14:00 CET today. At present the failure rate is down to 2.5 % and the number of running jobs is 321. My own tasks have shown an error rate of 0.4 % since 2020-01-06. Rather good for a complex app like CMS, isn't it?
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
My own tasks show an error rate of 0.4 % since 2020-01-06.
Yes, that is good enough for me. Fortunately, you can now run native ATLAS along with CMS, so one can serve as the backup for the other.
Joined: 14 Jan 10 Posts: 1419 Credit: 9,470,934 RAC: 2,905
CMS tasks fail due to not getting X509 credentials:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272193767
Joined: 14 Jan 20 Posts: 2 Credit: 29,211 RAC: 0
Same for me (Windows / VBox): CMS tasks fail with the X509 credential error.
Joined: 15 Jun 08 Posts: 2534 Credit: 253,985,651 RAC: 44,497
Same for me (win / vbox), CMS tasks fail with X509 credential error
Besides the fact that the project team has not yet officially announced that CMS is stable again, this setting might be changed:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272197689
2020-05-01 12:16:48 (7752): Setting CPU throttle for VM. (60%)
Simple native BOINC apps without permanent network access can usually be throttled to a very low percentage. Complex apps like those from LHC@home that use VirtualBox should run at 100 % whenever possible; this avoids timing problems on various levels. To reduce the total load on a computer, or to counter temperature/noise problems, it is better to reduce the number of cores used by BOINC instead.
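For reference, this is roughly how that trade-off looks in a client-side global_prefs_override.xml in the BOINC data directory (a hedged sketch: tag names as in recent BOINC clients, values invented; the same settings are exposed in BOINC Manager's computing preferences):

    <global_preferences>
        <!-- Cap the number of cores BOINC may use (here: half of them)... -->
        <max_ncpus_pct>50.0</max_ncpus_pct>
        <!-- ...but leave CPU-time throttling at 100 % so VirtualBox tasks
             are never suspended/resumed mid-flight. -->
        <cpu_usage_limit>100.0</cpu_usage_limit>
    </global_preferences>

After editing the file, have the client re-read it (BOINC Manager's "Read local prefs file" menu item, or restart the client).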
Joined: 11 Jul 19 Posts: 7 Credit: 1,736,153 RAC: 190
I'm facing the same issue. All WUs from the "CMS Simulation v50.00 (vbox64) windows_x86_64" application fail.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272165131
ERROR: Couldn't read proxy from: /tmp/x509up_u0
globus_credential: Error reading proxy credential
globus_credential: Error reading proxy credential: Couldn't read PEM from bio
OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line
Guest Log: Use -debug for further information.
[ERROR] Could not get an x509 credential
[ERROR] The x509 proxy creation failed.
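The "no start line" message from OpenSSL means it found no "-----BEGIN ...-----" header in /tmp/x509up_u0, i.e. the proxy file the VM received was empty or not PEM at all, which points at the credential service rather than the volunteer's machine. A minimal diagnostic sketch of that check (a hypothetical helper, not part of the CMS app; the path is the one from the log):

    # check_proxy.py - does a grid proxy file look like PEM at all?
    import os

    PROXY_PATH = "/tmp/x509up_u0"  # path reported in the failing task's log

    def looks_like_pem(path: str) -> bool:
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            return False
        with open(path, "rb") as f:
            data = f.read()
        # OpenSSL's PEM_read_bio reports "no start line" when this
        # "-----BEGIN" marker is missing from the file.
        return b"-----BEGIN" in data

    if __name__ == "__main__":
        print("proxy OK" if looks_like_pem(PROXY_PATH) else "proxy missing, empty, or not PEM")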
Joined: 11 Jul 19 Posts: 7 Credit: 1,736,153 RAC: 190
Today the error is about Condor:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272679370
Guest Log: [INFO] CMS application starting. Check log files.
Guest Log: [DEBUG] HTCondor ping
Guest Log: [DEBUG] 0
Guest Log: [ERROR] Condor ended after 1324 seconds.
Guest Log: [INFO] Shutting Down.
Joined: 9 Sep 19 Posts: 32 Credit: 2,856,470 RAC: 0
Today the CMS tasks are doing nothing but failing. I looked at the summary: the tasks seem to have been wrongly created as multi-core, but are running as single-core and failing.
Joined: 9 Sep 19 Posts: 32 Credit: 2,856,470 RAC: 0
Perhaps the programmer can see from this what is failing:

<core_client_version>7.16.5</core_client_version>
<![CDATA[
<message>
La pile de l - exit code 207 (0xcf)</message>
<stderr_txt>
2020-05-10 23:53:19 (157820): Detected: vboxwrapper 26197
2020-05-10 23:53:19 (157820): Detected: BOINC client v7.7
2020-05-10 23:53:20 (157820): Detected: VirtualBox VboxManage Interface (Version: 6.0.14)
2020-05-10 23:53:20 (157820): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2020-05-10 23:53:20 (157820): Successfully copied 'init_data.xml' to the shared directory.
2020-05-10 23:53:22 (157820): Create VM. (boinc_312d6963e4768bbf, slot#6)
2020-05-10 23:53:22 (157820): Setting Memory Size for VM. (2048MB)
2020-05-10 23:53:22 (157820): Setting CPU Count for VM. (1)
2020-05-10 23:53:23 (157820): Setting Chipset Options for VM.
2020-05-10 23:53:23 (157820): Setting Boot Options for VM.
2020-05-10 23:53:23 (157820): Setting Network Configuration for NAT.
2020-05-10 23:53:23 (157820): Enabling VM Network Access.
2020-05-10 23:53:24 (157820): Disabling USB Support for VM.
2020-05-10 23:53:25 (157820): Disabling COM Port Support for VM.
2020-05-10 23:53:25 (157820): Disabling LPT Port Support for VM.
2020-05-10 23:53:25 (157820): Disabling Audio Support for VM.
2020-05-10 23:53:25 (157820): Disabling Clipboard Support for VM.
2020-05-10 23:53:26 (157820): Disabling Drag and Drop Support for VM.
2020-05-10 23:53:26 (157820): Adding storage controller(s) to VM.
2020-05-10 23:53:26 (157820): Adding virtual disk drive to VM. (vm_image.vdi)
2020-05-10 23:53:27 (157820): Adding VirtualBox Guest Additions to VM.
2020-05-10 23:53:27 (157820): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2020-05-10 23:53:27 (157820): forwarding host port 65246 to guest port 80
2020-05-10 23:53:27 (157820): Enabling remote desktop for VM.
2020-05-10 23:53:28 (157820): Enabling shared directory for VM.
2020-05-10 23:53:28 (157820): Starting VM using VBoxManage interface. (boinc_312d6963e4768bbf, slot#6)
2020-05-10 23:53:33 (157820): Successfully started VM. (PID = '142712')
2020-05-10 23:53:33 (157820): Reporting VM Process ID to BOINC.
2020-05-10 23:53:33 (157820): Guest Log: BIOS: VirtualBox 6.0.14
2020-05-10 23:53:33 (157820): Guest Log: CPUID EDX: 0x178bfbff
2020-05-10 23:53:33 (157820): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2020-05-10 23:53:33 (157820): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-10 23:53:33 (157820): Detected: Web Application Enabled (http://localhost:65246)
2020-05-10 23:53:33 (157820): Detected: Remote Desktop Enabled (localhost:65247)
2020-05-10 23:53:33 (157820): Preference change detected
2020-05-10 23:53:33 (157820): Setting CPU throttle for VM. (80%)
2020-05-10 23:53:34 (157820): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))
2020-05-10 23:53:35 (157820): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2020-05-10 23:53:35 (157820): Guest Log: BIOS: Booting from Hard Disk...
2020-05-10 23:53:37 (157820): Guest Log: BIOS: KBD: unsupported int 16h function 03
2020-05-10 23:53:37 (157820): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2020-05-10 23:53:52 (157820): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds
2020-05-10 23:53:52 (157820): Guest Log: vboxguest: misc device minor 56, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2020-05-10 23:54:14 (157820): Guest Log: VBoxService 5.2.6 r120293 (verbosity: 0) linux.amd64 (Jan 15 2018 14:51:00) release log
2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000125 main Log opened 2020-05-10T21:54:14.490535000Z
2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000275 main OS Product: Linux
2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000326 main OS Release: 4.14.157-17.cernvm.x86_64
2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000348 main OS Version: #1 SMP Wed Dec 4 17:26:45 CET 2019
2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000367 main Executable: /usr/share/vboxguest52/usr/sbin/VBoxService
2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000368 main Process ID: 2948
2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000368 main Package type: LINUX_64BITS_GENERIC
2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.001838 main 5.2.6 r120293 started. Verbose level = 0
2020-05-10 23:54:25 (157820): Guest Log: [INFO] Mounting the shared directory
2020-05-10 23:54:25 (157820): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded!
2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] 0
2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Testing VCCS connection to vccs.cern.ch on port 443
2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] 0
2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Connection to vocms0840.cern.ch 9618 port [tcp/condor] succeeded!
2020-05-10 23:54:26 (157820): Guest Log: [DEBUG] 0
2020-05-10 23:55:28 (157820): Guest Log: [DEBUG] Probing CVMFS ...
2020-05-10 23:55:29 (157820): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2020-05-10 23:55:29 (157820): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2020-05-10 23:55:29 (157820): Guest Log: 2.4.4.0 3713 1 25848 12197 4 1 1242455 4096000 2 65024 0 3 100 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 1
2020-05-10 23:55:33 (157820): Guest Log: [INFO] Reading volunteer information
2020-05-10 23:55:33 (157820): Guest Log: [INFO] Volunteer: Guy PF Masevaux (589052)
2020-05-10 23:55:33 (157820): Guest Log: [INFO] VMID: 49b2fac1-df25-48d2-a4ee-4612ca6a31f8
2020-05-10 23:55:34 (157820): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2020-05-10 23:55:34 (157820): Guest Log: [INFO] Running the fast benchmark.
2020-05-10 23:55:59 (157820): Guest Log: [INFO] Machine performance 20.11 HEPSPEC06
2020-05-10 23:55:59 (157820): Guest Log: [INFO] CMS application starting. Check log files.
2020-05-10 23:56:00 (157820): Guest Log: [DEBUG] HTCondor ping
2020-05-10 23:56:01 (157820): Guest Log: [DEBUG] 0
2020-05-11 00:06:26 (157820): Guest Log: Did the tarball get created?
2020-05-11 00:06:26 (157820): Guest Log: /tmp/CMS_175225_1589144489.909597_0.tgz
2020-05-11 00:06:26 (157820): Guest Log: Here is the upload output
2020-05-11 00:06:27 (157820): Guest Log: Here is the upload error
2020-05-11 00:06:27 (157820): Guest Log: Here is the condor directory
2020-05-11 00:06:27 (157820): Guest Log: MasterLog
2020-05-11 00:06:27 (157820): Guest Log: ProcLog
2020-05-11 00:06:27 (157820): Guest Log: StarterLog
2020-05-11 00:06:27 (157820): Guest Log: StartLog
2020-05-11 00:06:27 (157820): Guest Log: XferStatsLog
2020-05-11 00:06:27 (157820): Guest Log: Here is the MasterLog
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ******************************************************
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** condor_master (CONDOR_MASTER) STARTING UP
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** /usr/sbin/condor_master
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** $CondorPlatform: x86_64_RedHat6 $
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** PID = 4695
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** Log last touched time unavailable (No such file or directory)
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ******************************************************
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 Using config source: /etc/condor/condor_config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 Using local config sources:
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/10_security.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/14_network.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/20_workernode.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/30_lease.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/35_cms.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/40_ccb.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/62-benchmark.conf
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/condor_config.local
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 config Macros = 170, Sorted = 170, StringBytes = 6830, TablesBytes = 6224
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 CLASSAD_CACHING is OFF
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 Daemon Log is logging: D_ALWAYS D_ERROR
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 Daemoncore: Listening at <10.0.2.15:43927> on TCP (ReliSock).
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 DaemonCore: command socket at <10.0.2.15:43927?addrs=10.0.2.15-43927&noUDP>
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 DaemonCore: private command socket at <10.0.2.15:43927?addrs=10.0.2.15-43927>
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:16 CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=137.138.156.85-9618#2081158
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:16 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1520893905)
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 10244
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:27 Setting ready state 'Ready' for STARTD
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Got SIGTERM. Performing graceful shutdown.
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Sent SIGTERM to STARTD (pid 10244)
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 AllReaper unexpectedly called on pid 10244, status 0.
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 The STARTD (pid 10244) exited with status 0
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 All daemons are gone. Exiting.
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 **** condor_master (condor_MASTER) pid 4695 EXITING WITH STATUS 0
2020-05-11 00:06:27 (157820): Guest Log: Here is the KernelTuning.log
2020-05-11 00:06:27 (157820): Guest Log: Here is the StartLog
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ******************************************************
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** condor_startd (CONDOR_STARTD) STARTING UP
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** /usr/sbin/condor_startd
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** $CondorPlatform: x86_64_RedHat6 $
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** PID = 10244
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** Log last touched time unavailable (No such file or directory)
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ******************************************************
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 Using config source: /etc/condor/condor_config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 Using local config sources:
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/10_security.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/14_network.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/20_workernode.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/30_lease.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/35_cms.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/40_ccb.config
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/62-benchmark.conf
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/condor_config.local
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 config Macros = 171, Sorted = 171, StringBytes = 6856, TablesBytes = 6260
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 CLASSAD_CACHING is ENABLED
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 Daemon Log is logging: D_ALWAYS D_ERROR
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 Daemoncore: Listening at <10.0.2.15:41863> on TCP (ReliSock).
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 DaemonCore: command socket at <10.0.2.15:41863?addrs=10.0.2.15-41863&noUDP>
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 DaemonCore: private command socket at <10.0.2.15:41863?addrs=10.0.2.15-41863>
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:22 CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=137.138.156.85-9618#2081160
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 VM-gahp server reported an internal error
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 VM universe will be tested to check if it is available
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 History file rotation is enabled.
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Maximum history file size is: 20971520 bytes
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Number of rotated history files is: 2
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
2020-05-11 00:06:27 (157820): Guest Log: slot type 0: Cpus: 1.000000, Memory: 3000, Swap: 100.00%, Disk: 100.00%
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 New machine resource allocated
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Setting up slot pairings
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJobList: Adding job 'multicore'
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJob: Initializing job 'multicore' (/usr/local/bin/multicore-shutdown)
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJobList: Adding job 'mips'
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJobList: Adding job 'kflops'
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 State change: IS_OWNER is false
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Changing state: Owner -> Unclaimed
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 State change: RunBenchmarks is TRUE
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Changing activity: Idle -> Benchmarking
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 BenchMgr:StartBenchmarks()
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:27 Initial update sent to collector(s)
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:27 Sending DC_SET_READY message to master <10.0.2.15:43927?addrs=10.0.2.15-43927>
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:45 State change: benchmarks completed
2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:45 Changing activity: Benchmarking -> Idle
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 No resources have been claimed for 600 seconds
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Shutting down Condor on this machine.
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Got SIGTERM. Performing graceful shutdown.
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 shutdown graceful
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job multicore
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job mips
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job kflops
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Deleting cron job manager
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job multicore
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: 'multicore': Trying to kill illegal PID 0
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job multicore
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: 'multicore': Trying to kill illegal PID 0
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting job 'multicore'
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: Deleting job 'multicore' (/usr/local/bin/multicore-shutdown), timer 9
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: 'multicore': Trying to kill illegal PID 0
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Deleting benchmark job mgr
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job mips
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job kflops
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job mips
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job kflops
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting job 'mips'
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: Deleting job 'mips' (/usr/libexec/condor/condor_mips), timer -1
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting job 'kflops'
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: Deleting job 'kflops' (/usr/libexec/condor/condor_kflops), timer -1
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting all jobs
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 All resources are free, exiting.
2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 **** condor_startd (condor_STARTD) pid 10244 EXITING WITH STATUS 0
2020-05-11 00:06:27 (157820): Guest Log: [ERROR] No jobs were available to run.
2020-05-11 00:06:27 (157820): Guest Log: [INFO] Shutting Down.
2020-05-11 00:06:27 (157820): VM Completion File Detected.
2020-05-11 00:06:27 (157820): VM Completion Message: No jobs were available to run. .
2020-05-11 00:06:27 (157820): Powering off VM.
2020-05-11 00:11:28 (157820): VM did not power off when requested.
2020-05-11 00:11:28 (157820): VM was successfully terminated.
2020-05-11 00:11:28 (157820): Deregistering VM. (boinc_312d6963e4768bbf, slot#6)
2020-05-11 00:11:28 (157820): Removing network bandwidth throttle group from VM.
2020-05-11 00:11:28 (157820): Removing VM from VirtualBox.
00:11:33 (157820): called boinc_finish(207)
</stderr_txt>
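Three distinct failure signatures have now been quoted in this thread: the x509 proxy error, "Condor ended after N seconds", and this "No jobs were available to run" shutdown after 600 unclaimed seconds. A small sketch (a hypothetical helper, not a project tool) that classifies a task from its stderr along those lines:

    # classify_cms_failure.py - usage: python classify_cms_failure.py < stderr.txt
    import sys

    # Failure signatures quoted in this thread, mapped to a rough diagnosis.
    SIGNATURES = {
        "Could not get an x509 credential": "x509 proxy problem (server side)",
        "Condor ended after": "condor died or lost the server mid-task",
        "No jobs were available to run": "no matchable jobs in the condor queue",
    }

    def classify(stderr_text: str) -> str:
        for needle, diagnosis in SIGNATURES.items():
            if needle in stderr_text:
                return diagnosis
        return "unknown failure mode"

    if __name__ == "__main__":
        print(classify(sys.stdin.read()))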
Joined: 2 May 07 Posts: 2243 Credit: 173,902,375 RAC: 1,652
You can keep running ATLAS or Theory until we get the OK from the CMS team!
Joined: 9 Sep 19 Posts: 32 Credit: 2,856,470 RAC: 0
Now everything is running without problems. Congratulations to the programmer who solved the bug!
Joined: 9 Sep 19 Posts: 32 Credit: 2,856,470 RAC: 0
New errors have occurred, not after 18 minutes this time but after 22 minutes of running. The summary of the failing task is different: it is killing a lot of jobs. I will check whether the origin is one or more of my computers.
Joined: 9 Sep 19 Posts: 32 Credit: 2,856,470 RAC: 0
All of my computers were producing errors. Excuse me, but this is a programming problem, or a mistake in the mathematical domain definition in the program. I think error handling inside the program could prevent such a mistake.
Joined: 2 May 07 Posts: 2243 Credit: 173,902,375 RAC: 1,652
You can keep running ATLAS or Theory until we get the OK from the CMS team!
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,081
Hi everyone, long time no see. Let's just say it's been a bad year for me, and then Covid-19 struck and it became a bad year for everyone.

As you've noticed, we are still having problems with CMS@Home. We've been submitting small batches of jobs to try to catch the culprits, and we are a bit nearer to finding them. Here is the job submission process as I see it:

o We create a batch of jobs using WMAgent, and WMAgent sends some of them off to the HTCondor server. As the condor server returns jobs, as successes or failures, WMAgent replenishes the pool of jobs on the condor server.
o The BOINC server watches the condor server, and when it sees jobs are available, it creates tasks in the queue for CMS@Home. When there are no jobs available, it allows the task queue to drain.
o When one of the volunteer machines gets a BOINC task from the server, it spins up a virtual machine which asks the condor server for a job. If it gets a job within 10 minutes it starts to run it; otherwise the task fails.
o If a job on a VM returns failure to the condor server, it will be requeued -- I believe the default number of tries is three before failure is reported to WMAgent.
o If the condor server returns failure for a job back to WMAgent, then WMAgent requeues it for further submission to the condor server. It is supposed to be resubmitted with all the same requirements as before, but a change deep within the WMAgent code last year means that these jobs actually get sent with a requirement "Do not run on a volunteer machine". The WMAgent developers think this is because condor is returning bad information on the job; we are still trying to figure out whether that is the case.
o So this leaves the condor server with jobs it believes won't run on volunteer machines, while volunteer tasks are requesting jobs and failing when they hit the ten-minute time-out.
o Meanwhile, the BOINC server notices the task failures and gradually reduces the task quota for the VM until just one task per day is permitted. Because there are still jobs in the condor queue, the BOINC server continues to create tasks for its queue.
o There's another side to this that we don't fully understand, and have only once definitely caught "in the act". It seems that if jobs sit in the "pending" queue for several days, we see them being successfully run again. We think there's a condor timeout and they are returned to WMAgent, which then resubmits them with the correct requirements.

There have been other failures in the CERN IT infrastructure which have also hampered our efforts, and naturally they can take longer to fix than in normal times. At the moment I have a batch of 500 jobs being processed. A previous batch of 100 jobs ran without any obvious hitch, but this one currently has 200 jobs in the pending state. It's been like that for a few days, and we are waiting to see if these will suddenly be released to volunteers again.

Now, my apologies for all this, and for the lack of communication while we fight the problem. When jobs become available, you do pick up on them fairly quickly. However, because we are only releasing small batches intermittently, we don't really need an army of machines spinning their wheels waiting for jobs. Do please feel free to set No New Tasks, or migrate to other projects, while we try to sort out our difficulties.
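The description above amounts to a feedback loop: condor holds jobs flagged "do not run on a volunteer machine", volunteer VMs hit the ten-minute job time-out, and the BOINC server backs off the host's daily quota. A toy simulation of that loop (everything here is invented for illustration except the ten-minute time-out and the one-task-per-day floor described above; the halving schedule is an assumption, not BOINC's exact algorithm):

    # Toy model of the failure loop described above.
    JOB_WAIT_LIMIT_S = 10 * 60  # a task fails if condor hands out no job within 10 min

    def run_task(condor_will_give_a_job: bool) -> bool:
        """One volunteer task: succeeds only if a job arrives before the time-out."""
        wait_s = 0 if condor_will_give_a_job else JOB_WAIT_LIMIT_S
        return wait_s < JOB_WAIT_LIMIT_S

    def simulate(days: int, jobs_flagged_unrunnable: bool) -> int:
        quota = 32  # hypothetical starting tasks/day for a host
        for _ in range(days):
            for _ in range(quota):
                if not run_task(not jobs_flagged_unrunnable):
                    quota = max(1, quota // 2)  # back off toward 1 task/day
                    break
        return quota

    print(simulate(days=7, jobs_flagged_unrunnable=True))   # -> 1 (quota collapses)
    print(simulate(days=7, jobs_flagged_unrunnable=False))  # -> 32 (quota untouched)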