Message boards : CMS Application : CMS Tasks Failing
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,363
RAC: 3,820
Message 41157 - Posted: 4 Jan 2020, 8:40:38 UTC - in response to Message 41156.  

Yes, that is hard to believe. What usually happens is that members see that number, download a batch on several computers, and try just a couple of tasks. They find that the tasks fail, most of the time after running for a few hours, so they suspend them and wonder what happened.

But 7,931 running is hard to believe even when they are working.
ID: 41157
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,453,985
RAC: 103,272
Message 41215 - Posted: 9 Jan 2020, 10:03:36 UTC

From what I see on the Server Status page, the number of "users in the past 24 hours" has dropped to 79 (a week ago it was about 150, if I remember correctly).

Could this mean that the CMS tasks are still failing? Has really nobody taken care of the problem so far?
ID: 41215
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,010,346
RAC: 136,248
Message 41220 - Posted: 9 Jan 2020, 11:43:54 UTC - in response to Message 41215.  

A CMS task is nothing but an envelope for running CMS subtasks (= jobs).
The BOINC server page shows some statistics about those envelopes.
Subtask statistics can be seen on the Grafana pages, e.g.:
https://monit-grafana.cern.ch/d/000000628/cms-job-monitoring?orgId=11&from=now-3d&to=now-12m&refresh=15m&var-group_by=CMS_JobType&var-Tier=All&var-Site=T3_CH_Volunteer&var-Type=All&var-CMS_JobType=All&var-CMSPrimaryDataTier=All&var-binning=1h&var-measurement=condor_1h&var-retention_policy=long
If you don't have a regular CERN account, you can log in with your public Facebook or Google account.

Grafana shows that 160 subtasks are running at the moment.
From experience, this is a normal value.

The numbers (tasks as well as subtasks and users) are usually much higher when other LHC subprojects run dry and volunteers who have enabled several subprojects get CMS instead.
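
As a reading aid, the dashboard link above can be decoded with Python's standard `urllib.parse`. This is only a sketch that splits the query string of the URL quoted in this post; the names `GRAFANA_URL` and `dashboard_filters` are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Dashboard URL exactly as quoted in the post above.
GRAFANA_URL = (
    "https://monit-grafana.cern.ch/d/000000628/cms-job-monitoring"
    "?orgId=11&from=now-3d&to=now-12m&refresh=15m"
    "&var-group_by=CMS_JobType&var-Tier=All&var-Site=T3_CH_Volunteer"
    "&var-Type=All&var-CMS_JobType=All&var-CMSPrimaryDataTier=All"
    "&var-binning=1h&var-measurement=condor_1h&var-retention_policy=long"
)


def dashboard_filters(url):
    """Return the query parameters of a dashboard URL as a flat dict."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; every key occurs only once in this URL.
    return {key: values[0] for key, values in parse_qs(query).items()}


filters = dashboard_filters(GRAFANA_URL)
print(filters["var-Site"])             # site the panels are filtered on
print(filters["from"], filters["to"])  # time window shown
```

The `var-Site` value appears to be what restricts the panels to volunteer machines, and the time window runs from `now-3d` to `now-12m`.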
ID: 41220
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,453,985
RAC: 103,272
Message 41257 - Posted: 14 Jan 2020, 15:21:30 UTC

After waiting a while, I tried CMS again, only to find that the tasks are still failing:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=259203295

this task failed after 5 hours 11 minutes:

1 (0x00000001) Unknown error code

2020-01-14 13:51:13 (19508): Guest Log: [ERROR] Condor ended after 18039 seconds.
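
For comparison with the 5 hours 11 minutes quoted above, the Condor runtime in the log line can be converted to hours, minutes and seconds. A minimal sketch, assuming only the line format shown here; `condor_runtime` is a hypothetical helper name.

```python
import re

# Matches the guest-log line quoted above.
CONDOR_END = re.compile(r"Condor ended after (\d+) seconds")


def condor_runtime(line):
    """Return the Condor runtime as H:MM:SS, or None if the line does not match."""
    match = CONDOR_END.search(line)
    if match is None:
        return None
    minutes, seconds = divmod(int(match.group(1)), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


line = "2020-01-14 13:51:13 (19508): Guest Log: [ERROR] Condor ended after 18039 seconds."
print(condor_runtime(line))  # 5:00:39
```

Condor itself therefore ran for about 5 hours; the remaining minutes of the task are presumably VM start-up and shutdown.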


Since this problem has existed for quite some time now, I am wondering why no one is fixing it.
BTW, I haven't heard from Ivan for a long time. Is he no longer on board?
ID: 41257
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 41261 - Posted: 14 Jan 2020, 20:46:56 UTC - in response to Message 41257.  

Thanks. I have a machine that will be available in a few days.
It looks like it should go on ATLAS.
ID: 41261
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,010,346
RAC: 136,248
Message 41263 - Posted: 14 Jan 2020, 21:45:14 UTC - in response to Message 41257.  

Grafana shows a failure peak between 13:00 and 14:00 CET today.
At present the failure rate is down to 2.5 % and the number of running jobs is 321.

My own tasks show an error rate of 0.4 % since 2020-01-06.
Rather good for a complex app like CMS, isn't it?
ID: 41263
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 41267 - Posted: 15 Jan 2020, 11:53:30 UTC - in response to Message 41263.  
Last modified: 15 Jan 2020, 11:53:51 UTC

Quote: "My own tasks show an error rate of 0.4 % since 2020-01-06.
Rather good for a complex app like CMS, isn't it?"

Yes, that is good enough for me. Fortunately, you can now run native ATLAS along with CMS, so one can serve as the backup for the other.
ID: 41267
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,637
RAC: 1,939
Message 42347 - Posted: 1 May 2020, 7:30:19 UTC
Last modified: 1 May 2020, 7:33:57 UTC

CMS tasks are failing because they do not get X509 credentials: https://lhcathome.cern.ch/lhcathome/result.php?resultid=272193767

ID: 42347
princah5
Joined: 14 Jan 20
Posts: 2
Credit: 29,211
RAC: 0
Message 42355 - Posted: 1 May 2020, 12:30:10 UTC - in response to Message 42347.  

Same for me (Windows / VirtualBox): CMS tasks fail with an X509 credential error.
ID: 42355
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,010,346
RAC: 136,248
Message 42359 - Posted: 1 May 2020, 13:45:42 UTC - in response to Message 42355.  

Quote: "Same for me (win / vbox), CMS tasks fail with X509 credential error"

Besides the fact that the project team has not officially announced that CMS is stable again, you may want to change this setting:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272197689
2020-05-01 12:16:48 (7752): Setting CPU throttle for VM. (60%)

Simple native BOINC apps without permanent network access can usually be throttled to a very low percentage.
Complex apps that use VirtualBox, like those from LHC@home, should run at 100 % whenever possible.
This avoids timing problems on various levels.
To reduce the total load on a computer or to counter temperature/noise problems, it would be better to reduce the number of cores used by BOINC.
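
A quick way to spot this in one's own results is to scan the task's stderr output for the throttle line. This is a minimal sketch based only on the log format shown in this thread; `throttle_percent` is a hypothetical helper, and the link between this line and the BOINC CPU time preference is an assumption.

```python
import re

# Line format taken from the vboxwrapper output quoted in this thread.
# Assumption: it reflects the BOINC "use at most N % of CPU time" preference.
THROTTLE = re.compile(r"Setting CPU throttle for VM\. \((\d+)%\)")


def throttle_percent(log_text):
    """Return the last CPU throttle value found in the log, or None."""
    values = [int(value) for value in THROTTLE.findall(log_text)]
    return values[-1] if values else None


log = "2020-05-01 12:16:48 (7752): Setting CPU throttle for VM. (60%)"
pct = throttle_percent(log)
if pct is not None and pct < 100:
    print(f"VM is throttled to {pct} %; consider 100 % and fewer cores instead")
```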
ID: 42359
PaoloNasca
Joined: 11 Jul 19
Posts: 6
Credit: 1,605,025
RAC: 557
Message 42360 - Posted: 1 May 2020, 15:01:01 UTC
Last modified: 1 May 2020, 15:12:28 UTC

I'm facing the same issue. All WUs from the "CMS Simulation v50.00 (vbox64) windows_x86_64" application fail.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272165131

ERROR: Couldn't read proxy from: /tmp/x509up_u0
globus_credential: Error reading proxy credential
globus_credential: Error reading proxy credential: Couldn't read PEM from bio
OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line
Guest Log: Use -debug for further information.
[ERROR] Could not get an x509 credential
[ERROR] The x509 proxy creation failed.
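
The OpenSSL message "no start line" generally means that the file contains no `-----BEGIN ...-----` line, i.e. it is empty or not PEM at all. The sketch below shows that kind of check; it does not reproduce what the VM actually does, the function names are hypothetical, and the sample inputs are invented.

```python
def has_pem_start_line(data):
    """True if any line starts with the PEM armour prefix '-----BEGIN '."""
    return any(line.startswith(b"-----BEGIN ") for line in data.splitlines())


def describe_proxy_file(data):
    """Classify the raw bytes of a would-be proxy file (illustrative only)."""
    if not data:
        return "empty file"
    if not has_pem_start_line(data):
        return "no PEM start line"
    return "looks like PEM"


# Invented sample contents, not taken from a real /tmp/x509up_u0.
print(describe_proxy_file(b""))
print(describe_proxy_file(b"<html>error</html>"))
print(describe_proxy_file(b"-----BEGIN CERTIFICATE-----\nMIIB\n-----END CERTIFICATE-----\n"))
```

Either of the first two cases would be consistent with the credential request having returned nothing usable.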
ID: 42360
PaoloNasca
Joined: 11 Jul 19
Posts: 6
Credit: 1,605,025
RAC: 557
Message 42398 - Posted: 10 May 2020, 14:47:36 UTC

Today the error is about Condor

https://lhcathome.cern.ch/lhcathome/result.php?resultid=272679370

Guest Log: [INFO] CMS application starting. Check log files.
Guest Log: [DEBUG] HTCondor ping
Guest Log: [DEBUG] 0
Guest Log: [ERROR] Condor ended after 1324 seconds.
Guest Log: [INFO] Shutting Down.
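
When a result page is long, the `[ERROR]` lines can be pulled out of the guest log automatically. A minimal sketch, assuming the `Guest Log: [LEVEL] message` layout seen in the excerpt above; `guest_errors` is a hypothetical name.

```python
import re

# Layout as seen in the excerpt above: "Guest Log: [LEVEL] message".
GUEST_LINE = re.compile(r"Guest Log: \[(INFO|DEBUG|ERROR)\] (.*)")


def guest_errors(log_text):
    """Return the messages of all [ERROR] guest-log lines, in order."""
    errors = []
    for line in log_text.splitlines():
        match = GUEST_LINE.search(line)
        if match and match.group(1) == "ERROR":
            errors.append(match.group(2).strip())
    return errors


excerpt = """\
Guest Log: [INFO] CMS application starting. Check log files.
Guest Log: [DEBUG] HTCondor ping
Guest Log: [DEBUG] 0
Guest Log: [ERROR] Condor ended after 1324 seconds.
Guest Log: [INFO] Shutting Down.
"""
print(guest_errors(excerpt))
```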
ID: 42398
benefique pour tous
Joined: 9 Sep 19
Posts: 32
Credit: 2,856,470
RAC: 0
Message 42400 - Posted: 10 May 2020, 22:23:01 UTC - in response to Message 30718.  

Today the CMS tasks do nothing but fail.
I looked at the task summary.
The tasks seem to have been wrongly created as multicore, but they run as single core and fail.
ID: 42400
benefique pour tous
Joined: 9 Sep 19
Posts: 32
Credit: 2,856,470
RAC: 0
Message 42403 - Posted: 10 May 2020, 22:27:12 UTC - in response to Message 42400.  

Here is the output, so that the programmers can understand what is failing:



<core_client_version>7.16.5</core_client_version>
<![CDATA[
<message>
La pile de l - exit code 207 (0xcf)</message>
<stderr_txt>
2020-05-10 23:53:19 (157820): Detected: vboxwrapper 26197
2020-05-10 23:53:19 (157820): Detected: BOINC client v7.7
2020-05-10 23:53:20 (157820): Detected: VirtualBox VboxManage Interface (Version: 6.0.14)
2020-05-10 23:53:20 (157820): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2020-05-10 23:53:20 (157820): Successfully copied 'init_data.xml' to the shared directory.
2020-05-10 23:53:22 (157820): Create VM. (boinc_312d6963e4768bbf, slot#6)
2020-05-10 23:53:22 (157820): Setting Memory Size for VM. (2048MB)
2020-05-10 23:53:22 (157820): Setting CPU Count for VM. (1)
2020-05-10 23:53:23 (157820): Setting Chipset Options for VM.
2020-05-10 23:53:23 (157820): Setting Boot Options for VM.
2020-05-10 23:53:23 (157820): Setting Network Configuration for NAT.
2020-05-10 23:53:23 (157820): Enabling VM Network Access.
2020-05-10 23:53:24 (157820): Disabling USB Support for VM.
2020-05-10 23:53:25 (157820): Disabling COM Port Support for VM.
2020-05-10 23:53:25 (157820): Disabling LPT Port Support for VM.
2020-05-10 23:53:25 (157820): Disabling Audio Support for VM.
2020-05-10 23:53:25 (157820): Disabling Clipboard Support for VM.
2020-05-10 23:53:26 (157820): Disabling Drag and Drop Support for VM.
2020-05-10 23:53:26 (157820): Adding storage controller(s) to VM.
2020-05-10 23:53:26 (157820): Adding virtual disk drive to VM. (vm_image.vdi)
2020-05-10 23:53:27 (157820): Adding VirtualBox Guest Additions to VM.
2020-05-10 23:53:27 (157820): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2020-05-10 23:53:27 (157820): forwarding host port 65246 to guest port 80
2020-05-10 23:53:27 (157820): Enabling remote desktop for VM.
2020-05-10 23:53:28 (157820): Enabling shared directory for VM.
2020-05-10 23:53:28 (157820): Starting VM using VBoxManage interface. (boinc_312d6963e4768bbf, slot#6)
2020-05-10 23:53:33 (157820): Successfully started VM. (PID = '142712')
2020-05-10 23:53:33 (157820): Reporting VM Process ID to BOINC.
2020-05-10 23:53:33 (157820): Guest Log: BIOS: VirtualBox 6.0.14

2020-05-10 23:53:33 (157820): Guest Log: CPUID EDX: 0x178bfbff

2020-05-10 23:53:33 (157820): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63

2020-05-10 23:53:33 (157820): VM state change detected. (old = 'PoweredOff', new = 'Running')
2020-05-10 23:53:33 (157820): Detected: Web Application Enabled (http://localhost:65246)
2020-05-10 23:53:33 (157820): Detected: Remote Desktop Enabled (localhost:65247)
2020-05-10 23:53:33 (157820): Preference change detected
2020-05-10 23:53:33 (157820): Setting CPU throttle for VM. (80%)
2020-05-10 23:53:34 (157820): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))
2020-05-10 23:53:35 (157820): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032

2020-05-10 23:53:35 (157820): Guest Log: BIOS: Booting from Hard Disk...

2020-05-10 23:53:37 (157820): Guest Log: BIOS: KBD: unsupported int 16h function 03

2020-05-10 23:53:37 (157820): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000

2020-05-10 23:53:52 (157820): Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds

2020-05-10 23:53:52 (157820): Guest Log: vboxguest: misc device minor 56, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)

2020-05-10 23:54:14 (157820): Guest Log: VBoxService 5.2.6 r120293 (verbosity: 0) linux.amd64 (Jan 15 2018 14:51:00) release log

2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000125 main Log opened 2020-05-10T21:54:14.490535000Z

2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000275 main OS Product: Linux

2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000326 main OS Release: 4.14.157-17.cernvm.x86_64

2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000348 main OS Version: #1 SMP Wed Dec 4 17:26:45 CET 2019

2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000367 main Executable: /usr/share/vboxguest52/usr/sbin/VBoxService

2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000368 main Process ID: 2948

2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.000368 main Package type: LINUX_64BITS_GENERIC

2020-05-10 23:54:14 (157820): Guest Log: 00:00:00.001838 main 5.2.6 r120293 started. Verbose level = 0

2020-05-10 23:54:25 (157820): Guest Log: [INFO] Mounting the shared directory

2020-05-10 23:54:25 (157820): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor

2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80

2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded!

2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] 0

2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Testing VCCS connection to vccs.cern.ch on port 443

2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!

2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] 0

2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Testing connection to Condor server on port 9618

2020-05-10 23:54:25 (157820): Guest Log: [DEBUG] Connection to vocms0840.cern.ch 9618 port [tcp/condor] succeeded!

2020-05-10 23:54:26 (157820): Guest Log: [DEBUG] 0

2020-05-10 23:55:28 (157820): Guest Log: [DEBUG] Probing CVMFS ...

2020-05-10 23:55:29 (157820): Guest Log: Probing /cvmfs/grid.cern.ch... OK

2020-05-10 23:55:29 (157820): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE

2020-05-10 23:55:29 (157820): Guest Log: 2.4.4.0 3713 1 25848 12197 4 1 1242455 4096000 2 65024 0 3 100 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 1

2020-05-10 23:55:33 (157820): Guest Log: [INFO] Reading volunteer information

2020-05-10 23:55:33 (157820): Guest Log: [INFO] Volunteer: Guy PF Masevaux (589052)

2020-05-10 23:55:33 (157820): Guest Log: [INFO] VMID: 49b2fac1-df25-48d2-a4ee-4612ca6a31f8

2020-05-10 23:55:34 (157820): Guest Log: [INFO] Requesting an X509 credential from LHC@home

2020-05-10 23:55:34 (157820): Guest Log: [INFO] Running the fast benchmark.

2020-05-10 23:55:59 (157820): Guest Log: [INFO] Machine performance 20.11 HEPSPEC06

2020-05-10 23:55:59 (157820): Guest Log: [INFO] CMS application starting. Check log files.

2020-05-10 23:56:00 (157820): Guest Log: [DEBUG] HTCondor ping

2020-05-10 23:56:01 (157820): Guest Log: [DEBUG] 0

2020-05-11 00:06:26 (157820): Guest Log: Did the tarball get created?

2020-05-11 00:06:26 (157820): Guest Log: /tmp/CMS_175225_1589144489.909597_0.tgz

2020-05-11 00:06:26 (157820): Guest Log: Here is the upload output

2020-05-11 00:06:27 (157820): Guest Log: Here is the upload error

2020-05-11 00:06:27 (157820): Guest Log: Here is the condor directory

2020-05-11 00:06:27 (157820): Guest Log: MasterLog

2020-05-11 00:06:27 (157820): Guest Log: ProcLog

2020-05-11 00:06:27 (157820): Guest Log: StarterLog

2020-05-11 00:06:27 (157820): Guest Log: StartLog

2020-05-11 00:06:27 (157820): Guest Log: XferStatsLog

2020-05-11 00:06:27 (157820): Guest Log: Here is the MasterLog

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ******************************************************

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** condor_master (CONDOR_MASTER) STARTING UP

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** /usr/sbin/condor_master

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** $CondorPlatform: x86_64_RedHat6 $

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** PID = 4695

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ** Log last touched time unavailable (No such file or directory)

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 ******************************************************

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 Using config source: /etc/condor/condor_config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 Using local config sources:

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/10_security.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/14_network.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/20_workernode.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/30_lease.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/35_cms.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/40_ccb.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/config.d/62-benchmark.conf

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 /etc/condor/condor_config.local

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 config Macros = 170, Sorted = 170, StringBytes = 6830, TablesBytes = 6224

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 CLASSAD_CACHING is OFF

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 Daemon Log is logging: D_ALWAYS D_ERROR

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 Daemoncore: Listening at <10.0.2.15:43927> on TCP (ReliSock).

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 DaemonCore: command socket at <10.0.2.15:43927?addrs=10.0.2.15-43927&noUDP>

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:01 DaemonCore: private command socket at <10.0.2.15:43927?addrs=10.0.2.15-43927>

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:16 CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=137.138.156.85-9618#2081158

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:16 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1520893905)

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 10244

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:27 Setting ready state 'Ready' for STARTD

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Got SIGTERM. Performing graceful shutdown.

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Sent SIGTERM to STARTD (pid 10244)

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 AllReaper unexpectedly called on pid 10244, status 0.

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 The STARTD (pid 10244) exited with status 0

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 All daemons are gone. Exiting.

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 **** condor_master (condor_MASTER) pid 4695 EXITING WITH STATUS 0

2020-05-11 00:06:27 (157820): Guest Log: Here is the KernelTuning.log

2020-05-11 00:06:27 (157820): Guest Log: Here is the StartLog

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ******************************************************

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** condor_startd (CONDOR_STARTD) STARTING UP

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** /usr/sbin/condor_startd

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** $CondorPlatform: x86_64_RedHat6 $

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** PID = 10244

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ** Log last touched time unavailable (No such file or directory)

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 ******************************************************

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 Using config source: /etc/condor/condor_config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 Using local config sources:

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/10_security.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/14_network.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/20_workernode.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/30_lease.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/35_cms.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/40_ccb.config

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/config.d/62-benchmark.conf

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 /etc/condor/condor_config.local

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 config Macros = 171, Sorted = 171, StringBytes = 6856, TablesBytes = 6260

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 CLASSAD_CACHING is ENABLED

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 Daemon Log is logging: D_ALWAYS D_ERROR

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 Daemoncore: Listening at <10.0.2.15:41863> on TCP (ReliSock).

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 DaemonCore: command socket at <10.0.2.15:41863?addrs=10.0.2.15-41863&noUDP>

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:17 DaemonCore: private command socket at <10.0.2.15:41863?addrs=10.0.2.15-41863>

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:22 CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=137.138.156.85-9618#2081160

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 VM-gahp server reported an internal error

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 VM universe will be tested to check if it is available

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 History file rotation is enabled.

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Maximum history file size is: 20971520 bytes

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Number of rotated history files is: 2

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto

2020-05-11 00:06:27 (157820): Guest Log: slot type 0: Cpus: 1.000000, Memory: 3000, Swap: 100.00%, Disk: 100.00%

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 New machine resource allocated

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Setting up slot pairings

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJobList: Adding job 'multicore'

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJob: Initializing job 'multicore' (/usr/local/bin/multicore-shutdown)

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJobList: Adding job 'mips'

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJobList: Adding job 'kflops'

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 State change: IS_OWNER is false

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Changing state: Owner -> Unclaimed

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 State change: RunBenchmarks is TRUE

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 Changing activity: Idle -> Benchmarking

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:24 BenchMgr:StartBenchmarks()

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:27 Initial update sent to collector(s)

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:27 Sending DC_SET_READY message to master <10.0.2.15:43927?addrs=10.0.2.15-43927>

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:45 State change: benchmarks completed

2020-05-11 00:06:27 (157820): Guest Log: 05/10/20 23:56:45 Changing activity: Benchmarking -> Idle

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 No resources have been claimed for 600 seconds

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Shutting down Condor on this machine.

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Got SIGTERM. Performing graceful shutdown.

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 shutdown graceful

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job multicore

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job mips

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job kflops

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Deleting cron job manager

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job multicore

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: 'multicore': Trying to kill illegal PID 0

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job multicore

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: 'multicore': Trying to kill illegal PID 0

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting job 'multicore'

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: Deleting job 'multicore' (/usr/local/bin/multicore-shutdown), timer 9

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: 'multicore': Trying to kill illegal PID 0

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Deleting benchmark job mgr

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job mips

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job kflops

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job mips

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Killing job kflops

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting job 'mips'

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: Deleting job 'mips' (/usr/libexec/condor/condor_mips), timer -1

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting job 'kflops'

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJob: Deleting job 'kflops' (/usr/libexec/condor/condor_kflops), timer -1

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 Cron: Killing all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 CronJobList: Deleting all jobs

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 All resources are free, exiting.

2020-05-11 00:06:27 (157820): Guest Log: 05/11/20 00:06:24 **** condor_startd (condor_STARTD) pid 10244 EXITING WITH STATUS 0

2020-05-11 00:06:27 (157820): Guest Log: [ERROR] No jobs were available to run.

2020-05-11 00:06:27 (157820): Guest Log: [INFO] Shutting Down.

2020-05-11 00:06:27 (157820): VM Completion File Detected.
2020-05-11 00:06:27 (157820): VM Completion Message: No jobs were available to run.
.
2020-05-11 00:06:27 (157820): Powering off VM.
2020-05-11 00:11:28 (157820): VM did not power off when requested.
2020-05-11 00:11:28 (157820): VM was successfully terminated.
2020-05-11 00:11:28 (157820): Deregistering VM. (boinc_312d6963e4768bbf, slot#6)
2020-05-11 00:11:28 (157820): Removing network bandwidth throttle group from VM.
2020-05-11 00:11:28 (157820): Removing VM from VirtualBox.
00:11:33 (157820): called boinc_finish(207)

</stderr_txt>
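
The key line in this dump is "No resources have been claimed for 600 seconds". The timestamps around it can be checked with a short calculation; this is only a sketch, assuming the `MM/DD/YY HH:MM:SS` format of the Condor lines above, with `seconds_between` as a hypothetical helper.

```python
from datetime import datetime

# Condor log timestamps in the dump above use MM/DD/YY HH:MM:SS.
CONDOR_FMT = "%m/%d/%y %H:%M:%S"


def seconds_between(start, end):
    """Seconds elapsed between two Condor log timestamps."""
    delta = datetime.strptime(end, CONDOR_FMT) - datetime.strptime(start, CONDOR_FMT)
    return delta.total_seconds()


# StartLog: "Changing state: Owner -> Unclaimed" ...
unclaimed_at = "05/10/20 23:56:24"
# ... and "No resources have been claimed for 600 seconds".
shutdown_at = "05/11/20 00:06:24"
print(seconds_between(unclaimed_at, shutdown_at))  # 600.0
```

So the VM waited exactly ten minutes without receiving a job and then shut Condor down, which matches the final "[ERROR] No jobs were available to run."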
ID: 42403
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,184,186
RAC: 104,580
Message 42405 - Posted: 11 May 2020, 6:09:41 UTC

You can run ATLAS or Theory until we get the OK from the CMS team!
ID: 42405
benefique pour tous
Joined: 9 Sep 19
Posts: 32
Credit: 2,856,470
RAC: 0
Message 42425 - Posted: 12 May 2020, 2:04:03 UTC - in response to Message 42405.  

Now everything is running without problems.
Congratulations to the programmer who fixed the bug.
ID: 42425
benefique pour tous
Joined: 9 Sep 19
Posts: 32
Credit: 2,856,470
RAC: 0
Message 42426 - Posted: 12 May 2020, 2:11:36 UTC - in response to Message 42425.  

New errors have occurred, this time not after 18 minutes but after 22 minutes of running.
The summary of the failed task is different.
It is killing a lot of jobs.
I am checking whether the errors come from one computer or from several.
ID: 42426
benefique pour tous
Joined: 9 Sep 19
Posts: 32
Credit: 2,856,470
RAC: 0
Message 42427 - Posted: 12 May 2020, 2:20:20 UTC - in response to Message 42426.  

All of my computers were producing errors. Forgive me for saying so, but this is a programming problem or a mistake in the definition of a mathematical domain in the program.
I think error handling inside the program could prevent such mistakes.
ID: 42427
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,184,186
RAC: 104,580
Message 42428 - Posted: 12 May 2020, 2:29:52 UTC - in response to Message 42405.  

You can run ATLAS or Theory until we get the OK from the CMS team!
ID: 42428
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 42434 - Posted: 12 May 2020, 11:38:32 UTC

Hi everyone, long time no see. Let's just say it's been a bad year for me, and then Covid-19 struck and it became a bad year for everyone.
As you've noticed, we are still having problems with CMS@Home. We've been submitting small batches of jobs to try to catch the culprits and we are a bit nearer to finding them.
Here is the job submission process as I see it:
o We create a batch of jobs using WMAgent, and WMAgent sends some of them off to the HTCondor server. As the condor server returns jobs, as successes or failures, WMAgent replenishes the pool of jobs on the condor server.
o The BOINC server watches the condor server, and when it sees jobs are available, it creates tasks in the queue for CMS@Home. When there are no jobs available, it allows the task queue to drain.
o When one of the volunteer VMs gets a BOINC task from the server it spins up a Virtual Machine which asks the condor server for a job. If it gets a job within 10 minutes it starts to run it, otherwise the task fails.
o If a job on a VM returns failure to the condor server, it will be requeued -- I believe the default number of tries is three before failure is reported to WMAgent.
o If the condor server returns failure for a job back to the WMAgent then it requeues it for further submission to the condor server. It is supposed to be resubmitted with all the same requirements as before, but a change deep within WMAgent code last year means that they actually get sent with a requirement "Do not run on a volunteer machine". The WMAgent developers think that this is because condor is returning bad information on the job; we are still trying to figure out if this is the case.
o So this leaves the condor server with jobs it believes won't run on volunteer machines, while volunteer tasks are requesting jobs, and failing when they hit the ten minute time-out.
o Meanwhile, the BOINC server notices the task failures and gradually reduces the task quota for the VM until just one task per day is permitted. Because there are still jobs in the condor queue, the BOINC server continues to create tasks for its queue.
o There's another side to this that we don't fully understand, and have only once definitely caught "in the act". It seems that if jobs sit in the "pending" queue for several days then we see them being successfully run again. We think there's a condor timeout and they are returned to WMAgent which then resubmits them with the correct requirements.
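
The failure loop described in the list above can be sketched as a toy model. This is not WMAgent, HTCondor or BOINC code: the `Job` class, the `volunteer_ok` flag and all function names are hypothetical stand-ins, and the retry count of three and the 600 second timeout are simply the values mentioned in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_CONDOR_TRIES = 3   # "the default number of tries is three" (as believed above)
CLAIM_TIMEOUT_S = 600  # a VM gives up if no job arrives within 10 minutes


@dataclass
class Job:
    job_id: int
    volunteer_ok: bool = True  # hypothetical flag standing in for the job requirement
    tries: int = 0


def wmagent_resubmit(job: Job) -> Job:
    """Model the reported behaviour: the resubmitted job excludes volunteers."""
    return Job(job.job_id, volunteer_ok=False)


def vm_request(queue: List[Job]) -> Optional[Job]:
    """Return the first job a volunteer VM may run, or None (the task times out)."""
    for job in queue:
        if job.volunteer_ok:
            queue.remove(job)
            return job
    return None


def report_failure(queue: List[Job], job: Job) -> None:
    """Condor requeues a failed job until its tries are used up, then WMAgent resubmits."""
    job.tries += 1
    if job.tries < MAX_CONDOR_TRIES:
        queue.append(job)
    else:
        queue.append(wmagent_resubmit(job))


queue = [Job(1)]
for _ in range(MAX_CONDOR_TRIES):
    report_failure(queue, vm_request(queue))

# The job is still queued, so tasks keep being created, but no VM can claim it.
if vm_request(queue) is None:
    print(f"{len(queue)} job queued, none claimable: task fails after {CLAIM_TIMEOUT_S} s")
```

In this model, one job that fails three times ends up queued but unclaimable, which is the state that makes volunteer tasks hit the ten-minute timeout.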
There have been other failures in the CERN IT infrastructure which have also hampered our efforts, and naturally they can take longer to fix than in normal times.
At the moment I have a batch of 500 jobs being processed. A previous batch of 100 jobs ran without any obvious hitch, but this one currently has 200 jobs in the pending state. It's been like that for a few days, and we are waiting to see if these will suddenly be released to volunteers again.
Now my apologies for all this, and for the lack of communication while we fight the problem. When jobs become available, you do pick up on them fairly quickly. However, because we are only releasing small batches intermittently, we don't really need an army of machines spinning their wheels waiting for jobs. Do please feel free to set No New Tasks, or migrate to other projects, while we try to sort out our difficulties.
ID: 42434


©2024 CERN