Message boards : CMS Application : CMS jobs are becoming available again
Send message Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0
One of my hosts shows messages like this within the VM (e.g. /logs/finished_7.log):

Setting up Frontier log level
Beginning CMSSW wrapper script
slc6_amd64_gcc700 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
----- Begin Fatal Exception 04-Jun-2019 09:23:03 UTC-----------------------
An exception of category 'Incomplete configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
----- End Fatal Exception -------------------------------------------------
Complete process id is 270 status is 65

The starter log (/logs/StarterLog) shows, for example:

06/04/19 10:47:43 (pid:8452) ** condor_starter (CONDOR_STARTER) STARTING UP
06/04/19 10:47:43 (pid:8452) ** /usr/sbin/condor_starter
06/04/19 10:47:43 (pid:8452) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
06/04/19 10:47:43 (pid:8452) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
06/04/19 10:47:43 (pid:8452) ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
06/04/19 10:47:43 (pid:8452) ** $CondorPlatform: x86_64_RedHat6 $
06/04/19 10:47:43 (pid:8452) ** PID = 8452
06/04/19 10:47:43 (pid:8452) ** Log last touched 6/4 10:47:42
06/04/19 10:47:43 (pid:8452) ******************************************************
06/04/19 10:47:43 (pid:8452) Using config source: /etc/condor/condor_config
06/04/19 10:47:43 (pid:8452) Using local config sources:
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/10_security.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/14_network.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/20_workernode.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/30_lease.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/35_cms.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/40_ccb.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/62-benchmark.conf
06/04/19 10:47:43 (pid:8452)    /etc/condor/condor_config.local
06/04/19 10:47:43 (pid:8452) config Macros = 172, Sorted = 172, StringBytes = 6941, TablesBytes = 6296
06/04/19 10:47:43 (pid:8452) CLASSAD_CACHING is OFF
06/04/19 10:47:43 (pid:8452) Daemon Log is logging: D_ALWAYS D_ERROR
06/04/19 10:47:43 (pid:8452) Daemoncore: Listening at <10.0.2.15:45925> on TCP (ReliSock).
06/04/19 10:47:43 (pid:8452) DaemonCore: command socket at <10.0.2.15:45925?addrs=10.0.2.15-45925&noUDP>
06/04/19 10:47:43 (pid:8452) DaemonCore: private command socket at <10.0.2.15:45925?addrs=10.0.2.15-45925>
06/04/19 10:47:44 (pid:8452) CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=137.138.156.85-9618#633281
06/04/19 10:47:44 (pid:8452) Communicating with shadow <137.138.52.94:4080?addrs=137.138.52.94-4080&noUDP&sock=4298_1468_19663>
06/04/19 10:47:44 (pid:8452) Submitting machine is "vocms0267.cern.ch"
06/04/19 10:47:44 (pid:8452) setting the orig job name in starter
06/04/19 10:47:44 (pid:8452) setting the orig job iwd in starter
06/04/19 10:47:44 (pid:8452) Chirp config summary: IO false, Updates false, Delayed updates true.
06/04/19 10:47:44 (pid:8452) Initialized IO Proxy.
06/04/19 10:47:44 (pid:8452) Done setting resource limits
06/04/19 10:47:46 (pid:8452) File transfer completed successfully.
06/04/19 10:47:46 (pid:8452) Job 150691.2 set to execute immediately
06/04/19 10:47:46 (pid:8452) Starting a VANILLA universe job with ID: 150691.2
06/04/19 10:47:46 (pid:8452) IWD: /var/lib/condor/execute/dir_8452
06/04/19 10:47:46 (pid:8452) Output file: /var/lib/condor/execute/dir_8452/_condor_stdout
06/04/19 10:47:46 (pid:8452) Error file: /var/lib/condor/execute/dir_8452/_condor_stderr
06/04/19 10:47:46 (pid:8452) Renice expr "10" evaluated to 10
06/04/19 10:47:46 (pid:8452) Using wrapper /usr/local/bin/singularity_wrapper.sh to exec /var/lib/condor/execute/dir_8452/condor_exec.exe ireid_TC_OneTask_IDR_CMS_Home_190526_125903_8237-Sandbox.tar.bz2 89269 0
06/04/19 10:47:46 (pid:8452) Running job as user nobody
06/04/19 10:47:46 (pid:8452) Create_Process succeeded, pid=8466
06/04/19 10:52:47 (pid:8452) Process exited, pid=8466, status=1
06/04/19 10:52:48 (pid:8452) Got SIGQUIT. Performing fast shutdown.
06/04/19 10:52:48 (pid:8452) ShutdownFast all jobs.
06/04/19 10:52:48 (pid:8452) **** condor_starter (condor_STARTER) pid 8452 EXITING WITH STATUS 0

Does one of the experts know where the problem is located? According to finished_7.log it looks like the job can't find a valid site-local-config, but why? VirtualBox CPU usage is ~0%. The affected host has successfully crunched CMS tasks in the past. I am using a local proxy, which should be working fine (at least the Theory tasks have no problem using it).
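For readers hitting the same symptom: the fatal-exception block in a finished_*.log has fixed begin/end markers, so it can be pulled out mechanically. A minimal sketch based only on the log format shown above (the helper names are made up for illustration):

```python
import re

# CMSSW writes fatal exceptions between these fixed markers, as seen
# in the /logs/finished_7.log excerpt above.
FATAL_RE = re.compile(
    r"----- Begin Fatal Exception.*?----- End Fatal Exception",
    re.DOTALL,
)

def find_fatal_exceptions(log_text):
    """Return all fatal-exception blocks found in a finished_*.log dump."""
    return FATAL_RE.findall(log_text)

def is_site_config_error(log_text):
    """True if the job failed because site-local-config.xml was not found."""
    return any(
        "Valid site-local-config not found" in block
        for block in find_fatal_exceptions(log_text)
    )
```

This only distinguishes the site-local-config failure from other fatal exceptions; it says nothing about the cause.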
Send message Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
> Does one of the experts know where the problem is located? According to the finished_7.log it looks like it can't find a valid site-local-config, but why?

That looks like a CVMFS communication or corruption problem. On set-up we create a local site-local-config at the file location shown in your log. This shouldn't change, as it's a symlink to the T3_CH_Volunteer site-local-config, so perhaps we can rule out communications (I'm the one who maintains this file in the CVMFS repository, and I've not changed it in a while). Did you stop and restart your BOINC session while this task was running? If so, did you give it enough time to save the state of your VM for the checkpoint restart before powering down the PC? These VM tasks are a little finicky and especially don't like being suddenly interrupted. On the other hand, there was a proposed change to the startup code this week, to allow more files to be cached by a local BOINC proxy, but I haven't seen any messages saying, "Yes, that code looks OK, go ahead," nor, "Oh well, I've done it anyway!" The apparent spikes in failed jobs are worrying; they suggest something is failing, but I don't have enough monitoring to hand to see what it is. If it's a continuation of the effects seen since Easter, the actual jobs are being requeued and rerun successfully AFAICT, and the tasks that ran them are being marked successful and given BOINC credit.
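Since the failing path is a symlink into the CVMFS repository, a broken mount can be told apart from a dangling link by checking whether the link actually resolves to a file. A minimal sketch to run inside the VM, assuming /cvmfs/cms.cern.ch is mounted there (the helper name and return strings are illustrative, not part of any official tool):

```python
import os

# Path taken from the fatal exception above; this check is only
# meaningful inside the VM, where /cvmfs/cms.cern.ch is mounted.
SITE_CONF = "/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml"

def check_site_conf(path=SITE_CONF):
    """Report whether the site-local-config symlink resolves to a real file."""
    if not os.path.lexists(path):
        return "missing"                   # CVMFS not mounted, or repo incomplete
    target = os.path.realpath(path)        # follow the symlink chain
    if not os.path.isfile(target):
        return "dangling -> " + target     # link exists but target does not
    return "ok -> " + target
```

"missing" would point at a CVMFS mount problem, "dangling" at a repository inconsistency, and "ok" would suggest the job failed for a different reason than the path itself.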
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,912,993 RAC: 138,070
This error type is caused by the bootstrap changes and affects only users who have a local proxy configured via their BOINC client. These changes do not yet work as expected, and the issue is already under investigation. It affects only CMS. Volunteers who don't have a local proxy, or who configure one by other means, are not affected; if they encounter errors, those may have other causes.
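Whether the local proxy is reachable at all can be sanity-checked with a plain TCP connect before blaming the bootstrap. A minimal sketch, assuming a Squid-style proxy on its usual default port 3128 (the helper name is made up; adjust host and port to match the proxy configured in your BOINC client):

```python
import socket

def proxy_reachable(host="127.0.0.1", port=3128, timeout=2.0):
    """True if something is listening at host:port (e.g. a Squid proxy).

    This only proves a listener exists; it does not prove the proxy
    forwards requests correctly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False from inside the VM but True on the host, the VM's network configuration, not the proxy, is the likely culprit.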
Send message Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0
The new version of the CMS bootstrap fixed the issue mentioned here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4964&postid=39036. Thanks, everybody.
©2024 CERN