Message boards : CMS Application : CMS jobs are becoming available again

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39030 - Posted: 3 Jun 2019, 8:39:57 UTC - in response to Message 39025.  

I'm still having issues with my home PC but at least one work server picked up new tasks seamlessly.

Ah, I'd forgotten that I'd installed VirtualBox 6 on my home machine and not got it to work yet...
ID: 39030
gyllic

Joined: 9 Dec 14
Posts: 201
Credit: 2,500,279
RAC: 651
Message 39036 - Posted: 4 Jun 2019, 9:49:13 UTC

One of my hosts shows messages like this within the VM (e.g. /logs/finished_7.log):
Setting up Frontier log level
Beginning CMSSW wrapper script
 slc6_amd64_gcc700 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Completed SCRAM project
Executing CMSSW
cmsRun  -j FrameworkJobReport.xml PSet.py
----- Begin Fatal Exception 04-Jun-2019 09:23:03 UTC-----------------------
An exception of category 'Incomplete configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
----- End Fatal Exception -------------------------------------------------
Complete
process id is 270 status is 65
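
For anyone triaging similar logs, here is a minimal Python sketch that pulls the exception category, the message, and the exit status out of a log like the one above. The sample text is copied from this excerpt; the function name `parse_fatal_exception` and the regular expressions are my own assumptions about the layout, not part of CMSSW or the BOINC wrapper.

```python
import re

# Sample text copied from the finished_7.log excerpt in this post.
SAMPLE_LOG = """\
cmsRun  -j FrameworkJobReport.xml PSet.py
----- Begin Fatal Exception 04-Jun-2019 09:23:03 UTC-----------------------
An exception of category 'Incomplete configuration' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
----- End Fatal Exception -------------------------------------------------
Complete
process id is 270 status is 65
"""


def parse_fatal_exception(text):
    """Return (category, message, status), or None if no fatal block is found.

    Hypothetical helper: the patterns are guessed from this one excerpt.
    """
    block = re.search(
        r"-+ Begin Fatal Exception.*?\n(.*?)-+ End Fatal Exception",
        text,
        re.S,
    )
    if block is None:
        return None
    body = block.group(1)
    category = re.search(r"category '([^']+)'", body)
    message = re.search(r"Exception Message:\n(.*)", body, re.S)
    status = re.search(r"status is (\d+)", text)
    return (
        category.group(1) if category else None,
        message.group(1).strip() if message else None,
        int(status.group(1)) if status else None,
    )
```

Calling `parse_fatal_exception(SAMPLE_LOG)` gives the category, the "Valid site-local-config not found ..." message, and the numeric status as a tuple, which makes it easier to compare failures across several finished_N.log files.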

The StarterLog (/logs/StarterLog) shows, for example:
06/04/19 10:47:43 (pid:8452) ** condor_starter (CONDOR_STARTER) STARTING UP
06/04/19 10:47:43 (pid:8452) ** /usr/sbin/condor_starter
06/04/19 10:47:43 (pid:8452) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
06/04/19 10:47:43 (pid:8452) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
06/04/19 10:47:43 (pid:8452) ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
06/04/19 10:47:43 (pid:8452) ** $CondorPlatform: x86_64_RedHat6 $
06/04/19 10:47:43 (pid:8452) ** PID = 8452
06/04/19 10:47:43 (pid:8452) ** Log last touched 6/4 10:47:42
06/04/19 10:47:43 (pid:8452) ******************************************************
06/04/19 10:47:43 (pid:8452) Using config source: /etc/condor/condor_config
06/04/19 10:47:43 (pid:8452) Using local config sources: 
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/10_security.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/14_network.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/20_workernode.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/30_lease.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/35_cms.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/40_ccb.config
06/04/19 10:47:43 (pid:8452)    /etc/condor/config.d/62-benchmark.conf
06/04/19 10:47:43 (pid:8452)    /etc/condor/condor_config.local
06/04/19 10:47:43 (pid:8452) config Macros = 172, Sorted = 172, StringBytes = 6941, TablesBytes = 6296
06/04/19 10:47:43 (pid:8452) CLASSAD_CACHING is OFF
06/04/19 10:47:43 (pid:8452) Daemon Log is logging: D_ALWAYS D_ERROR
06/04/19 10:47:43 (pid:8452) Daemoncore: Listening at <10.0.2.15:45925> on TCP (ReliSock).
06/04/19 10:47:43 (pid:8452) DaemonCore: command socket at <10.0.2.15:45925?addrs=10.0.2.15-45925&noUDP>
06/04/19 10:47:43 (pid:8452) DaemonCore: private command socket at <10.0.2.15:45925?addrs=10.0.2.15-45925>
06/04/19 10:47:44 (pid:8452) CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=137.138.156.85-9618#633281
06/04/19 10:47:44 (pid:8452) Communicating with shadow <137.138.52.94:4080?addrs=137.138.52.94-4080&noUDP&sock=4298_1468_19663>
06/04/19 10:47:44 (pid:8452) Submitting machine is "vocms0267.cern.ch"
06/04/19 10:47:44 (pid:8452) setting the orig job name in starter
06/04/19 10:47:44 (pid:8452) setting the orig job iwd in starter
06/04/19 10:47:44 (pid:8452) Chirp config summary: IO false, Updates false, Delayed updates true.
06/04/19 10:47:44 (pid:8452) Initialized IO Proxy.
06/04/19 10:47:44 (pid:8452) Done setting resource limits
06/04/19 10:47:46 (pid:8452) File transfer completed successfully.
06/04/19 10:47:46 (pid:8452) Job 150691.2 set to execute immediately
06/04/19 10:47:46 (pid:8452) Starting a VANILLA universe job with ID: 150691.2
06/04/19 10:47:46 (pid:8452) IWD: /var/lib/condor/execute/dir_8452
06/04/19 10:47:46 (pid:8452) Output file: /var/lib/condor/execute/dir_8452/_condor_stdout
06/04/19 10:47:46 (pid:8452) Error file: /var/lib/condor/execute/dir_8452/_condor_stderr
06/04/19 10:47:46 (pid:8452) Renice expr "10" evaluated to 10
06/04/19 10:47:46 (pid:8452) Using wrapper /usr/local/bin/singularity_wrapper.sh to exec /var/lib/condor/execute/dir_8452/condor_exec.exe ireid_TC_OneTask_IDR_CMS_Home_190526_125903_8237-Sandbox.tar.bz2 89269 0
06/04/19 10:47:46 (pid:8452) Running job as user nobody
06/04/19 10:47:46 (pid:8452) Create_Process succeeded, pid=8466
06/04/19 10:52:47 (pid:8452) Process exited, pid=8466, status=1
06/04/19 10:52:48 (pid:8452) Got SIGQUIT.  Performing fast shutdown.
06/04/19 10:52:48 (pid:8452) ShutdownFast all jobs.
06/04/19 10:52:48 (pid:8452) **** condor_starter (condor_STARTER) pid 8452 EXITING WITH STATUS 0
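
The two lines that matter most in this StarterLog are "Create_Process succeeded" and "Process exited": the job ran for about five minutes and exited with status 1. A rough Python sketch for extracting that from a StarterLog follows. The helper name `job_runtimes` is hypothetical, and the MM/DD/YY timestamp format is an assumption based on the lines shown here.

```python
import re
from datetime import datetime

# Two lines copied from the StarterLog excerpt in this post.
SAMPLE_STARTER_LOG = """\
06/04/19 10:47:46 (pid:8452) Create_Process succeeded, pid=8466
06/04/19 10:52:47 (pid:8452) Process exited, pid=8466, status=1
"""

# Assumed line layout: "MM/DD/YY HH:MM:SS (pid:N) message"
LINE = re.compile(r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \(pid:\d+\) (.*)$")
STARTED = re.compile(r"Create_Process succeeded, pid=(\d+)")
EXITED = re.compile(r"Process exited, pid=(\d+), status=(\d+)")


def job_runtimes(text):
    """Return a list of (pid, exit_status, seconds_running) tuples."""
    start_times = {}
    results = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        message = match.group(2)
        started = STARTED.match(message)
        if started:
            start_times[int(started.group(1))] = stamp
            continue
        exited = EXITED.match(message)
        if exited and int(exited.group(1)) in start_times:
            pid = int(exited.group(1))
            seconds = (stamp - start_times[pid]).total_seconds()
            results.append((pid, int(exited.group(2)), seconds))
    return results
```

A short runtime combined with a non-zero status, as here, usually points at a failure during job set-up rather than during event processing, which is consistent with the exception in finished_7.log.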


Does any of the experts know where the problem lies? According to finished_7.log, it looks like it can't find a valid site-local-config, but why?
VirtualBox CPU usage is ~0%. The affected host has successfully crunched CMS tasks in the past. I am using a local proxy, which should be working fine (at least the Theory tasks have no problem using it).
ID: 39036
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39044 - Posted: 4 Jun 2019, 18:44:55 UTC - in response to Message 39036.  

Does any of the experts know where the problem lies? According to finished_7.log, it looks like it can't find a valid site-local-config, but why?
VirtualBox CPU usage is ~0%. The affected host has successfully crunched CMS tasks in the past. I am using a local proxy, which should be working fine (at least the Theory tasks have no problem using it).

That looks like a CVMFS communication or corruption problem. On set-up we create a local site-local-config at the file location shown in your log. This shouldn't change, as it's a symlink to the T3_CH_Volunteer site-local-config, so perhaps we can rule out communications (I'm the one who maintains this file in the CVMFS repository, and I've not changed it in a while). Did you stop and restart your BOINC session while this task was running? If so, did you give it enough time to save the state of your VM for the checkpoint restart before powering down the PC? These VM tasks are a little finicky and especially don't like being suddenly interrupted.
On the other hand, there was a proposed change to the startup code this week, to allow more files to be cached by a local BOINC proxy, but I haven't seen any messages saying, "Yes, that code looks OK, go ahead," nor, "Oh well, I've done it anyway!" The apparent spikes in failed jobs are worrying; they suggest something is failing, but I don't have enough monitoring to hand to see what it is. If it's a continuation of the effects seen since Easter, the actual jobs are being requeued and rerun successfully AFAICT, and the tasks that ran them are being marked successful and given BOINC credit.
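
If you can get a shell inside the VM, a quick way to tell "missing", "broken symlink" and "corrupt file" apart is something like the Python sketch below. Only the path comes from the error message; the function name `check_site_config` and the idea of using XML well-formedness as a stand-in for what CMSSW considers "valid" are my assumptions, so treat the result as a hint rather than a verdict.

```python
import os
import xml.etree.ElementTree as ET

# Path taken from the error message in the log; everything else is illustrative.
SITE_CONFIG = "/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml"


def check_site_config(path=SITE_CONFIG):
    """Return a short diagnosis string for a site-local-config path."""
    # lexists() is true even for a symlink whose target has gone away.
    if not os.path.lexists(path):
        return "missing"
    if os.path.islink(path) and not os.path.exists(path):
        return "broken symlink -> " + os.readlink(path)
    try:
        # Assumption: a usable config must at least be well-formed XML.
        ET.parse(path)
    except ET.ParseError as err:
        return "not well-formed XML: %s" % err
    except OSError as err:
        return "unreadable: %s" % err
    return "ok -> " + os.path.realpath(path)
```

Running `print(check_site_config())` inside the VM, where /cvmfs is mounted, shows which of the cases applies. An "ok" result would suggest the file itself is fine and the problem lies elsewhere in the set-up.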
ID: 39044
computezrmle
Joined: 15 Jun 08
Posts: 1133
Credit: 55,674,412
RAC: 104,837
Message 39045 - Posted: 4 Jun 2019, 19:39:57 UTC - in response to Message 39044.  

This error type is caused by the bootstrap changes and affects only users who have a local proxy configured via their BOINC client.
These changes do not yet work as expected, and the issue is already under investigation.
It affects only CMS.

Volunteers who don't have a local proxy, or who configure it via other methods, are not affected.
If they encounter errors, these are likely due to other causes.
ID: 39045
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,682,691
RAC: 1,737
Message 39047 - Posted: 5 Jun 2019, 5:42:08 UTC - in response to Message 39045.  

Thanks for the clarification -- I'm guessing there are messages in my inbox when I log on to my Uni account...
ID: 39047
gyllic

Joined: 9 Dec 14
Posts: 201
Credit: 2,500,279
RAC: 651
Message 39083 - Posted: 8 Jun 2019, 22:50:45 UTC

The new version of the CMS bootstrap fixed the issue mentioned here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4964&postid=39036. Thanks, everybody.
ID: 39083


©2019 CERN