Message boards : News : CMS production pause
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 488
Credit: 3,834,148
RAC: 3,511
Message 35079 - Posted: 23 Apr 2018, 15:50:52 UTC

We have run into a problem with the CMS project: the merged result files processed at CERN are failing to be written to central storage. Consequently, I have decided not to submit any more jobs until the experts have clarified what the problem is. The CMS job queue is about to start draining, and I expect it to be empty of volunteer jobs within a few hours (there may still be post-production jobs, but these run at CERN, not on your machines). I suggest you set No New Tasks or transfer to another project until the situation is resolved.
ID: 35079
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 488
Credit: 3,834,148
RAC: 3,511
Message 35104 - Posted: 28 Apr 2018, 17:23:17 UTC - in response to Message 35079.  
Last modified: 29 Apr 2018, 9:32:31 UTC

There are also problems affecting the CERN VMs that run the post-production jobs as T3_CH_CMSAtHome. According to the CERN Service Portal, these are still not completely resolved.
I had submitted a couple of small job batches to test the failure mentioned above, but since they do not get beyond the Merge step while T3_CH_CMSAtHome is out of action, it's not worth submitting more until it is working again.
ID: 35104
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 488
Credit: 3,834,148
RAC: 3,511
Message 35110 - Posted: 29 Apr 2018, 9:34:24 UTC - in response to Message 35104.  

T3_CH_CMSAtHome is up again, but it appears we still can't access central storage. I've submitted another small batch to check progress.
ID: 35110
Erich56

Joined: 18 Dec 15
Posts: 804
Credit: 5,763,701
RAC: 9,978
Message 35481 - Posted: 11 Jun 2018, 5:03:35 UTC

Ivan, any idea when CMS will be on again?
ID: 35481
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 488
Credit: 3,834,148
RAC: 3,511
Message 36457 - Posted: 16 Aug 2018, 12:10:45 UTC

OK, I'm pleased to announce that we have overcome some of the problems we were having (and some more in the meantime due to CMS deciding that all its production should run in singularity containers -- in our case inside the usual virtual machines).
My last test batch of ~56 jobs completed Production successfully, two Merge jobs also ran, and the merged files have appeared on the CERN storage:

[eesridr@pion:~] > gfal-ls -l srm://srm-eoscms.cern.ch:8443/srm/v2/server?SFN=/eos/cms/store/backfill/1/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff2_CMS_Home_IDRv3co-v11/00000
---------- 1 0 0 2147533687 Aug 16 12:15 46C4E615-45A1-E811-8C38-02163E00D1A4.root
---------- 1 0 0 1960180311 Aug 16 12:13 7EC3E615-45A1-E811-81EB-02163E00D1A4.root
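
As a rough check of the listing above, here is a minimal Python sketch that reads the two lines and reports the merged file sizes. The `parse_listing` helper is hypothetical (it is not part of gfal), and it assumes the whitespace-separated column layout shown in the output: mode, links, uid, gid, size in bytes, month, day, time, filename.

```python
# Hypothetical helper for `gfal-ls -l` style lines as shown in the post.
# Assumed columns: mode, links, uid, gid, size (bytes), month, day, time, name.
LISTING = """\
---------- 1 0 0 2147533687 Aug 16 12:15 46C4E615-45A1-E811-8C38-02163E00D1A4.root
---------- 1 0 0 1960180311 Aug 16 12:13 7EC3E615-45A1-E811-81EB-02163E00D1A4.root
"""


def parse_listing(text):
    """Return a list of (filename, size_in_bytes) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        entries.append((fields[8], int(fields[4])))
    return entries


files = parse_listing(LISTING)
total_bytes = sum(size for _, size in files)

# Report each file and the total in GiB (1 GiB = 2**30 bytes).
for name, size in files:
    print(f"{name}: {size / 2**30:.2f} GiB")
print(f"total: {total_bytes / 2**30:.2f} GiB in {len(files)} files")
```

Both merged files come out at roughly 2 GiB each.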


I've now submitted a ten-times larger batch so that we can look for remaining problems. If you wish, you may now tentatively start CMS tasks again, so that we complete the batch in a shorter time. Work will continue to be patchy, but we will be working towards normal production runs soon (the official conveners of the CMS Opportunistic Computing group are both on holiday at the moment, so expect test runs for the next two weeks or so before liaison with Official Production can start).

Thank you for your patience, and I hope the good news continues from now on!
ID: 36457
djoser

Joined: 30 Aug 14
Posts: 28
Credit: 3,031,840
RAC: 2,850
Message 36459 - Posted: 16 Aug 2018, 14:02:39 UTC - in response to Message 36457.  

This is great news, thanks for all your efforts!

Just a short question:
Does the RAM requirement for the VM of 2048MB stay the same, or is more RAM needed?

Regards, djoser.
ID: 36459
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 488
Credit: 3,834,148
RAC: 3,511
Message 36462 - Posted: 16 Aug 2018, 14:18:45 UTC - in response to Message 36459.  

> This is great news, thanks for all your efforts!
>
> Just a short question:
> Does the RAM requirement for the VM of 2048MB stay the same, or is more RAM needed?

I believe it stays the same; I've not heard anything to the contrary. The container runs within the original VM (which has not changed), so the memory size shouldn't change. I currently have one machine running 12(!) tasks, and it still has 38 GB free out of 64 GB, so the 26 GB in use works out to about 2 GB or less per VM after overheads.
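
The arithmetic behind that estimate can be sketched in a few lines of Python. The figures are the ones quoted in the post; the variable names are just illustrative, and the per-VM value is only an upper bound because the memory in use also covers the host OS and anything else running on the machine.

```python
# Figures quoted in the post; the per-VM split is an estimate, since the
# "used" total also includes the host OS and other processes.
total_ram_gb = 64
free_ram_gb = 38
running_tasks = 12

used_ram_gb = total_ram_gb - free_ram_gb
upper_bound_per_vm_gb = used_ram_gb / running_tasks

print(f"used: {used_ram_gb} GB")
print(f"upper bound per VM: {upper_bound_per_vm_gb:.2f} GB")
```

Once host overheads are subtracted from the 26 GB, the per-VM share ends up at or below the 2048 MB the VM is configured with.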
ID: 36462
bronco

Joined: 13 Apr 18
Posts: 137
Credit: 640,333
RAC: 11,817
Message 36464 - Posted: 16 Aug 2018, 14:21:08 UTC - in response to Message 36457.  
Last modified: 16 Aug 2018, 14:22:48 UTC

> CMS deciding that all its production should run in singularity containers

Free CMS! Forcing a task to live in a VBox is task abuse. CMS tasks are obviously sentient, they want to run in singularity. Set them free.
ID: 36464
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 480
Credit: 3,368,441
RAC: 966
Message 36465 - Posted: 16 Aug 2018, 14:22:18 UTC - in response to Message 36457.  

Great to hear from you again, Ivan, and apparently there is some progress for CMS again.

I got a new BOINC-task running.
cmsRun is busy (1st jobid f82a5cf9-5a9f-46de-9421-a1d6de90e647-20_0),
but the VM console shows no job output for the running job (Alt-F2 - running.log) and probably no output from the wrapper (Alt-F4 - stdout.log); I have to wait until the 1st job has finished before I can tell.
ID: 36465
gyllic

Joined: 9 Dec 14
Posts: 154
Credit: 1,700,602
RAC: 1,401
Message 36467 - Posted: 16 Aug 2018, 14:44:37 UTC - in response to Message 36457.  
Last modified: 16 Aug 2018, 14:46:55 UTC

> OK, I'm pleased to announce that we have overcome some of the problems we were having (and some more in the meantime due to CMS deciding that all its production should run in singularity containers -- in our case inside the usual virtual machines).
Great news that some of the problems have been solved.

Although ATLAS and CMS are using different approaches, are you thinking of releasing a "native" app for Linux hosts similar to ATLAS, since you are using singularity now?
ID: 36467
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 488
Credit: 3,834,148
RAC: 3,511
Message 36469 - Posted: 16 Aug 2018, 17:39:31 UTC - in response to Message 36464.  

>> CMS deciding that all its production should run in singularity containers
>
> Free CMS! Forcing a task to live in a VBox is task abuse. CMS tasks are obviously sentient, they want to run in singularity. Set them free.

Yeah, we wish... I wasn't quite sure about running a task within a container within a VM, but that seems to be the way the world is turning.
I was also a bit surprised that cmsRun shows up in "top" (Alt-F3 after you select a CMS task and click on "Show VM Console") when it's supposed to be running within a container, but my tame Brazilian computer-science professor assured me that this is normal for the older Linux kernel we are running in our VM. I'm glad he was around, so I didn't have to embarrass myself by asking in our CMS forums...
ID: 36469
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 488
Credit: 3,834,148
RAC: 3,511
Message 36470 - Posted: 16 Aug 2018, 17:47:32 UTC - in response to Message 36467.  

>> OK, I'm pleased to announce that we have overcome some of the problems we were having (and some more in the meantime due to CMS deciding that all its production should run in singularity containers -- in our case inside the usual virtual machines).
> Great news that some of the problems have been solved.
>
> Although ATLAS and CMS are using different approaches, are you thinking of releasing a "native" app for Linux hosts similar to ATLAS, since you are using singularity now?

Not that I'm aware of. As far as I understand it, the CMS "production" approach at the moment is to run a CentOS7 image within a singularity container that can run on either a Scientific Linux 6 or a CentOS7 host (or maybe other Linux versions depending on the site). Our current VM is SLC 6.7, and there are no immediate plans to change that. We are in a state of flux, building up to "production" taking over job submission (Hoorah, no more sleepless nights!), so things may well change in the future, but not immediately as far as I can see. We definitely don't want to alienate our Windows and Mac volunteers!
ID: 36470
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 198
Credit: 212,014
RAC: 141
Message 36472 - Posted: 16 Aug 2018, 21:00:17 UTC - in response to Message 36465.  


> I got a new BOINC-task running.
> cmsRun is busy (1st jobid f82a5cf9-5a9f-46de-9421-a1d6de90e647-20_0),
> but the VM console shows no job output for the running job (Alt-F2 - running.log) and probably no output from the wrapper (Alt-F4 - stdout.log); I have to wait until the 1st job has finished before I can tell.


The log file is now hidden within the container. We still need to find a way to expose this.
ID: 36472
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 488
Credit: 3,834,148
RAC: 3,511
Message 36473 - Posted: 16 Aug 2018, 21:04:23 UTC - in response to Message 36472.  


>> I got a new BOINC-task running.
>> cmsRun is busy (1st jobid f82a5cf9-5a9f-46de-9421-a1d6de90e647-20_0),
>> but the VM console shows no job output for the running job (Alt-F2 - running.log) and probably no output from the wrapper (Alt-F4 - stdout.log); I have to wait until the 1st job has finished before I can tell.
>
> The log file is now hidden within the container. We still need to find a way to expose this.

Yes, we're consulting about this, modulo problems with everyone taking summer holidays. (Why can't they do like we do at home, taking summer holidays at Christmas/New Year?)
ID: 36473
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 198
Credit: 212,014
RAC: 141
Message 36474 - Posted: 16 Aug 2018, 21:10:50 UTC - in response to Message 36467.  

> Although ATLAS and CMS are using different approaches, are you thinking of releasing a "native" app for Linux hosts similar to ATLAS, since you are using singularity now?


This is not linked. Just because CMS is using singularity does not necessarily mean we can make a native app. It is something we could consider, but we would end up with a container in a container. It is a direction we could go for all the VM applications. I do like the native ATLAS application and also the boinc2docker work that Marius has done for Cosmology@home. Focusing on the native app and running singularity in a VM for Windows and Mac, similar to how Docker works, could be a direction for the future.
ID: 36474
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 480
Credit: 3,368,441
RAC: 966
Message 36476 - Posted: 17 Aug 2018, 6:26:07 UTC

Suspending for a longer period (hours) and then resuming the new container-based CMS task results in a computation error, although several jobs inside the VM were returned successfully.

Example task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=205067122
ID: 36476
