Message boards : News : CMS production pause
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 35079 - Posted: 23 Apr 2018, 15:50:52 UTC

We have run into a problem with the CMS project -- the merged result files processed at CERN are failing to be written to central storage. Consequently, I have decided not to submit any more jobs until the experts have clarified what the problem is. The CMS job queue is about to start draining and I expect it to be empty of volunteer jobs within a few hours (there may still be post-production jobs, but these run at CERN, not on your machines). I suggest you set No New Tasks or transfer to another project until the situation is resolved.
ID: 35079
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 35104 - Posted: 28 Apr 2018, 17:23:17 UTC - in response to Message 35079.  
Last modified: 29 Apr 2018, 9:32:31 UTC

There is also a problem affecting the CERN VMs that run the post-production jobs as T3_CH_CMSAtHome. According to the CERN Service Portal, it is still not completely resolved.
I had submitted a couple of small job batches to test the failure mentioned above, but since they cannot get beyond the Merge step while T3_CH_CMSAtHome is hors de combat, it's not worth submitting more until it is working again.
ID: 35104
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 35110 - Posted: 29 Apr 2018, 9:34:24 UTC - in response to Message 35104.  

T3_CH_CMSAtHome is up again, but it appears we still can't access central storage. I've submitted another small batch to check progress.
ID: 35110
Erich56

Joined: 18 Dec 15
Posts: 1681
Credit: 99,381,381
RAC: 110,980
Message 35481 - Posted: 11 Jun 2018, 5:03:35 UTC

Ivan, any idea when CMS will be on again?
ID: 35481
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 36457 - Posted: 16 Aug 2018, 12:10:45 UTC

OK, I'm pleased to announce that we have overcome some of the problems we were having (and some more in the meantime due to CMS deciding that all its production should run in singularity containers -- in our case inside the usual virtual machines).
My last test batch of ~56 test jobs completed Production successfully, two Merge jobs also ran, and the merged files have appeared on the CERN storage:

[eesridr@pion:~] > gfal-ls -l srm://srm-eoscms.cern.ch:8443/srm/v2/server?SFN=/eos/cms/store/backfill/1/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff2_CMS_Home_IDRv3co-v11/00000
---------- 1 0 0 2147533687 Aug 16 12:15 46C4E615-45A1-E811-8C38-02163E00D1A4.root
---------- 1 0 0 1960180311 Aug 16 12:13 7EC3E615-45A1-E811-81EB-02163E00D1A4.root
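
As a quick sanity check on the listing above, the sizes of the two merged files can be totalled with a short script. This is only a minimal sketch: it assumes the `ls -l`-style column order shown in the listing (size in the fifth column, file name in the ninth), and `parse_listing` is a hypothetical helper written for illustration, not part of the gfal tools.

```python
# Listing lines copied from the gfal-ls output above.
LISTING = """\
---------- 1 0 0 2147533687 Aug 16 12:15 46C4E615-45A1-E811-8C38-02163E00D1A4.root
---------- 1 0 0 1960180311 Aug 16 12:13 7EC3E615-45A1-E811-81EB-02163E00D1A4.root
"""


def parse_listing(text):
    """Return (name, size_in_bytes) pairs from `ls -l`-style lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        # Assumed columns: mode, links, uid, gid, size, month, day, time, name
        entries.append((fields[8], int(fields[4])))
    return entries


entries = parse_listing(LISTING)
total_bytes = sum(size for _, size in entries)

for name, size in entries:
    print(f"{name}: {size / 1e9:.2f} GB")
print(f"total: {total_bytes / 1e9:.2f} GB in {len(entries)} files")
```

Both files come out at roughly 2 GB each, a little over 4 GB in total.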


I've now submitted a ten-times larger batch so that we can look for remaining problems. If you wish, you may now tentatively start CMS tasks again, so that we complete the batch in a shorter time. Work will continue to be patchy, but we will be working towards normal production runs soon (the official conveners of the CMS Opportunistic Computing group are both on holiday at the moment, so expect test runs for the next two weeks or so before liaison with Official Production can start).

Thank you for your patience, and I hope the good news continues from now on!
ID: 36457
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 36459 - Posted: 16 Aug 2018, 14:02:39 UTC - in response to Message 36457.  

This is great news, thanks for all your efforts!

Just a short question:
Does the RAM requirement for the VM of 2048MB stay the same, or is more RAM needed?

Regards, djoser.
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
ID: 36459
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 36462 - Posted: 16 Aug 2018, 14:18:45 UTC - in response to Message 36459.  

This is great news, thanks for all your efforts!

Just a short question:
Does the RAM requirement for the VM of 2048MB stay the same, or is more RAM needed?

I believe it stays the same; I've not heard anything to the contrary. The container runs within the original VM (which has not changed), so the size wouldn't change. I currently have one machine running 12(!) tasks, and it still has 38 GB free out of 64 GB, so the 26 GB in use works out to about 2 GB or less per VM after overheads.
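
The arithmetic behind that estimate can be sketched as follows. It assumes, for illustration only, that all memory in use on the host is shared evenly among the 12 VMs, so the result is an upper bound per VM before subtracting the host's own overheads.

```python
total_gb = 64       # installed RAM quoted in the post
free_gb = 38        # free RAM quoted in the post
running_tasks = 12  # CMS tasks running on that machine

used_gb = total_gb - free_gb
# Upper bound: attributes every used GB to the VMs, including host overheads.
per_task_gb = used_gb / running_tasks

print(f"{used_gb} GB used, at most about {per_task_gb:.2f} GB per task")
```

Once the memory used by the host operating system and other processes is taken out, the per-VM figure falls to around the 2 GB configured for the VM.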
ID: 36462
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36464 - Posted: 16 Aug 2018, 14:21:08 UTC - in response to Message 36457.  
Last modified: 16 Aug 2018, 14:22:48 UTC

CMS deciding that all its production should run in singularity containers

Free CMS! Forcing a task to live in a VBox is task abuse. CMS tasks are obviously sentient, they want to run in singularity. Set them free.
ID: 36464
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1263
Credit: 8,420,582
RAC: 5,321
Message 36465 - Posted: 16 Aug 2018, 14:22:18 UTC - in response to Message 36457.  

Great to hear from you again, Ivan, and apparently there is some progress for CMS again.

I got a new BOINC-task running.
cmsRun busy (1st jobid f82a5cf9-5a9f-46de-9421-a1d6de90e647-20_0),
but in the VM console there is no job output for the running job (Alt-F2 - running.log) and probably no output from the wrapper (Alt-F4 - stdout.log), but I have to wait until the 1st job has finished.
ID: 36465
gyllic

Joined: 9 Dec 14
Posts: 202
Credit: 2,533,875
RAC: 0
Message 36467 - Posted: 16 Aug 2018, 14:44:37 UTC - in response to Message 36457.  
Last modified: 16 Aug 2018, 14:46:55 UTC

OK, I'm pleased to announce that we have overcome some of the problems we were having (and some more in the meantime due to CMS deciding that all its production should run in singularity containers -- in our case inside the usual virtual machines).
Great news that some of the problems have been solved.

Although ATLAS and CMS are using different approaches, are you thinking of releasing a "native" app for linux hosts similar to ATLAS since you are using singularity now?
ID: 36467
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 36469 - Posted: 16 Aug 2018, 17:39:31 UTC - in response to Message 36464.  

CMS deciding that all its production should run in singularity containers

Free CMS! Forcing a task to live in a VBox is task abuse. CMS tasks are obviously sentient, they want to run in singularity. Set them free.

Yeah, we wish... I wasn't quite sure about running a task within a container within a VM, but that seems to be the way the world is turning.
I was also a bit surprised that cmsRun shows up in "top" (Alt-F3 after you select a CMS task and click on "Show VM Console") when it's supposed to be running within a container, but my tame Brazilian computer-science professor assured me that that is normal for the older Linux kernel we are running in our VM. Glad he was around rather than embarrassing myself asking in our CMS forums...
ID: 36469
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 36470 - Posted: 16 Aug 2018, 17:47:32 UTC - in response to Message 36467.  

OK, I'm pleased to announce that we have overcome some of the problems we were having (and some more in the meantime due to CMS deciding that all its production should run in singularity containers -- in our case inside the usual virtual machines).
Great news that some of the problems have been solved.

Although ATLAS and CMS are using different approaches, are you thinking of releasing a "native" app for linux hosts similar to ATLAS since you are using singularity now?

Not that I'm aware of. As far as I understand it, the CMS "production" approach at the moment is to run a CentOS7 image within a singularity container that can be running in either a Scientific Linux 6 or a CentOS7 host (or maybe other Linux versions depending on the site). Our current VM is SLC 6.7, and there are no immediate plans to change that. We are in a state of flux, building up to "production" taking over job submission (Hoorah, no more sleepless nights!) so things may well change in the future, but not immediately as far as I can see. We definitely don't want to alienate our Windows and Mac volunteers!
ID: 36470
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 371
Credit: 238,712
RAC: 0
Message 36472 - Posted: 16 Aug 2018, 21:00:17 UTC - in response to Message 36465.  


I got a new BOINC-task running.
cmsRun busy (1st jobid f82a5cf9-5a9f-46de-9421-a1d6de90e647-20_0),
but in the VM console there is no job output for the running job (Alt-F2 - running.log) and probably no output from the wrapper (Alt-F4 - stdout.log), but I have to wait until the 1st job has finished.


The log file is now hidden within the container. We still need to find a way to expose this.
ID: 36472
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 990
Credit: 6,264,307
RAC: 191
Message 36473 - Posted: 16 Aug 2018, 21:04:23 UTC - in response to Message 36472.  


I got a new BOINC-task running.
cmsRun busy (1st jobid f82a5cf9-5a9f-46de-9421-a1d6de90e647-20_0),
but in the VM console there is no job output for the running job (Alt-F2 - running.log) and probably no output from the wrapper (Alt-F4 - stdout.log), but I have to wait until the 1st job has finished.


The log file is now hidden within the container. We still need to find a way to expose this.

Yes, we're consulting about this, modulo problems with everyone taking summer holidays. (Why can't they do like we do at home, taking summer holidays at Christmas/New Year?)
ID: 36473
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 371
Credit: 238,712
RAC: 0
Message 36474 - Posted: 16 Aug 2018, 21:10:50 UTC - in response to Message 36467.  

Although ATLAS and CMS are using different approaches, are you thinking of releasing a "native" app for linux hosts similar to ATLAS since you are using singularity now?


This is not linked. Just because CMS is using singularity does not necessarily mean we can make a native app. However, this is something we could consider, but we would end up with a container in a container. It is a direction we could go for all the VM applications. I do like the native ATLAS application and also the boinc2docker work that Marius has done for Cosmology@home. Focusing on the native app and running singularity in a VM for Windows and Mac, similar to how Docker works, could be a direction for the future.
ID: 36474
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1263
Credit: 8,420,582
RAC: 5,321
Message 36476 - Posted: 17 Aug 2018, 6:26:07 UTC
Last modified: 17 Aug 2018, 7:40:16 UTC

Suspending for a longer period (hours) and then resuming the new container-based CMS task results in a computation error, although several jobs inside the VM were returned successfully.

Example task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=205067122

MasterLog, StartLog and StarterLog available.
ID: 36476
m

Joined: 6 Sep 08
Posts: 115
Credit: 10,921,565
RAC: 5,347
Message 36479 - Posted: 17 Aug 2018, 9:27:39 UTC - in response to Message 36476.  
Last modified: 17 Aug 2018, 9:33:22 UTC

Another one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=205073544

In principle this doesn't seem to be a new problem; all the current VM projects suffer from it. It's just worse now.
Tasks seem to fail on restart if the wrapper doesn't "see" that a job has been completed.
Previously this information appeared to be "saved" over the shutdown,
so a failure only occurred if no task had been completed before the shutdown. (I've got lots of these...)
This "saving" no longer happens, or it's hidden inside the container, so tasks fail.

It's probably more complicated than this, but this is how it seems to behave here.
ID: 36479
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2375
Credit: 221,689,862
RAC: 142,909
Message 36530 - Posted: 22 Aug 2018, 18:27:16 UTC

Some comments.

1. All my WUs fail although they definitely produce and upload intermediate results.
Thus no credits.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=205363909
https://lhcathome.cern.ch/lhcathome/result.php?resultid=205360074

2. Consoles 2 and 4 still don't show the expected output.

3. It takes roughly 20 minutes on a 50/10 Mbit internet connection until all necessary files are downloaded and cmsRun starts working.

4. A local squid with all my "CERN extras" activated shortens the startup phase of a second WU to roughly 8 minutes.

5. cmsfrontier3.cern.ch currently fails (and is no longer included in the DNS record).
ID: 36530
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1112
Credit: 49,477,561
RAC: 6,363
Message 36531 - Posted: 22 Aug 2018, 20:18:10 UTC - in response to Message 36474.  



This is not linked. Just because CMS is using singularity does not necessarily mean we can make a native app. However, this is something we could consider, but we would end up with a container in a container. It is a direction we could go for all the VM applications. I do like the native ATLAS application and also the boinc2docker work that Marius has done for Cosmology@home. Focusing on the native app and running singularity in a VM for Windows and Mac, similar to how Docker works, could be a direction for the future.


Why is Docker being tested at Cosmology@home instead of the ATLAS test site?

On another CMS note, I see that CMS tasks will still not download over at -dev. I decided to try them again because the server said some were available, and I wanted to see whether they had come back to life now that members are trying them here again.

I only tried to get one -dev CMS task, since I didn't think it would work. It did download for a couple of hours (that .vdi), then crashed and said the download failed. No big deal; I'm back to just running the multi-core LHCb tasks.
ID: 36531
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2375
Credit: 221,689,862
RAC: 142,909
Message 36533 - Posted: 23 Aug 2018, 5:41:43 UTC - in response to Message 36530.  

5. cmsfrontier3.cern.ch currently fails (and is no longer included in the DNS record).

It seems that cmsfrontier3 has changed its IP.
As of today a new IP appears in the DNS records.
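
A check like this can be repeated with a few lines of Python. The following is a minimal sketch using only the standard `socket` module; `resolve_ipv4` is a hypothetical helper name, and the `resolver` argument is an assumption added here so the function can be exercised without network access.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IPv4 addresses for hostname, or [] if lookup fails."""
    try:
        infos = resolver(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    except OSError:
        # socket.gaierror (name not found) is a subclass of OSError.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_ipv4("cmsfrontier3.cern.ch")` on two different days and comparing the returned lists would show whether the address in the DNS record has changed; an empty list means the name did not resolve at all.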
ID: 36533
©2024 CERN