CMS production pause
Author | Message |
---|---|
Joined: 29 Aug 05 Posts: 1067 Credit: 8,251,951 RAC: 9,400 |
We have run into a problem with the CMS project: the merged result files processed at CERN are failing to be written to central storage. Consequently I have decided not to submit any more jobs until the experts have clarified what the problem is. The CMS job queue is about to start draining and I expect it to be empty of volunteer jobs within a few hours (there may still be post-production jobs, but these run at CERN, not on your machines). I suggest you set No New Tasks or transfer to another project until the situation is resolved. |
Joined: 29 Aug 05 Posts: 1067 Credit: 8,251,951 RAC: 9,400 |
There is also a problem affecting the CERN VMs that run the post-production jobs as T3_CH_CMSAtHome. According to the CERN Service Portal, this is still not completely resolved. I had submitted a couple of small job batches to test the failure mentioned above, but since they do not get beyond the Merge step with T3_CH_CMSAtHome hors de combat, it's not worth submitting more until it is working again. |
Joined: 18 Dec 15 Posts: 1838 Credit: 122,548,525 RAC: 118,289 |
Ivan, any idea when CMS will be on again? |
Joined: 29 Aug 05 Posts: 1067 Credit: 8,251,951 RAC: 9,400 |
OK, I'm pleased to announce that we have overcome some of the problems we were having (and some more that arose in the meantime due to CMS deciding that all its production should run in singularity containers -- in our case inside the usual virtual machines). My last test batch of ~56 test jobs completed Production successfully, two Merge jobs also ran, and the merged files have appeared on the CERN storage: `[eesridr@pion:~] > gfal-ls -l srm://srm-eoscms.cern.ch:8443/srm/v2/server?SFN=/eos/cms/store/backfill/1/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff2_CMS_Home_IDRv3co-v11/00000` `---------- 1 0 0 2147533687 Aug 16 12:15 46C4E615-45A1-E811-8C38-02163E00D1A4.root` `---------- 1 0 0 1960180311 Aug 16 12:13 7EC3E615-45A1-E811-81EB-02163E00D1A4.root` I've now submitted a ten-times larger batch so that we can look for remaining problems. If you wish, you may now tentatively start CMS tasks again, so that we complete the batch in a shorter time. Work will continue to be patchy, but we will be working towards normal production runs soon (the official conveners of the CMS Opportunistic Computing group are both on holiday at the moment, so expect test runs for the next two weeks or so before liaison with Official Production can start). Thank you for your patience, and I hope the good news continues from now on! |
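For readers who want to check how much merged output the two files in the listing above represent, here is a minimal sketch that parses the two output lines quoted in the post and sums their sizes. It assumes the listing follows the usual `ls -l` column order (mode, links, uid, gid, size, month, day, time, name); `parse_listing` is a hypothetical helper written for this illustration, not part of gfal.

```python
# Listing lines copied from the post above (output of gfal-ls -l).
LISTING = """\
---------- 1 0 0 2147533687 Aug 16 12:15 46C4E615-45A1-E811-8C38-02163E00D1A4.root
---------- 1 0 0 1960180311 Aug 16 12:13 7EC3E615-45A1-E811-81EB-02163E00D1A4.root
"""


def parse_listing(text):
    """Return (name, size_in_bytes) pairs.

    Assumes an ls -l style layout:
    mode, links, uid, gid, size, month, day, time, name.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        entries.append((fields[8], int(fields[4])))
    return entries


entries = parse_listing(LISTING)
total_bytes = sum(size for _, size in entries)

for name, size in entries:
    print(f"{name}: {size / 1e9:.2f} GB")
print(f"total: {total_bytes} bytes ({total_bytes / 1e9:.2f} GB)")
```

The same function could be fed the captured output of a real `gfal-ls -l` call, provided the column layout matches the assumption above.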
Joined: 30 Aug 14 Posts: 145 Credit: 10,847,070 RAC: 0 |
This is great news, thanks for all your efforts! Just a short question: does the RAM requirement of 2048 MB for the VM stay the same, or is more RAM needed? Regards, djoser. Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us |
Joined: 29 Aug 05 Posts: 1067 Credit: 8,251,951 RAC: 9,400 |
"This is great news, thanks for all efforts!" I believe it stays the same; I've not heard anything to the contrary. The container runs within the original VM (that's not changed), so the size wouldn't change. I currently have one machine running 12(!) tasks, and it still has 38 GB free out of 64 GB, so 26 GB being used looks like 2 GB or less per VM after overheads. |
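The estimate in the post above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in the post (64 GB total, 38 GB free, 12 tasks); `memory_per_task_gb` is a hypothetical helper name. The result is an average that still includes the host's own overheads, so the share attributable to each VM is somewhat lower.

```python
def memory_per_task_gb(total_gb, free_gb, n_tasks):
    """Average memory in use per running task, host overheads included."""
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    return (total_gb - free_gb) / n_tasks


# Figures quoted in the post: 64 GB total, 38 GB free, 12 running tasks.
used_gb = 64 - 38
per_task = memory_per_task_gb(64, 38, 12)
print(f"{used_gb} GB in use, about {per_task:.2f} GB per task including overheads")
```

An average a little above 2 GB that includes overheads is consistent with the post's reading of roughly 2 GB or less per VM.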
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
"CMS deciding that all its production should run in singularity containers" Free CMS! Forcing a task to live in a VBox is task abuse. CMS tasks are obviously sentient, they want to run in singularity. Set them free. |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,629,765 RAC: 2,577 |
Great to hear from you again, Ivan, and apparently there is some progress for CMS. I have a new BOINC task running. cmsRun is busy (first job ID f82a5cf9-5a9f-46de-9421-a1d6de90e647-20_0), but the VM console shows no job output for the running job (Alt-F2, running.log) and probably no output from the wrapper (Alt-F4, stdout.log); I will have to wait until the first job has finished to be sure. |
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
"OK, I'm pleased to announce that we have overcome some of the problems we were having (and some more in the meantime due to CMS deciding that all its production should run in singularity containers -- in our case inside the usual virtual machines)." Great news that some of the problems have been solved. Although ATLAS and CMS are using different approaches, are you thinking of releasing a "native" app for Linux hosts similar to ATLAS, since you are using singularity now? |
Joined: 29 Aug 05 Posts: 1067 Credit: 8,251,951 RAC: 9,400 |
"CMS deciding that all its production should run in singularity containers" Yeah, we wish... I wasn't quite sure about running a task within a container within a VM, but that seems to be the way the world is turning. I was also a bit surprised that cmsRun shows up in "top" (Alt-F3 after you select a CMS task and click on "Show VM Console") when it's supposed to be running within a container, but my tame Brazilian computer-science professor assured me that this is normal for the older Linux kernel we are running in our VM. I'm glad he was around, so that I didn't have to embarrass myself by asking in our CMS forums... |
Joined: 29 Aug 05 Posts: 1067 Credit: 8,251,951 RAC: 9,400 |
"OK, I'm pleased to announce that we have overcome some of the problems we were having (and some more in the meantime due to CMS deciding that all its production should run in singularity containers -- in our case inside the usual virtual machines)." "Great news that some of the problems have been solved." Not that I'm aware of. As far as I understand it, the CMS "production" approach at the moment is to run a CentOS7 image within a singularity container that can be running on either a Scientific Linux 6 or a CentOS7 host (or maybe other Linux versions depending on the site). Our current VM is SLC 6.7, and there are no immediate plans to change that. We are in a state of flux, building up to "production" taking over job submission (Hoorah, no more sleepless nights!), so things may well change in the future, but not immediately as far as I can see. We definitely don't want to alienate our Windows and Mac volunteers! |
Joined: 20 Jun 14 Posts: 381 Credit: 238,712 RAC: 0 |
The log file is now hidden within the container. We still need to find a way to expose this. |
Joined: 29 Aug 05 Posts: 1067 Credit: 8,251,951 RAC: 9,400 |
Yes, we're consulting about this, modulo problems with everyone taking summer holidays. (Why can't they do like we do at home, taking summer holidays at Christmas/New Year?) |
Joined: 20 Jun 14 Posts: 381 Credit: 238,712 RAC: 0 |
"Although ATLAS and CMS are using different approaches, are you thinking of releasing a 'native' app for linux hosts similar to ATLAS since you are using singularity now?" This is not linked. Just because CMS is using singularity does not necessarily mean we can make a native app. However, this is something we could consider, but we would end up with a container in a container. It is a direction we could go for all the VM applications. I do like the native ATLAS application and also the boinc2docker work that Marius has done for Cosmology@home. Focusing on the native app and running singularity in a VM for Windows and Mac, similar to how Docker works, could be a direction for the future. |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,629,765 RAC: 2,577 |
Suspending the new container-based CMS task for a longer period (hours) and then resuming it results in a computation error, although several jobs inside the VM were returned successfully. Example task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=205067122 MasterLog, StartLog and StarterLog available. |
Joined: 6 Sep 08 Posts: 118 Credit: 12,654,526 RAC: 4,089 |
Another one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=205073544 In principle this doesn't seem to be a new problem; all the current VM projects suffer from it, it's just worse now. Tasks seem to fail on restart if the wrapper doesn't "see" that a job has been completed. Previously this information appeared to be "saved" over the shutdown, so a failure only occurred if no task had been completed before the shutdown. (I've got lots of these...) This "saving" no longer happens, or it's hidden inside the container, so tasks fail. It's probably more complicated than this, but this is how it seems to behave here. |
Joined: 15 Jun 08 Posts: 2571 Credit: 258,913,213 RAC: 118,938 |
Some comments. 1. All my WUs fail although they definitely produce and upload intermediate results. Thus no credits. https://lhcathome.cern.ch/lhcathome/result.php?resultid=205363909 https://lhcathome.cern.ch/lhcathome/result.php?resultid=205360074 2. Consoles 2 and 4 still don't show the expected output. 3. It takes roughly 20 minutes on a 50/10 Mbit internet connection until all necessary files are downloaded and cmsRun starts working. 4. A local squid with all my "CERN extras" activated shortens the startup phase of a second WU to roughly 8 minutes. 5. cmsfrontier3.cern.ch currently fails (and is no longer included in the DNS record). |
Joined: 24 Oct 04 Posts: 1184 Credit: 57,400,876 RAC: 64,331 |
Why is Docker being tested at Cosmology instead of the ATLAS test site? On another CMS note, I see that CMS tasks still will not download over at -dev. I decided to try them again because the server said it had some, and I wanted to see whether they had come back to life now that members are trying them here again. I only tried to get one -dev CMS task since I didn't think it would work; it downloaded (that vdi) for a couple of hours, then crashed and said the download had failed. No big deal; I'm back to just running the multi-core LHCb tasks. |
Joined: 15 Jun 08 Posts: 2571 Credit: 258,913,213 RAC: 118,938 |
"5. cmsfrontier3.cern.ch currently fails (and is no longer included in the DNS record)." It seems that cmsfrontier3 has changed its IP. As of today a new IP appears in the DNS records. |
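A volunteer who wants to notice this kind of DNS change could compare the addresses returned by the resolver at two points in time. The sketch below uses only Python's standard `socket` module; `resolve_ipv4` and `dns_changed` are hypothetical helper names, and the addresses in the example are placeholders from the reserved documentation range 192.0.2.0/24, not the real addresses of any CERN host.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname,
    or an empty list if the name does not resolve."""
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:  # covers socket.gaierror and socket.herror
        return []
    return sorted(addresses)


def dns_changed(previous, current):
    """Compare two address lists and report what was added and removed."""
    before, after = set(previous), set(current)
    return {"added": sorted(after - before), "removed": sorted(before - after)}


# Placeholder addresses (documentation range), used only to show the output shape.
yesterday = ["192.0.2.10"]
today = ["192.0.2.20"]
print(dns_changed(yesterday, today))
```

In practice one would store the result of `resolve_ipv4("cmsfrontier3.cern.ch")` and pass the stored and fresh lists to `dns_changed`; an empty fresh list would correspond to the "no longer included in the DNS record" situation described earlier in the thread.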
©2025 CERN