CMS@Home disruption, Monday 22nd July
Joined: 29 Aug 05, Posts: 1048, Credit: 7,510,992, RAC: 7,518
I've had the following notice from CERN/CMS IT:

>> following the hypervisor reboot campaign, as announced by CERN IT here: https://cern.service-now.com/service-portal/view-outage.do?n=OTG0051185
>> the following VMs - under the CMS Production openstack project - will be rebooted on Monday July 22 (starting at 8:30am CERN time): ...
>> | vocms0267 | cern-geneva-b | cms-home

to which I replied:

> Thanks, Alan. vocms0267 runs the CMS@Home campaign. Should I warn the volunteers of the disruption, or will it be mainly transparent?

and received this reply:

> Running jobs will fail because they won't be able to connect to the schedd condor_shadow process. So this will be the visible impact on the users. There will also be a short time window (until I get the agent restarted) where there will be no jobs pending in the condor pool. So it might be worth giving the users a heads-up.

So, my recommendation is that you set "No New Tasks" for CMS@Home sometime Sunday afternoon, to let tasks complete before the 08:30 CEST restart. I'll let you know as soon as Alan informs me that vocms0267 is up and running again.
Joined: 18 Dec 15, Posts: 1749, Credit: 115,658,029, RAC: 87,149
Thanks, Ivan, for the early information :-)
Joined: 15 Jun 08, Posts: 2506, Credit: 248,884,286, RAC: 129,129
https://cern.service-now.com/service-portal/view-outage.do?n=OTG0051244

According to this announcement, the hypervisor reboot campaign for CERN's GitLab machines starts on Thu Jul 18, 2019, at 08:30 CEST. This may affect CMS VMs during their boot phase, as they download singularity_wrapper.sh from gitlab.cern.ch.
Joined: 29 Aug 05, Posts: 1048, Credit: 7,510,992, RAC: 7,518
Apologies, I'm not going to be very active for a while, not from home at least. My water pipes have developed pin-hole leaks, corroding from the inside, and dumped a steady stream of water onto my broadband modem. :-(

It's not going to be easy to fix; the whole house needs renovation (plumbing, wiring, heating, new bathroom & kitchen, carpets, etc.). I think I'm going to have to find a new place to live while it's being done.
Joined: 15 Jun 08, Posts: 2506, Credit: 248,884,286, RAC: 129,129
> My water pipes have developed pin-hole leaks, corroding from inside, and dumped a steady stream of water onto my broadband modem. :-(

Hence I vote for a new sticker on tobacco packages: "Smoking kills your internet connection!" It could be combined with pictures showing destroyed modems.
Joined: 18 Dec 15, Posts: 1749, Credit: 115,658,029, RAC: 87,149
Obviously, most of LHC@Home is still switched off. Any ongoing troubles with today's maintenance?
Joined: 30 Aug 14, Posts: 145, Credit: 10,847,070, RAC: 0
Same here: not only CMS, but ALL projects seem to be affected!

Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
Joined: 29 Aug 05, Posts: 1048, Credit: 7,510,992, RAC: 7,518
> Obviously, most of LHC@Home is still switched off.

Yes, there were: https://cern.service-now.com/service-portal/view-outage.do?n=OTG0051185

However, LHC@Home-dev is serving up tasks -- but they fail after ~15 minutes. WMStats shows jobs pending but none running. Our condor server does appear to be up, but there must be a blockage elsewhere in the chain.

[Later] Just had a message that our condor server has been rebooted; am awaiting signs of life. [/Later]
Joined: 18 Dec 15, Posts: 1749, Credit: 115,658,029, RAC: 87,149
Here, BOINC shows that finished tasks were uploaded but are still waiting to be reported.
Joined: 4 Dec 07, Posts: 6, Credit: 1,338,198, RAC: 0
> here, BOINC shows that finished tasks were uploaded, but still waiting for being reported.

Same here. Strange that the upload seems to be OK; only the scheduler seems to work, and everything else is disabled. I have about 100 WUs waiting for validation, and the server reports only 20 for the whole world. Don't forget to read the announcement from last week!
Joined: 18 Dec 15, Posts: 1749, Credit: 115,658,029, RAC: 87,149
Right now, the server status page shows that only the scheduler has been started - everything else is still down :-( What went wrong there today?
Joined: 28 Dec 08, Posts: 334, Credit: 4,803,132, RAC: 2,298
7/22/2019 11:25:17 PM (GMT+1) | LHC@home | Server error: feeder not running

I can report tasks, and they appear to be getting validated, but I do have 4 SixTrack tasks stuck in the holding cell pending validation. The few CMS tasks I had in the queue have validated, and so did the ATLAS stuff. It just looks like you can't get new work right now.
Joined: 7 Aug 11, Posts: 93, Credit: 23,663,900, RAC: 29,344
Can't get new work, can't report completed work. Theory Native and ATLAS VBox are both affected.
Joined: 19 Feb 08, Posts: 708, Credit: 4,336,250, RAC: 0
Many tasks ready to report, on both Linux and Windows. Tullio
Joined: 7 Aug 11, Posts: 93, Credit: 23,663,900, RAC: 29,344
Seems to be working now, thanks for the fix.
Joined: 15 Jun 08, Posts: 2506, Credit: 248,884,286, RAC: 129,129
Got a CMS task, but it failed because there are no subtasks: https://lhcathome.cern.ch/lhcathome/result.php?resultid=237448622
Joined: 29 Aug 05, Posts: 1048, Credit: 7,510,992, RAC: 7,518
There is currently disruption to CMS jobs -- there are jobs in the queue, but they are not being served out. I've notified CERN IT, but it would be best to set No New Tasks to avoid lots of task failures. Unfortunately I can't check in overnight; I'll update the situation tomorrow.

Jobs are now being served, but slowly. I submitted eight tasks this afternoon, and only one got a job to run before the ten-minute time-out. I'm keeping the CERN crew notified.
©2024 CERN