Message boards : News : CMS@Home disruption, Monday 22nd July

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1048
Credit: 7,477,327
RAC: 7,692
Message 39376 - Posted: 17 Jul 2019, 13:14:12 UTC

I've had the following notice from CERN/CMS IT:

>> following the hypervisor reboot campaign, as announced by CERN IT here: https://cern.service-now.com/service-portal/view-outage.do?n=OTG0051185
>> the following VMs - under the CMS Production openstack project - will be rebooted on Monday July 22 (starting at 8:30am CERN time):
...
>> | vocms0267 | cern-geneva-b | cms-home

to which I replied:
> Thanks, Alan. vocms0267 runs the CMS@Home campaign. Should I warn the volunteers of the disruption, or will it be mainly transparent?

and received this reply:
Running jobs will fail because they won't be able to connect to the schedd condor_shadow process. So this will be the visible impact on the users. There will be also a short time window (until I get the agent restarted) where there will be no jobs pending in the condor pool.
So it might be worth it giving the users a heads up.

So, my recommendation is that you set "No New Tasks" for CMS@Home sometime Sunday afternoon, to let tasks complete before the 0830 CEST restart. I'll let you know as soon as Alan informs me that vocms0267 is up and running again.
ID: 39376
Erich56

Joined: 18 Dec 15
Posts: 1748
Credit: 115,243,732
RAC: 90,566
Message 39377 - Posted: 17 Jul 2019, 15:21:38 UTC - in response to Message 39376.  

thanks, Ivan, for the early information :-)
ID: 39377
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1048
Credit: 7,477,327
RAC: 7,692
Message 39378 - Posted: 17 Jul 2019, 15:49:49 UTC - in response to Message 39377.  

OK, my current batch will run down tomorrow night, I think. I'll try to size the next one to drain late Sunday. Federica has a workflow in the pipeline, I'll wait to see how long that will take before I submit my next one.
ID: 39378
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2500
Credit: 248,159,167
RAC: 120,486
Message 39380 - Posted: 17 Jul 2019, 20:33:36 UTC

https://cern.service-now.com/service-portal/view-outage.do?n=OTG0051244
According to this announcement the hypervisor reboot campaign of CERN's GitLab machines starts at Thu Jul 18, 2019 08:30 CEST.
This may affect CMS VMs during their boot phase as they download singularity_wrapper.sh from gitlab.cern.ch.
ID: 39380
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1048
Credit: 7,477,327
RAC: 7,692
Message 39382 - Posted: 18 Jul 2019, 10:00:11 UTC - in response to Message 39380.  

I think Alan's warning over-rides the general message, but let's be wary in any case.
ID: 39382
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1048
Credit: 7,477,327
RAC: 7,692
Message 39408 - Posted: 21 Jul 2019, 7:09:33 UTC

Apologies, I'm not going to be very active for a while, not from home at least. My water pipes have developed pin-hole leaks, corroding from inside, and dumped a steady stream of water onto my broadband modem. :-( It's not going to be easy to fix, the whole house needs renovation (plumbing, wiring, heating, new bathroom & kitchen, carpets, etc.). I'm going to have to find a new place to live while it's being done, I think.
ID: 39408
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2500
Credit: 248,159,167
RAC: 120,486
Message 39410 - Posted: 21 Jul 2019, 18:38:28 UTC - in response to Message 39408.  

My water pipes have developed pin-hole leaks, corroding from inside, and dumped a steady stream of water onto my broadband modem. :-(

Hence I vote for a new sticker on tobacco packages:
"Smoking kills your internet connection!"
Could be combined with pictures showing destroyed modems.
ID: 39410
Erich56

Joined: 18 Dec 15
Posts: 1748
Credit: 115,243,732
RAC: 90,566
Message 39411 - Posted: 22 Jul 2019, 15:39:15 UTC

Obviously, most of LHC@Home is still switched off.
Any ongoing troubles with today's maintenance?
ID: 39411
djoser
Joined: 30 Aug 14
Posts: 145
Credit: 10,847,070
RAC: 0
Message 39412 - Posted: 22 Jul 2019, 15:40:30 UTC

Same here,
not only CMS, but ALL projects seem to be affected!
Why mine when you can research? - GRIDCOIN - Real cryptocurrency without wasting hashes! https://gridcoin.us
ID: 39412
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1048
Credit: 7,477,327
RAC: 7,692
Message 39414 - Posted: 22 Jul 2019, 15:45:11 UTC - in response to Message 39411.  
Last modified: 22 Jul 2019, 16:06:22 UTC

Obviously, most of LHC@Home is still switched off.
Any ongoing troubles with today's maintenance?

Yes, there were: https://cern.service-now.com/service-portal/view-outage.do?n=OTG0051185.
However, LHC@Home-dev is serving up tasks -- but they fail after ~15 minutes. WMStats shows jobs pending but none running. Our condor server does appear to be up, but there must be a blockage elsewhere in the chain.
[Later] Just had a message that our condor server has been rebooted, am awaiting signs of life. [/Later]
ID: 39414
Erich56

Joined: 18 Dec 15
Posts: 1748
Credit: 115,243,732
RAC: 90,566
Message 39415 - Posted: 22 Jul 2019, 16:28:03 UTC

Here, BOINC shows that finished tasks were uploaded but are still waiting to be reported.
ID: 39415
marsinph

Joined: 4 Dec 07
Posts: 6
Credit: 1,338,198
RAC: 0
Message 39416 - Posted: 22 Jul 2019, 17:34:12 UTC - in response to Message 39415.  

Same here.
Strange that uploads seem to be OK.
Only the scheduler seems to be working; everything else is disabled.
I have about 100 WUs waiting for validation, and the server reports only 20 for the whole world.
Don't forget to read last week's announcement!



here, BOINC shows that finished tasks were uploaded, but still waiting for being reported.
ID: 39416
Erich56

Joined: 18 Dec 15
Posts: 1748
Credit: 115,243,732
RAC: 90,566
Message 39417 - Posted: 22 Jul 2019, 17:54:55 UTC

Right now, the server status page shows that only the scheduler has been started - everything else is still down :-(

What went wrong there today?
ID: 39417
greg_be

Joined: 28 Dec 08
Posts: 334
Credit: 4,789,892
RAC: 2,031
Message 39418 - Posted: 22 Jul 2019, 22:30:13 UTC

7/22/2019 11:25:17 PM (GMT+1) | LHC@home | Server error: feeder not running

I can report tasks and they appear to be getting validated, but I do have 4 SixTrack tasks stuck in the holding cell pending validation.

The few CMS tasks I had in queue have validated and so did the ATLAS stuff.

It just looks like you can't get work right now.
ID: 39418
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 23,407,774
RAC: 15,382
Message 39420 - Posted: 23 Jul 2019, 4:21:01 UTC

Can't get new work, can't report completed work. Theory Native and Atlas VBox both affected.
ID: 39420
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 39421 - Posted: 23 Jul 2019, 6:16:24 UTC

Many tasks ready to report, on both Linux and Windows.
Tullio
ID: 39421
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 23,407,774
RAC: 15,382
Message 39422 - Posted: 23 Jul 2019, 7:46:03 UTC

Seems to be working now, thanks for the fix.
ID: 39422
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2500
Credit: 248,159,167
RAC: 120,486
Message 39424 - Posted: 23 Jul 2019, 8:43:22 UTC

Got a CMS task but it failed as there are no subtasks:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=237448622
ID: 39424
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1048
Credit: 7,477,327
RAC: 7,692
Message 40127 - Posted: 11 Oct 2019, 15:50:01 UTC

There is currently disruption to CMS jobs -- there are jobs in the queue but they are not being served out. I notified CERN IT, but it would be best to set No New Tasks to avoid lots of task failures. Unfortunately I can't check in overnight; I'll update the situation tomorrow.
ID: 40127
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1048
Credit: 7,477,327
RAC: 7,692
Message 40133 - Posted: 12 Oct 2019, 14:13:35 UTC - in response to Message 40127.  

There is currently disruption to CMS jobs -- there are jobs in the queue but they are not being served out. I notified CERN IT, but it would be best to set No New Tasks to avoid lots of task failures. Unfortunately I can't check in overnight; I'll update the situation tomorrow.

Jobs are being served, but slowly. I submitted eight tasks this afternoon, and only one got a job to run before the ten-minute time-out. I'm keeping the CERN crew notified.
ID: 40133

