Message boards : CMS Application : Hitting new heights

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1000
Credit: 6,266,299
RAC: 100
Message 33559 - Posted: 29 Dec 2017, 11:58:14 UTC

Surprisingly, for a holiday period, we are hitting new heights with the CMS@home project. We currently have just over 2,000 jobs being run -- is everyone trying out their new Christmas-present cruncher?
However, with the rising job rate comes a rising failure rate as well. I hypothesize that this is because a growing proportion of jobs is being run "out in the field", as opposed to on the small VM farm at CERN. The CERN machines provide the capacity to do tasks that are impossible for volunteer machines, e.g. merging the 60-70 MB result files into larger 2-3 GB files for more efficient file handling. They usually run 24/7, so there is almost never any problem with machines being shut down and restarted, and their network connections are obviously more stable than domestic broadband.
So, I'd just like to remind you that the virtual machines that we use for CMS (and several other applications) need a bit more love and care than a bare-metal PC. In particular, before turning the PC off, you should stop BOINC and verify that the VirtualBox VM has saved its state and finished up. With a little bit of care and attention, we should be able to keep our job success rate at a good value (currently 95-96%). Thanks.
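For anyone who would rather script the stop-and-verify step than do it by hand, here is a minimal sketch. It assumes `boinccmd` and `VBoxManage` are on the PATH and that the script runs as the same user that owns the BOINC VMs (on service installs that may be a different account). The function names, timeout and polling interval are illustrative choices, not project recommendations. Note that it waits for all running VirtualBox VMs of that user to stop, not only the BOINC ones.

```python
import subprocess
import time
from typing import List


def parse_running_vms(output: str) -> List[str]:
    """Extract VM names from `VBoxManage list runningvms` output.

    Each non-empty line is expected to look like: "vm name" {uuid}
    """
    names = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith('"') and line.count('"') >= 2:
            # The name is the text between the first pair of double quotes.
            names.append(line.split('"')[1])
    return names


def running_vms() -> List[str]:
    """Return the names of the VMs VirtualBox reports as running."""
    result = subprocess.run(
        ["VBoxManage", "list", "runningvms"],
        capture_output=True, text=True, check=True,
    )
    return parse_running_vms(result.stdout)


def stop_boinc_and_wait(timeout_s: float = 300.0, poll_s: float = 5.0) -> bool:
    """Ask the BOINC client to quit, then poll until no VM is running.

    Returns True if every VM stopped before the timeout (assumed values),
    False otherwise, in which case the PC should not be powered off yet.
    """
    subprocess.run(["boinccmd", "--quit"], check=True)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not running_vms():
            return True
        time.sleep(poll_s)
    return False
```

Calling `stop_boinc_and_wait()` before shutdown and only powering off when it returns `True` follows the advice above; if it returns `False`, give the VM more time to save its state.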
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,499,682
RAC: 15,966
Message 33560 - Posted: 29 Dec 2017, 12:34:10 UTC

I think that one reason behind the high CMS task consumption might be that the other LHC VirtualBox subprojects (ATLAS, LHCb, Theory) show their Ready To Send (RTS) queues as empty. There are still quite a lot of tasks out in the field for those subprojects, though. There are also plenty of SixTrack tasks waiting to be crunched, and that RTS queue is growing again.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1000
Credit: 6,266,299
RAC: 100
Message 33562 - Posted: 29 Dec 2017, 16:07:27 UTC - in response to Message 33560.  

Hmm, you're right. Oh well, nom, nom, nom!
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1000
Credit: 6,266,299
RAC: 100
Message 33569 - Posted: 29 Dec 2017, 20:26:44 UTC

One potential problem that might show up tomorrow or so is that we seem to be allocating jobs to tasks/machines faster than we can create them -- the queue of pending jobs is falling slowly. I'm not sure if this is a limitation within the WMAgent work-flow manager, the HTCondor job server, or the communications between the two. Actually, the queue population has levelled off in the last couple of hours at around 750 jobs so it might be self-healing (Ah, SixTrack has released some new tasks); my understanding is that we'd set the goal at 2,000 but that was when we had about half as many jobs running as we do today!
I'll monitor it as best I can but I need to get some bed-time (more heating problems; bed is the warmest place at the moment!). If you prefer to run other projects, feel free to disable the fall-back to CMS; if you run other projects, feel free to inject more tasks! :-)
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1000
Credit: 6,266,299
RAC: 100
Message 33571 - Posted: 29 Dec 2017, 21:21:14 UTC
Last modified: 29 Dec 2017, 21:21:29 UTC

The WMAgent has died! Please set NoNewTasks until I can contact an expert to restart it.
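If you prefer the command line to the "No new tasks" button in the BOINC Manager, the same setting can be applied with `boinccmd`. The sketch below is only an illustration: the project URL is an assumption (check the URL your client is actually attached to with `boinccmd --get_project_status`), and the helper names are hypothetical. Bear in mind that this setting applies to the whole project attachment, not to CMS tasks alone.

```python
import subprocess
from typing import List

# Assumption: the URL under which the project is attached in your client.
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"

# Project operations this sketch allows (both are real boinccmd operations).
ALLOWED_OPS = {"nomorework", "allowmorework"}


def project_op_command(op: str, url: str = PROJECT_URL) -> List[str]:
    """Build the boinccmd argument list for a project-level operation."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operation: {op!r}")
    return ["boinccmd", "--project", url, op]


def set_no_new_tasks(enable: bool = True, url: str = PROJECT_URL) -> None:
    """Stop (enable=True) or resume (enable=False) fetching new tasks."""
    op = "nomorework" if enable else "allowmorework"
    subprocess.run(project_op_command(op, url), check=True)
```

`set_no_new_tasks(True)` stops the client from requesting further work while tasks already downloaded carry on; `set_no_new_tasks(False)` reverses it once jobs are available again.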
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1000
Credit: 6,266,299
RAC: 100
Message 33572 - Posted: 29 Dec 2017, 21:23:43 UTC - in response to Message 33571.  

The queue is already empty, and the number of running jobs is starting to fall. Please do set NoNewTasks.
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 33573 - Posted: 29 Dec 2017, 21:48:29 UTC - in response to Message 33569.  
Last modified: 29 Dec 2017, 21:52:20 UTC

I'll monitor it as best I can but I need to get some bed-time (more heating problems; bed is the warmest place at the moment!).

We are at -6C in eastern Pennsylvania (-10C at night), so it will be a while before the Arctic air mass moves on. (I think bed is a better idea than Donald's suggestion though.)
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1000
Credit: 6,266,299
RAC: 100
Message 33574 - Posted: 29 Dec 2017, 22:50:24 UTC - in response to Message 33572.  

The queue is already empty, and the number of running jobs is starting to fall. Please do set NoNewTasks.

Luckily I was able to reach Seangchan. We have jobs again -- Go get 'em, Rex!

©2024 CERN