Thread 'Hitting new heights'

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 33559 - Posted: 29 Dec 2017, 11:58:14 UTC Surprisingly, for a holiday period, we are hitting new heights with the CMS@home project. We currently have just over 2,000 jobs being run -- is everyone trying out their new Christmas-present cruncher? However, with the rising job rate comes a rising failure rate as well. I hypothesize that this is because of the rising proportion of jobs being run "out in the field", as opposed to those which run on a small VM farm at CERN. The farm machines provide the capacity to do tasks that are impossible for volunteer machines, e.g. merging the 60-70 MB result files into larger 2-3 GB files for more efficient file handling. They usually run 24/7, so there is almost never any problem with machines being shut down and restarted, and their network connections are obviously more stable than domestic broadband. So, I'd just like to remind you that the Virtual Machines that we use for CMS (and several other applications) need a bit more love and care than a bare-metal PC. In particular, before turning the PC off, stop BOINC and verify that the VirtualBox VM has saved its state and finished up. With a little bit of care and attention, we should be able to keep our job success rate at a good value (currently 95-96%). Thanks.
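
The shutdown advice above (stop BOINC, then confirm the VirtualBox VM has finished before powering off) could be scripted roughly as follows. This is only a sketch under assumptions that are not from the post: `boinccmd` and `VBoxManage` are on the PATH, `boinccmd` is allowed to talk to the local client (it may need to be run from the BOINC data directory or given a password), and every running VM on the machine belongs to BOINC. The function names are hypothetical.

```python
import subprocess
import time


def parse_running_vms(listing: str) -> list[str]:
    """Extract VM names from `VBoxManage list runningvms` output.

    Each line of that output has the form: "vm name" {uuid}
    """
    names = []
    for line in listing.splitlines():
        line = line.strip()
        end = line.rfind('" {')
        if line.startswith('"') and end > 0:
            names.append(line[1:end])
    return names


def stop_boinc_and_wait(timeout_s: float = 300.0, poll_s: float = 5.0) -> bool:
    """Ask the BOINC client to quit, then wait until no VM is running.

    Returns True if it looks safe to power off, False on timeout.
    Assumption: any VM still running counts as "not finished yet".
    """
    subprocess.run(["boinccmd", "--quit"], check=True)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["VBoxManage", "list", "runningvms"],
            capture_output=True, text=True, check=True,
        )
        if not parse_running_vms(result.stdout):
            return True
        time.sleep(poll_s)
    return False


if __name__ == "__main__":
    if stop_boinc_and_wait():
        print("No VMs running; OK to shut down.")
    else:
        print("A VM is still running; wait before powering off.")
```

If other, non-BOINC VMs run on the same machine, the check would need to filter the returned names instead of requiring an empty list.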

Harri Liljeroos Joined: 28 Sep 04 Posts: 799 Credit: 65,020,733 RAC: 27,613	Message 33560 - Posted: 29 Dec 2017, 12:34:10 UTC I think that one reason behind the high CMS task consumption might be that the other LHC VirtualBox subprojects (Atlas, LHCb, Theory) show their Ready To Send (RTS) queues as empty. There are still quite a lot of tasks out in the field for those subprojects. And plenty of SixTrack tasks are waiting to be crunched, and that RTS queue is growing again.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 33562 - Posted: 29 Dec 2017, 16:07:27 UTC - in response to Message 33560. Hmm, you're right. Oh well, nom, nom, nom!

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 33569 - Posted: 29 Dec 2017, 20:26:44 UTC One potential problem that might show up tomorrow or so is that we seem to be allocating jobs to tasks/machines faster than we can create them -- the queue of pending jobs is falling slowly. I'm not sure if this is a limitation within the WMAgent work-flow manager, the HTCondor job server, or the communications between the two. Actually, the queue population has levelled off in the last couple of hours at around 750 jobs so it might be self-healing (Ah, SixTrack has released some new tasks); my understanding is that we'd set the goal at 2,000 but that was when we had about half as many jobs running as we do today! I'll monitor it as best I can but I need to get some bed-time (more heating problems; bed is the warmest place at the moment!). If you prefer to run other projects, feel free to disable the fall-back to CMS; if you run other projects, feel free to inject more tasks! :-)

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 33571 - Posted: 29 Dec 2017, 21:21:14 UTC Last modified: 29 Dec 2017, 21:21:29 UTC The WMAgent has died! Please set NoNewTasks until I can contact an expert to restart it.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 33572 - Posted: 29 Dec 2017, 21:23:43 UTC - in response to Message 33571. The queue is already empty, and the number of running jobs is starting to fall. Please do set NoNewTasks.

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 33573 - Posted: 29 Dec 2017, 21:48:29 UTC - in response to Message 33569. Last modified: 29 Dec 2017, 21:52:20 UTC "I'll monitor it as best I can but I need to get some bed-time (more heating problems; bed is the warmest place at the moment!)." We are at -6C in eastern Pennsylvania (-10C at night), so it will be a while before the Arctic air mass moves on. (I think bed is a better idea than Donald's suggestion though.)

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 33574 - Posted: 29 Dec 2017, 22:50:24 UTC - in response to Message 33572. "The queue is already empty, and the number of running jobs is starting to fall. Please do set NoNewTasks." Luckily I was able to reach Seangchan. We have jobs again -- Go get 'em, Rex!