Hitting new heights
Author | Message |
---|---|
Joined: 29 Aug 05 Posts: 1005 Credit: 6,269,877 RAC: 404 |
Surprisingly for a holiday period, we are hitting new heights with the CMS@home project. We currently have just over 2,000 jobs running -- is everyone trying out their new Christmas-present cruncher? However, with the rising job rate comes a rising failure rate. I hypothesize that this is because a growing proportion of jobs is being run "out in the field" on volunteer machines, as opposed to on the small VM farm at CERN. The CERN machines provide the capacity for tasks that are impossible for volunteer machines, e.g. merging the 60-70 MB result files into larger 2-3 GB files for more efficient file handling. They usually run 24/7, so there is almost never a problem with machines being shut down and restarted, and their network connections are obviously more stable than domestic broadband. So I'd just like to remind you that the virtual machines we use for CMS (and several other applications) need a bit more love and care than a bare-metal PC. In particular, before turning the PC off, stop BOINC and verify that the VirtualBox VM has saved its state and finished up. With a little care and attention, we should be able to keep our job success rate at a good value (currently 95-96%). Thanks. |
Joined: 28 Sep 04 Posts: 675 Credit: 43,547,331 RAC: 15,488 |
I think one reason for the high CMS task consumption might be that the other LHC VirtualBox subprojects (ATLAS, LHCb, Theory) show their Ready To Send (RTS) queues as empty, although there are still quite a lot of tasks out in the field for those subprojects. There are also plenty of SixTrack tasks waiting to be crunched, and that RTS queue is growing again. |
Joined: 29 Aug 05 Posts: 1005 Credit: 6,269,877 RAC: 404 |
One potential problem that might show up tomorrow or so is that we seem to be allocating jobs to tasks/machines faster than we can create them -- the queue of pending jobs is falling slowly. I'm not sure whether this is a limitation within the WMAgent workflow manager, the HTCondor job server, or the communications between the two. Actually, the queue population has levelled off in the last couple of hours at around 750 jobs, so it might be self-healing (ah, SixTrack has released some new tasks). My understanding is that we'd set the goal at 2,000, but that was when we had about half as many jobs running as we do today! I'll monitor it as best I can, but I need to get some bed-time (more heating problems; bed is the warmest place at the moment!). If you prefer to run other projects, feel free to disable the fall-back to CMS; if you manage other projects, feel free to inject more tasks! :-) |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
"I'll monitor it as best I can but I need to get some bed-time (more heating problems; bed is the warmest place at the moment!)." We are at -6 °C in eastern Pennsylvania (-10 °C at night), so it will be a while before the Arctic air mass moves on. (I think bed is a better idea than Donald's suggestion though.) |
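For anyone who wants to script the shutdown advice from the first post (stop BOINC, then confirm the VirtualBox VM has finished up before powering off), here is a minimal Python sketch. It is not an official LHC@home procedure. It assumes that `boinccmd` and `VBoxManage` are on your PATH, that `boinccmd` is allowed to talk to the local client, and that the script runs as the same user that owns the BOINC VMs (on some installs BOINC runs under a service account, in which case `VBoxManage` run as your own user will not see them). The function names and the timeout values are my own and purely illustrative.

```python
import re
import subprocess
import time

# Each line of `VBoxManage list runningvms` looks like: "vm name" {uuid}
_VM_LINE = re.compile(r'^"(?P<name>.*)"\s+\{(?P<uuid>[0-9a-fA-F-]+)\}\s*$')


def parse_running_vms(listing):
    """Return the VM names found in a `VBoxManage list runningvms` listing."""
    names = []
    for line in listing.splitlines():
        match = _VM_LINE.match(line.strip())
        if match:
            names.append(match.group("name"))
    return names


def safe_to_power_off(listing):
    """True when the listing shows no VM still running."""
    return not parse_running_vms(listing)


def stop_boinc_and_check(timeout_s=300, poll_s=5):
    """Ask the BOINC client to quit, then wait until no VM is left running.

    Returns True if all VMs stopped within the timeout, False otherwise.
    The timeout and polling interval are arbitrary illustrative values.
    """
    # `boinccmd --quit` tells the running client to exit.
    subprocess.run(["boinccmd", "--quit"], check=True)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        listing = subprocess.run(
            ["VBoxManage", "list", "runningvms"],
            capture_output=True, text=True, check=True,
        ).stdout
        if safe_to_power_off(listing):
            return True
        time.sleep(poll_s)
    return False
```

Call `stop_boinc_and_check()` before shutting down and only power off if it returns `True`; if it returns `False`, a VM is still listed as running and it is safer to wait or check by hand in the VirtualBox manager. Note that the check looks at every running VM visible to that user, not just BOINC's, so it will also hold you back if you have an unrelated VM open.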
©2024 CERN