Thread 'Interruption to CMS@Home, Wednesday 15th July'

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1165 Credit: 12,130,929 RAC: 9,235	Message 43057 - Posted: 14 Jul 2020, 13:19:28 UTC We need to interrupt the CMS project tomorrow to deploy a new Workflow Management Agent. This means that jobs will not be available from sometime tonight. We recommend that you set your CMS machines to No New Tasks as soon as possible, to avoid tasks terminating with an error if a job can't be fetched. We anticipate jobs will be available again late Wednesday (European time). I'll update this thread when it is OK to proceed.

Mr P Hucker Joined: 12 Aug 06 Posts: 483 Credit: 15,644,410 RAC: 10,900	Message 43061 - Posted: 15 Jul 2020, 11:41:19 UTC - in response to Message 43057. Why do we all have to set no new tasks? Surely you can just tell the server to stop handing them out?

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1165 Credit: 12,130,929 RAC: 9,235	Message 43063 - Posted: 15 Jul 2020, 15:07:17 UTC - in response to Message 43061. The way the system works, if you ask for CMS jobs when none are available then the BOINC task fails after ten minutes with an error, so any work already completed by that task is discarded and you get no credit for it. As well, each time a task fails, your daily BOINC task quota is reduced. Do that enough times and you will only be allocated one task per day. Your quota will only be increased again if/when you complete a successful task. No, I didn't write the BOINC server code...
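
The quota behaviour described in the post above can be pictured with a small toy model. This is only a sketch under stated assumptions, not the BOINC server code: the class name, the ceiling of 8 tasks per day, the decrement of one per failure, and the doubling on success are all hypothetical choices made for illustration. The post itself only says that failures reduce the quota down to one task per day and that a success raises it again.

```python
from dataclasses import dataclass


@dataclass
class HostQuota:
    """Toy per-host daily task quota (hypothetical, not BOINC's implementation)."""
    max_quota: int = 8   # assumed project-wide ceiling
    quota: int = 8       # current daily quota for this host

    def record_failure(self) -> None:
        # Each failed task lowers the quota, but never below one task per day.
        self.quota = max(1, self.quota - 1)

    def record_success(self) -> None:
        # A successful task raises the quota again; doubling is an assumption.
        self.quota = min(self.max_quota, self.quota * 2)


def simulate(outcomes, max_quota=8):
    """Return the quota after each task outcome (True = success, False = failure)."""
    host = HostQuota(max_quota=max_quota, quota=max_quota)
    history = []
    for ok in outcomes:
        if ok:
            host.record_success()
        else:
            host.record_failure()
        history.append(host.quota)
    return history


if __name__ == "__main__":
    # Ten failures in a row (e.g. no jobs available), then one success.
    print(simulate([False] * 10 + [True]))
```

In this model a host that keeps fetching tasks during an outage ends up at one task per day and only starts to recover after its next successful task, which is why setting No New Tasks ahead of the outage is the safer option.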

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1165 Credit: 12,130,929 RAC: 9,235	Message 43064 - Posted: 15 Jul 2020, 15:08:07 UTC Jobs are available again. There was a hiccup, but we hit our target with time to spare.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1165 Credit: 12,130,929 RAC: 9,235	Message 43067 - Posted: 15 Jul 2020, 21:08:21 UTC - in response to Message 43063. Last modified: 15 Jul 2020, 21:09:50 UTC I should have added: We do stop issuing tasks when we detect that there are no new jobs available. A problem that we've only just solved was that jobs were available but, unbeknownst to our software, they were flagged as not to run on volunteer machines. This led to the sort of scenario I described earlier, as BOINC thought jobs were available and kept serving up tasks. Notwithstanding that, remember that our tasks run for more than 12 hours (usually less than 18), running several jobs consecutively. If we run out of jobs mid-task, that can lead to a task being flagged as failed. This is why I try to give as much warning as possible of an upcoming outage, so that tasks can be left to finish up before jobs run out.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1561 Credit: 10,136,166 RAC: 1,208	Message 43078 - Posted: 16 Jul 2020, 8:41:33 UTC - in response to Message 43067. "... If we run out of jobs mid-task, that can lead to a task being flagged as failed. ..." This is very bad and should not happen. The LHC@home team should take a larger part in CMS@home and not leave Ivan to struggle alone. Could CMS not be set up like ATLAS and Theory, running a single job per BOINC task and receiving the needed input data in the shared host folder before the VM is started, rather than over the VM network?