no new WUs available

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245	Message 50416 - Posted: 17 Jun 2024, 17:53:14 UTC - in response to Message 50414. No work to do, but this workstation has loaded hundreds of wu s just for killing them: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10698193&offset=0&show_names=0&state=0&appid=11 Would be great to limit the work for this workstation or to resend these wu s I've noted a few cases like this, but there doesn't seem to be much we can do about it. The bright side is that [s]he's not harming the rest of the volunteer community, just wasting their own computer/electricity. Workunits (or tasks) are generated on the fly if/when jobs are available. They translate into a virtual machine (VM) instantiated under VirtualBox on the host machine, which then joins a HTCondor cluster and polls a condor server for a job. It seems from the printout that these tasks aren't even reaching that point so they are not "stealing" or in any other way misappropriating jobs that could run on more deserving hosts. ID: 50416 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,946,652 RAC: 18,983	Message 50422 - Posted: 20 Jun 2024, 3:48:07 UTC - in response to Message 50415. There seems to be a problem at CERN. Several WMAgents, including ours, are showing error status and I don't think we are generating jobs. A polite e-mail has been sent. Polite response: The CMSWEB team have been upgrading cmsweb-testbed frontends to a new technology and the redirect rules are still being polished (i.e. it looks like WM is still not fully functional).This transition started last Thursday. Sorry about that. Ivan, any news on this? ID: 50422 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,038	Message 50423 - Posted: 20 Jun 2024, 9:57:48 UTC - in response to Message 50422. Last modified: 21 Jun 2024, 11:12:09 UTC Ivan, any news on this? No news, but saw 194 Unsent on the server status page, so I tried to get one with success: created 20 Jun 2024, 9:25:02 UTC Inside job running on 4 cores and Ivan created them on the 14th of June. ID: 50423 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,946,652 RAC: 18,983	Message 50424 - Posted: 20 Jun 2024, 12:17:22 UTC - in response to Message 50423. all my hosts which I have programmed for CMS have been receiving tasks within the past 2 hours; the tasks are running okay, so obviously there are jobs, too. ID: 50424 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,946,652 RAC: 18,983	Message 50427 - Posted: 21 Jun 2024, 10:41:31 UTC - in response to Message 50424. all my hosts which I have programmed for CMS have been receiving tasks within the past 2 hours; the tasks are running okay, so obviously there are jobs, too. still receiving new tasks and jobs :-) So - good work that has been done over there to revive CMS :-) ID: 50427 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245	Message 50428 - Posted: 21 Jun 2024, 15:37:20 UTC Yes, we've successfully updated (almost?) everything to RHEL9 now -- we switched to a new instantiation of the Data-Bridge without anyone noticing! The final switch will be to the new WMAgent (vocms267 instead of vocms0267 -- confusing, right?) which will probably happen on Monday or so. I think I've got everything ready my side, Laurence has to modify a script or two on the BOINC end of things. ID: 50428 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245	Message 50431 - Posted: 22 Jun 2024, 10:11:36 UTC ...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267. ID: 50431 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245	Message 50454 - Posted: 25 Jun 2024, 15:21:55 UTC - in response to Message 50431. ...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267. We're being held up by an apparent authentication issue in querying the condor pool. Various experts are investigating. ID: 50454 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,946,652 RAC: 18,983	Message 50455 - Posted: 25 Jun 2024, 16:25:13 UTC - in response to Message 50454. Ivan, many thanks for the interim information :-) ID: 50455 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245	Message 50460 - Posted: 27 Jun 2024, 9:37:37 UTC - in response to Message 50454. Last modified: 27 Jun 2024, 9:40:14 UTC ...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267. We're being held up by an apparent authentication issue in querying the condor pool. Various experts are investigating. We might have stumbled onto an edge-case in the HTCondor API. This command used to work: [lxplus958:~] > condor_q -name vocms267.cern.ch -pool vocms0840.cern.ch -const 'CMS_JobType=?="Production"' -totals -- Failed to fetch ads from: <188.185.64.105:4080?addrs=[2001-1458-d00-1--100-85]-4080+188.185.64.105-4080&alias=vocms267.cern.ch&noUDP&sock=schedd_4178_56c6> : vocms267.cern.ch AUTHENTICATE:1003:Failed to authenticate with any method However if we add the username to the list of requirements, it then does give a result: [lxplus958:~] > condor_q -name vocms267.cern.ch -pool vocms0840.cern.ch -const 'CMS_JobType=?="Production"' -totals cmst1 -- Schedd: vocms267.cern.ch : <188.185.64.105:4080?... @ 06/27/24 11:38:26 Total for query: 2000 jobs; 0 completed, 0 removed, 2000 idle, 0 running, 0 held, 0 suspended Total for all users: 2000 jobs; 0 completed, 0 removed, 2000 idle, 0 running, 0 held, 0 suspended ID: 50460 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,946,652 RAC: 18,983	Message 50465 - Posted: 4 Jul 2024, 12:49:44 UTC Ivan - any idea when we might expect new tasks ? ID: 50465 · Reply Quote

Saturn911 Send message Joined: 3 Nov 12 Posts: 59 Credit: 142,193,076 RAC: 37,599	Message 50466 - Posted: 6 Jul 2024, 3:34:08 UTC - in response to Message 50465. Got some WUs tonight, but they all run without workload. e.g https://lhcathome.cern.ch/lhcathome/result.php?resultid=412434047 ID: 50466 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,946,652 RAC: 18,983	Message 50467 - Posted: 6 Jul 2024, 7:45:05 UTC - in response to Message 50466. Got some WUs tonight, but they all run without workload. e.g https://lhcathome.cern.ch/lhcathome/result.php?resultid=412434047 oh, that's too bad: tasks available, but no jobs available :-( ID: 50467 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,946,652 RAC: 18,983	Message 50468 - Posted: 7 Jul 2024, 6:22:43 UTC - in response to Message 50467. oh, that's too bad: tasks available, but no jobs available :-( tasks are still being distributed, but no jobs. I tested it here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=412465180 ID: 50468 · Reply Quote

hadron Send message Joined: 4 Sep 22 Posts: 92 Credit: 16,008,656 RAC: 9,877	Message 50469 - Posted: 7 Jul 2024, 16:35:07 UTC If you have not noticed it, all the CMS tasks are being reported immediately. Check the client_state.xml file in the boinc directory and you will find <report_immediately/> for every one of them. This is something I would not expect to see if the tasks included a data payload. That makes me wonder if what we are getting right now is some massive test of the software before actual work payloads are sent out. See Ivan's post of 27 June: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4209&postid=50460 ID: 50469 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245	Message 50475 - Posted: 10 Jul 2024, 15:44:48 UTC - in response to Message 50454. ...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267. We're being held up by an apparent authentication issue in querying the condor pool. Various experts are investigating. Well, that took a while! Sorry, but it's Summer, in Europe. (Could be worse -- could be August, in France!). Finally traced down to our new Agent refusing to accept IPv6 requests! One of our bevy of experts just reconfigured it to only use IPv4 and now jobs are being sent out to tasks (i.e. your VMs). I'll leave the analysis and final resolution to the rest of the experts[1] -- We ended up with eight different participants in the e-mail chain. [1] Expert: an ex- is a has-been; a spurt is a drip under pressure... ID: 50475 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 732 Credit: 49,367,266 RAC: 17,281	Message 50477 - Posted: 12 Jul 2024, 8:36:50 UTC The tasks and jobs have now run out. ID: 50477 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245	Message 50478 - Posted: 12 Jul 2024, 10:11:12 UTC - in response to Message 50477. Last modified: 12 Jul 2024, 10:20:23 UTC The tasks and jobs have now run out. Yes, I see. That was unexpected, this batch was due to last another couple of days. condor_q says there are 80 jobs still in the pool, but they are all running. I'll submit another batch while I investigate -- the WMAgent status web-page is not responding at the moment. ID: 50478 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 732 Credit: 49,367,266 RAC: 17,281	Message 50479 - Posted: 12 Jul 2024, 12:11:42 UTC Seems to be OK now. Got one task running Ok, one hour on the clock now. ID: 50479 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 245	Message 50480 - Posted: 12 Jul 2024, 12:27:26 UTC - in response to Message 50479. Seems to be OK now. Got one task running Ok, one hour on the clock now. Tja, we seem to be back to "normal" but I'll try to keep an eye on things over the weekend (perhaps not for a couple of hours on Sunday night...). ID: 50480 · Reply Quote

LHC@home