Message boards : CMS Application : no new WUs available

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 50416 - Posted: 17 Jun 2024, 17:53:14 UTC - in response to Message 50414.  

> No work to do, but this workstation has downloaded hundreds of WUs only to kill them:
>
> https://lhcathome.cern.ch/lhcathome/results.php?hostid=10698193&offset=0&show_names=0&state=0&appid=11
>
> It would be great to limit the work for this workstation or to resend these WUs.

I've noted a few cases like this, but there doesn't seem to be much we can do about it. The bright side is that [s]he's not harming the rest of the volunteer community, just wasting their own computer time and electricity. Workunits (or tasks) are generated on the fly if and when jobs are available. Each one translates into a virtual machine (VM) instantiated under VirtualBox on the host machine, which then joins an HTCondor cluster and polls a condor server for a job. It seems from the printout that these tasks aren't even reaching that point, so they are not "stealing" or in any other way misappropriating jobs that could run on more deserving hosts.
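
As a rough illustration of the task/job split described above, here is a toy Python model. It is only a sketch under stated assumptions: the function name `assign_jobs` and the job labels are hypothetical, and the real system uses VirtualBox VMs polling an HTCondor pool, not an in-memory queue.

```python
from collections import deque


def assign_jobs(num_tasks, jobs):
    """Toy model: each task (VM) polls the pool once for a job.

    Returns a dict mapping task number to the job it received,
    or None when the pool was already empty for that task.
    """
    pool = deque(jobs)  # hypothetical stand-in for the condor job queue
    assignment = {}
    for task in range(num_tasks):
        assignment[task] = pool.popleft() if pool else None
    return assignment


if __name__ == "__main__":
    # Three tasks start, but only two jobs are waiting in the pool.
    for task, job in assign_jobs(3, ["job-A", "job-B"]).items():
        print(f"task {task}: {job or 'no job available'}")
```

In this model a task that finds the pool empty still starts, but it never receives a job and so takes nothing away from other hosts.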
ID: 50416
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,652
RAC: 18,983
Message 50422 - Posted: 20 Jun 2024, 3:48:07 UTC - in response to Message 50415.  

> There seems to be a problem at CERN. Several WMAgents, including ours, are showing error status and I don't think we are generating jobs. A polite e-mail has been sent.
>
> Polite response:
> The CMSWEB team have been upgrading cmsweb-testbed frontends to a new technology and the redirect rules are still being polished (i.e. it looks like WM is still not fully functional). This transition started last Thursday.
>
> Sorry about that.

Ivan, any news on this?
ID: 50422
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 1,038
Message 50423 - Posted: 20 Jun 2024, 9:57:48 UTC - in response to Message 50422.  
Last modified: 21 Jun 2024, 11:12:09 UTC

> Ivan, any news on this?

No news, but I saw 194 Unsent on the server status page, so I tried to get one and succeeded: created 20 Jun 2024, 9:25:02 UTC.
The job inside is running on 4 cores; Ivan created the jobs on the 14th of June.
ID: 50423
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,652
RAC: 18,983
Message 50424 - Posted: 20 Jun 2024, 12:17:22 UTC - in response to Message 50423.  

All the hosts I have set up for CMS have received tasks within the past 2 hours; the tasks are running okay, so evidently there are jobs, too.
ID: 50424
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,652
RAC: 18,983
Message 50427 - Posted: 21 Jun 2024, 10:41:31 UTC - in response to Message 50424.  

> All the hosts I have set up for CMS have received tasks within the past 2 hours; the tasks are running okay, so evidently there are jobs, too.

Still receiving new tasks and jobs :-)
So, good work has been done over there to revive CMS :-)
ID: 50427
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 50428 - Posted: 21 Jun 2024, 15:37:20 UTC

Yes, we've successfully updated (almost?) everything to RHEL9 now -- we switched to a new instantiation of the Data-Bridge without anyone noticing! The final switch will be to the new WMAgent (vocms267 instead of vocms0267 -- confusing, right?), which will probably happen on Monday or so. I think I've got everything ready on my side; Laurence has to modify a script or two on the BOINC end of things.
ID: 50428
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 50431 - Posted: 22 Jun 2024, 10:11:36 UTC

...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267.
ID: 50431
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 50454 - Posted: 25 Jun 2024, 15:21:55 UTC - in response to Message 50431.  

> ...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267.

We're being held up by an apparent authentication issue in querying the condor pool. Various experts are investigating.
ID: 50454
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,652
RAC: 18,983
Message 50455 - Posted: 25 Jun 2024, 16:25:13 UTC - in response to Message 50454.  

Ivan, many thanks for the interim information :-)
ID: 50455
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 50460 - Posted: 27 Jun 2024, 9:37:37 UTC - in response to Message 50454.  
Last modified: 27 Jun 2024, 9:40:14 UTC

> > ...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267.
>
> We're being held up by an apparent authentication issue in querying the condor pool. Various experts are investigating.

We might have stumbled onto an edge case in the HTCondor API. This command used to work:

[lxplus958:~] > condor_q -name vocms267.cern.ch -pool vocms0840.cern.ch -const 'CMS_JobType=?="Production"' -totals

-- Failed to fetch ads from: <188.185.64.105:4080?addrs=[2001-1458-d00-1--100-85]-4080+188.185.64.105-4080&alias=vocms267.cern.ch&noUDP&sock=schedd_4178_56c6> : vocms267.cern.ch
AUTHENTICATE:1003:Failed to authenticate with any method

However, if we add the username to the list of requirements, it does give a result:

[lxplus958:~] > condor_q -name vocms267.cern.ch -pool vocms0840.cern.ch -const 'CMS_JobType=?="Production"' -totals cmst1

-- Schedd: vocms267.cern.ch : <188.185.64.105:4080?... @ 06/27/24 11:38:26
Total for query: 2000 jobs; 0 completed, 0 removed, 2000 idle, 0 running, 0 held, 0 suspended
Total for all users: 2000 jobs; 0 completed, 0 removed, 2000 idle, 0 running, 0 held, 0 suspended
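
For anyone who wants to script this check, the following Python sketch assembles the same two condor_q command lines and parses the "Total for query" summary line shown above. The helper names `build_condor_q` and `parse_totals` are hypothetical, the flags are copied from the commands in this post, and the parser assumes the summary line keeps the format printed here. It does not run condor_q itself.

```python
import re
import shlex


def build_condor_q(schedd, pool, constraint, owner=None):
    """Assemble the condor_q argument list; `owner` is the optional trailing username."""
    args = ["condor_q", "-name", schedd, "-pool", pool,
            "-const", constraint, "-totals"]
    if owner:
        args.append(owner)
    return args


# Matches e.g. "Total for query: 2000 jobs; 0 completed, 0 removed, ..."
TOTALS_RE = re.compile(r"Total for query: (\d+) jobs; (.*)")


def parse_totals(line):
    """Turn the summary line into a dict of counts, or None if it does not match."""
    match = TOTALS_RE.match(line.strip())
    if match is None:
        return None
    totals = {"jobs": int(match.group(1))}
    for part in match.group(2).split(","):
        count, state = part.split()
        totals[state] = int(count)
    return totals


if __name__ == "__main__":
    constraint = 'CMS_JobType=?="Production"'
    # Print both variants: without and with the username appended.
    for owner in (None, "cmst1"):
        args = build_condor_q("vocms267.cern.ch", "vocms0840.cern.ch", constraint, owner)
        print(shlex.join(args))
    line = ("Total for query: 2000 jobs; 0 completed, 0 removed, "
            "2000 idle, 0 running, 0 held, 0 suspended")
    print(parse_totals(line))
```

Passing the argument list to a process runner avoids shell-quoting problems with the constraint string; `shlex.join` is used here only to display the command.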


ID: 50460
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,652
RAC: 18,983
Message 50465 - Posted: 4 Jul 2024, 12:49:44 UTC

Ivan, any idea when we might expect new tasks?
ID: 50465
Saturn911
Joined: 3 Nov 12
Posts: 59
Credit: 142,193,076
RAC: 37,599
Message 50466 - Posted: 6 Jul 2024, 3:34:08 UTC - in response to Message 50465.  

Got some WUs tonight, but they all run without a workload, e.g.:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=412434047
ID: 50466
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,652
RAC: 18,983
Message 50467 - Posted: 6 Jul 2024, 7:45:05 UTC - in response to Message 50466.  

> Got some WUs tonight, but they all run without a workload, e.g.:
> https://lhcathome.cern.ch/lhcathome/result.php?resultid=412434047

Oh, that's too bad: tasks available, but no jobs available :-(
ID: 50467
Erich56
Joined: 18 Dec 15
Posts: 1821
Credit: 118,946,652
RAC: 18,983
Message 50468 - Posted: 7 Jul 2024, 6:22:43 UTC - in response to Message 50467.  

> Oh, that's too bad: tasks available, but no jobs available :-(

Tasks are still being distributed, but no jobs. I tested it here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=412465180
ID: 50468
hadron
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 9,877
Message 50469 - Posted: 7 Jul 2024, 16:35:07 UTC

In case you have not noticed, all the CMS tasks are being reported immediately. Check the client_state.xml file in the BOINC directory and you will find <report_immediately/> for every one of them. This is something I would not expect to see if the tasks included a data payload.
That makes me wonder whether what we are getting right now is some massive test of the software before actual work payloads are sent out. See Ivan's post of 27 June: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4209&postid=50460
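
A quick way to check this without reading the file by hand is sketched below in Python. It rests on assumptions: that each task appears as a `<result>` element with a `<name>` child and an empty `<report_immediately/>` child when the flag is set, and that the file parses as well-formed XML. The function name and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for client_state.xml.
SAMPLE = """<client_state>
  <result>
    <name>CMS_example_1_0</name>
    <report_immediately/>
  </result>
  <result>
    <name>other_example_2_0</name>
  </result>
</client_state>"""


def results_reported_immediately(xml_text):
    """Return the names of <result> entries that carry <report_immediately/>."""
    root = ET.fromstring(xml_text)
    return [
        result.findtext("name", default="?")
        for result in root.iter("result")
        if result.find("report_immediately") is not None
    ]


if __name__ == "__main__":
    print(results_reported_immediately(SAMPLE))
```

To try it on a real file, read client_state.xml into a string and pass that in place of `SAMPLE`.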
ID: 50469
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 50475 - Posted: 10 Jul 2024, 15:44:48 UTC - in response to Message 50454.  

> > ...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267.
>
> We're being held up by an apparent authentication issue in querying the condor pool. Various experts are investigating.

Well, that took a while! Sorry, but it's summer in Europe. (Could be worse -- could be August in France!) It was finally traced to our new Agent refusing to accept IPv6 requests! One of our bevy of experts just reconfigured it to use only IPv4, and now jobs are being sent out to tasks (i.e. your VMs). I'll leave the analysis and final resolution to the rest of the experts[1] -- we ended up with eight different participants in the e-mail chain.

[1] Expert: an ex- is a has-been; a spurt is a drip under pressure...
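
To see which address families a host name resolves to, something like the following Python sketch can help. It uses only the standard `socket` module, and the function name `addresses_by_family` is hypothetical. It performs name resolution only, so it says nothing about whether the service behind an address accepts the request.

```python
import socket


def addresses_by_family(host, port):
    """Return the distinct IPv4 and IPv6 addresses that `host` resolves to."""
    families = {"IPv4": socket.AF_INET, "IPv6": socket.AF_INET6}
    found = {}
    for label, family in families.items():
        try:
            infos = socket.getaddrinfo(
                host, port, family=family, type=socket.SOCK_STREAM
            )
        except socket.gaierror:
            infos = []  # no address of this family could be resolved
        # Each entry ends with a sockaddr tuple whose first field is the address.
        found[label] = sorted({info[4][0] for info in infos})
    return found
```

Calling `addresses_by_family("vocms267.cern.ch", 4080)` from a machine that can resolve CERN host names would list the addresses a client might try for each family; a host that publishes both gives clients the option of connecting over IPv6.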
ID: 50475
Harri Liljeroos
Joined: 28 Sep 04
Posts: 732
Credit: 49,367,266
RAC: 17,281
Message 50477 - Posted: 12 Jul 2024, 8:36:50 UTC

The tasks and jobs have now run out.
ID: 50477
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 50478 - Posted: 12 Jul 2024, 10:11:12 UTC - in response to Message 50477.  
Last modified: 12 Jul 2024, 10:20:23 UTC

> The tasks and jobs have now run out.

Yes, I see. That was unexpected; this batch was due to last another couple of days. condor_q says there are 80 jobs still in the pool, but they are all running. I'll submit another batch while I investigate -- the WMAgent status web-page is not responding at the moment.
ID: 50478
Harri Liljeroos
Joined: 28 Sep 04
Posts: 732
Credit: 49,367,266
RAC: 17,281
Message 50479 - Posted: 12 Jul 2024, 12:11:42 UTC

Seems to be OK now. Got one task running OK, one hour on the clock now.
ID: 50479
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 245
Message 50480 - Posted: 12 Jul 2024, 12:27:26 UTC - in response to Message 50479.  

> Seems to be OK now. Got one task running OK, one hour on the clock now.

Tja, we seem to be back to "normal" but I'll try to keep an eye on things over the weekend (perhaps not for a couple of hours on Sunday night...).
ID: 50480

©2024 CERN