Message boards : CMS Application : no new WUs available
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 245
> No work to do, but this workstation has loaded hundreds of WUs just for killing them.
I've noted a few cases like this, but there doesn't seem to be much we can do about it. The bright side is that [s]he's not harming the rest of the volunteer community, just wasting their own computer/electricity. Workunits (or tasks) are generated on the fly if/when jobs are available. Each one translates into a virtual machine (VM) instantiated under VirtualBox on the host machine, which then joins an HTCondor cluster and polls a condor server for a job. It seems from the printout that these tasks aren't even reaching that point, so they are not "stealing" or in any other way misappropriating jobs that could run on more deserving hosts.
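For anyone who wants to watch that matching step from the outside, the idle Production jobs waiting in the pool can be counted with a standard condor_q query. A minimal sketch, using the schedd and pool host names that appear later in this thread (adjust them if the infrastructure has moved on):

    # Count idle CMS "Production" jobs waiting for volunteer VMs.
    # vocms267.cern.ch (schedd) and vocms0840.cern.ch (pool) are taken
    # from posts further down this thread; they may change over time.
    condor_q -name vocms267.cern.ch -pool vocms0840.cern.ch \
             -const 'CMS_JobType =?= "Production"' -totals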
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,946,652 · RAC: 18,983
> Ivan, any news on this?
There seems to be a problem at CERN. Several WMAgents, including ours, are showing error status and I don't think we are generating jobs. A polite e-mail has been sent.
Joined: 14 Jan 10 · Posts: 1422 · Credit: 9,484,585 · RAC: 1,038
> Ivan, any news on this?
No news, but I saw 194 Unsent on the server status page, so I tried to fetch one, with success: created 20 Jun 2024, 9:25:02 UTC. The inside job is running on 4 cores, and Ivan created them on the 14th of June.
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,946,652 · RAC: 18,983
All my hosts that I have set up for CMS have been receiving tasks within the past 2 hours; the tasks are running okay, so obviously there are jobs, too.
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,946,652 · RAC: 18,983
> All my hosts that I have set up for CMS have been receiving tasks within the past 2 hours; the tasks are running okay, so obviously there are jobs, too.
Still receiving new tasks and jobs :-) So, good work over there reviving CMS :-)
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 245
Yes, we've successfully updated (almost?) everything to RHEL9 now -- we switched to a new instantiation of the Data-Bridge without anyone noticing! The final switch will be to the new WMAgent (vocms267 instead of vocms0267 -- confusing, right?), which will probably happen on Monday or so. I think I've got everything ready on my side; Laurence has to modify a script or two on the BOINC end of things.
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 245
...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267. We're being held up by an apparent authentication issue in querying the condor pool. Various experts are investigating.
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,946,652 · RAC: 18,983
Ivan, many thanks for the interim information :-)
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 245
> ...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267.
We might have stumbled onto an edge case in the HTCondor API. This command used to work:

    [lxplus958:~] > condor_q -name vocms267.cern.ch -pool vocms0840.cern.ch -const 'CMS_JobType=?="Production"' -totals
    -- Failed to fetch ads from: <188.185.64.105:4080?addrs=[2001-1458-d00-1--100-85]-4080+188.185.64.105-4080&alias=vocms267.cern.ch&noUDP&sock=schedd_4178_56c6> : vocms267.cern.ch
    AUTHENTICATE:1003:Failed to authenticate with any method

However, if we add the username to the list of requirements, it then does give a result:

    [lxplus958:~] > condor_q -name vocms267.cern.ch -pool vocms0840.cern.ch -const 'CMS_JobType=?="Production"' -totals cmst1
    -- Schedd: vocms267.cern.ch : <188.185.64.105:4080?... @ 06/27/24 11:38:26
    Total for query: 2000 jobs; 0 completed, 0 removed, 2000 idle, 0 running, 0 held, 0 suspended
    Total for all users: 2000 jobs; 0 completed, 0 removed, 2000 idle, 0 running, 0 held, 0 suspended
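One way to dig further into which authentication methods are being tried is to rerun the failing query with client-side debugging turned on. A minimal sketch (-debug is a standard flag of the HTCondor command-line tools; the grep is just to cut the noise down to the authentication lines):

    # Rerun the failing query with debug output to see which
    # authentication methods the client actually attempts.
    condor_q -debug -name vocms267.cern.ch -pool vocms0840.cern.ch \
             -const 'CMS_JobType=?="Production"' -totals 2>&1 | grep -i auth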
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,946,652 · RAC: 18,983
Ivan - any idea when we might expect new tasks?
Joined: 3 Nov 12 · Posts: 59 · Credit: 142,193,076 · RAC: 37,599
Got some WUs tonight, but they all run without a workload, e.g. https://lhcathome.cern.ch/lhcathome/result.php?resultid=412434047
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,946,652 · RAC: 18,983
> Got some WUs tonight, but they all run without a workload.
Oh, that's too bad: tasks available, but no jobs available :-(
Joined: 18 Dec 15 · Posts: 1821 · Credit: 118,946,652 · RAC: 18,983
> Oh, that's too bad: tasks available, but no jobs available :-(
Tasks are still being distributed, but no jobs. I tested it here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=412465180
Joined: 4 Sep 22 · Posts: 92 · Credit: 16,008,656 · RAC: 9,877
If you have not noticed it, all the CMS tasks are being reported immediately. Check the client_state.xml file in the BOINC data directory and you will find <report_immediately/> for every one of them. This is something I would not expect to see if the tasks included a data payload. That makes me wonder if what we are getting right now is some massive test of the software before actual work payloads are sent out. See Ivan's post of 27 June: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4209&postid=50460
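If you want to check this on your own host, a one-liner does it. A minimal sketch, assuming a typical Linux BOINC data directory (the path differs on other platforms and installations):

    # Count results flagged for immediate reporting in the client state.
    # /var/lib/boinc-client is the usual Debian/Ubuntu data directory;
    # adjust the path for your own installation.
    grep -c '<report_immediately/>' /var/lib/boinc-client/client_state.xml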
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 245
> ...and I see that I've managed to submit a new workflow to the replacement WMAgent. It now remains for the BOINC team to have the task generator recognise this and create tasks pointing to the condor pool, and generally switch everything from vocms0267 to vocms267.
Well, that took a while! Sorry, but it's Summer, in Europe. (Could be worse -- could be August, in France!) The problem was finally traced to our new Agent refusing to accept IPv6 requests! One of our bevy of experts just reconfigured it to use only IPv4, and now jobs are being sent out to tasks (i.e. your VMs). I'll leave the analysis and final resolution to the rest of the experts[1] -- we ended up with eight different participants in the e-mail chain.
[1] Expert: an ex- is a has-been; a spurt is a drip under pressure...
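A note on where such a switch typically lives: HTCondor has per-protocol configuration knobs, and their effective values on a host can be read back with a standard admin tool. A minimal sketch (this assumes the fix was made at the condor configuration level, which the post above doesn't actually confirm):

    # ENABLE_IPV4 / ENABLE_IPV6 are standard HTCondor configuration
    # knobs; an IPv4-only daemon would report true / false here.
    condor_config_val ENABLE_IPV4 ENABLE_IPV6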
Joined: 28 Sep 04 · Posts: 732 · Credit: 49,367,266 · RAC: 17,281
The tasks and jobs have now run out.
Joined: 29 Aug 05 · Posts: 1061 · Credit: 7,737,455 · RAC: 245
> The tasks and jobs have now run out.
Yes, I see. That was unexpected; this batch was due to last another couple of days. condor_q says there are 80 jobs still in the pool, but they are all running. I'll submit another batch while I investigate -- the WMAgent status web-page is not responding at the moment.
Joined: 28 Sep 04 · Posts: 732 · Credit: 49,367,266 · RAC: 17,281
Seems to be OK now. Got one task running OK, one hour on the clock now.