Message boards : CMS Application : no new WUs available

Profile rbpeake

Send message
Joined: 17 Sep 04
Posts: 105
Credit: 32,824,862
RAC: 59
Message 50481 - Posted: 12 Jul 2024, 14:23:11 UTC - in response to Message 50480.  

Is the system set up to provide continuous work without human intervention? I thought that was the goal?
Thanks.
Regards,
Bob P.
ID: 50481 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 50482 - Posted: 12 Jul 2024, 18:06:07 UTC - in response to Message 50481.  

Is the system set up to provide continuous work without human intervention? I thought that was the goal?
Thanks.

Not at the moment, although ultimately we would like to do that. There are so many variables to keep track of and to assess that it's hard to just let the system run on autopilot. As we have seen in the past weeks, there are so many steps in the chain of getting from a job specification to a body of results in storage that interruption to any one ripples down to a break in service. We also need to be able to plan ahead for any interruptions (like the periodic updates to WMAgent code) so we don't want to build up a backlog of jobs that would take weeks to clear, and thus I try to only commit a few days' worth of jobs at a time. Naturally, I sometimes get caught out on this -- sleeping in too late on a Sunday, for example. [Hmm, we've got the Lionesses playing Ireland in an hour or so, and the Lions vs. España Sunday night; I'd better check my plans for this weekend!]
ID: 50482 · Report as offensive     Reply Quote
Profile Magic Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 1177
Credit: 54,887,670
RAC: 3,877
Message 50501 - Posted: 24 Jul 2024, 2:17:11 UTC

How is this possible, running 4-core CMS tasks hundreds of times?


I was just looking around to see how they are running here.
ID: 50501 · Report as offensive     Reply Quote
Harri Liljeroos
Avatar

Send message
Joined: 28 Sep 04
Posts: 732
Credit: 49,373,095
RAC: 13,741
Message 50502 - Posted: 24 Jul 2024, 21:25:47 UTC

Ready to Send queue is dropping; we are down to 90 tasks at the moment. Minor hiccup or running out? Running jobs are still at about 330.
ID: 50502 · Report as offensive     Reply Quote
Saturn911

Send message
Joined: 3 Nov 12
Posts: 59
Credit: 142,193,076
RAC: 32,238
Message 50503 - Posted: 25 Jul 2024, 13:23:18 UTC - in response to Message 50501.  
Last modified: 25 Jul 2024, 13:26:46 UTC

How is this possible, running 4-core CMS tasks hundreds of times?


I was just looking around to see how they are running here.

1100 credits for 2 minutes of CPU.
One of the best investments I've ever seen.
And this without any scientific benefit.
A typical "thanks for nothing".
Or just a waste of time...
ID: 50503 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 50511 - Posted: 26 Jul 2024, 16:09:42 UTC - in response to Message 50502.  
Last modified: 26 Jul 2024, 17:16:10 UTC

Ready to Send queue is dropping; we are down to 90 tasks at the moment. Minor hiccup or running out? Running jobs are still at about 330.

That was probably because I'd not got the next workflow into the queue before the current one started to eat into its backlog. We've been seeing an increase in the number of running jobs this week (from ~240 up to ~500) so I've started submitting 5,000 jobs/batch instead of 2,000, to give me a better chance of catching workflows before they run out. We also had an anomaly last weekend due to one host with a network problem -- it burnt through about 600 jobs due to not being able to connect to the frontier (conditions database) servers.
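
The batch-size change can be put in rough numbers. The sketch below is only an illustration: the function names are hypothetical, and the mean job duration is an assumed parameter, since the thread does not state it. It counts how many "rounds" of work a batch provides for a given number of concurrently running jobs.

```python
def rounds_of_work(batch_size: int, running_jobs: int) -> float:
    """Average number of jobs each running slot must complete to finish the batch."""
    return batch_size / running_jobs


def hours_of_cover(batch_size: int, running_jobs: int, mean_job_hours: float) -> float:
    """Rough time until a batch is exhausted.

    mean_job_hours is an assumption supplied by the caller; it is not
    a figure given in the thread.
    """
    return rounds_of_work(batch_size, running_jobs) * mean_job_hours


if __name__ == "__main__":
    # Figures quoted in the post: ~240 -> ~500 running jobs, 2,000 -> 5,000 jobs/batch.
    for batch, running in [(2000, 240), (2000, 500), (5000, 500)]:
        print(f"{batch} jobs / {running} running = {rounds_of_work(batch, running):.1f} rounds")
```

With about 500 jobs running, a 2,000-job batch covers only 4 rounds of work per slot, compared with about 8 at 240 running jobs; a 5,000-job batch restores the margin to 10 rounds.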
ID: 50511 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,735
RAC: 18,277
Message 50524 - Posted: 27 Jul 2024, 5:25:50 UTC

Ivan,

right now the situation is that jobs are available, but no tasks are available to process the jobs :-)
ID: 50524 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,735
RAC: 18,277
Message 50525 - Posted: 27 Jul 2024, 6:31:13 UTC - in response to Message 50524.  

Ivan,

right now the situation is that jobs are available, but no tasks are available to process the jobs :-)
edit: meanwhile, no jobs either
ID: 50525 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 50526 - Posted: 27 Jul 2024, 11:23:16 UTC - in response to Message 50525.  

Ivan,

right now the situation is that jobs are available, but no tasks are available to process the jobs :-)
edit: meanwhile, no jobs either

Sorry, I was sleeping. Yes, our WMAgent has died. I've alerted the appropriate message board at CERN.
ID: 50526 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 50527 - Posted: 27 Jul 2024, 12:57:48 UTC - in response to Message 50526.  

Agent restarted. 800 jobs transferred to pool, 51 already running...
ID: 50527 · Report as offensive     Reply Quote
hadron

Send message
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 8,102
Message 50528 - Posted: 27 Jul 2024, 13:43:33 UTC

There are new tasks, yes, but so far they all seem to be more of those do-nothing tasks.
ID: 50528 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 23,290
Message 50529 - Posted: 27 Jul 2024, 14:30:13 UTC - in response to Message 50527.  

Got fresh tasks/jobs and so far all of them are running fine.
ID: 50529 · Report as offensive     Reply Quote
hadron

Send message
Joined: 4 Sep 22
Posts: 92
Credit: 16,008,656
RAC: 8,102
Message 50531 - Posted: 28 Jul 2024, 0:13:27 UTC - in response to Message 50529.  

Got fresh tasks/jobs and so far all of them are running fine.

I was premature with my comment about all of them being do-nothing tasks.
There were only a few at first (likely old stuff being sent out again for whatever reason), and then the "real" tasks started up here ---
then I got busy doing other stuff, and have only got back here now to report.
ID: 50531 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 852
Message 50532 - Posted: 28 Jul 2024, 6:13:18 UTC - in response to Message 50529.  

The CMS patch activated last night affects the process inside the VM.
It has nothing to do with BOINC (especially the work fetch).
Hence, BOINC-related issues are not caused by the CMS patch.
Sometimes coincidences happen, but not getting CMS tasks although there are enough tasks in the queue has happened even when there was no work at all on that BOINC instance.
Since the queue went down to zero and Ivan refilled the batch with jobs, causing BOINC to create new tasks, I finally got a CMS-task.

https://lhcathome.cern.ch/lhcathome/results.php?hostid=10793807
ID: 50532 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 50533 - Posted: 28 Jul 2024, 13:21:32 UTC - in response to Message 50532.  

The WMAgent died (why does this always happen on a weekend?) because of a miscommunication between the agent and the condor server. This meant no new jobs were injected into the condor pool, so it drained of jobs as they finished and there were no fresh jobs to be allocated. This meant that the BOINC server was seeing no new jobs in the pool, so it stopped allocating new tasks to Volunteer machines. When I notified CERN of the problem, the affected WMAgent component was restarted, and jobs from the current (unfinished) workflow were again submitted to the pool. When the BOINC server next checked the job pool it saw available jobs again and so restarted Task creation. The new tasks (VMs in host machines) were then able to acquire jobs and restart processing.
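
The chain of events described above can be sketched as a toy model. Everything here is a hypothetical illustration (class, method, and field names are invented, and the job counts are arbitrary); it is not the project's actual WMAgent, condor, or BOINC code. It only shows why a stalled agent eventually stops task creation and why a restart resumes it.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    """Toy model of the WMAgent -> condor pool -> BOINC task chain."""
    agent_alive: bool = True
    workflow_backlog: int = 0    # jobs in the workflow not yet injected
    pool_jobs: int = 0           # jobs waiting in the condor pool
    creating_tasks: bool = False

    def agent_inject(self, n: int) -> None:
        # A dead agent injects nothing, so the pool can only drain.
        if not self.agent_alive:
            return
        moved = min(n, self.workflow_backlog)
        self.workflow_backlog -= moved
        self.pool_jobs += moved

    def volunteers_consume(self, n: int) -> int:
        # Running VMs pick up jobs from the pool as earlier ones finish.
        taken = min(n, self.pool_jobs)
        self.pool_jobs -= taken
        return taken

    def boinc_check(self) -> bool:
        # The BOINC server creates tasks only while it sees jobs in the pool.
        self.creating_tasks = self.pool_jobs > 0
        return self.creating_tasks


def simulate() -> list:
    p = Pipeline(workflow_backlog=1000)
    history = []
    p.agent_inject(300)
    history.append(("running", p.boinc_check()))
    p.agent_alive = False        # the agent dies
    p.agent_inject(300)          # no effect while the agent is down
    p.volunteers_consume(300)    # the pool drains
    history.append(("agent dead", p.boinc_check()))
    p.agent_alive = True         # the affected component is restarted
    p.agent_inject(300)
    history.append(("restarted", p.boinc_check()))
    return history


if __name__ == "__main__":
    for stage, creating in simulate():
        print(f"{stage}: task creation {'on' if creating else 'off'}")
```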
ID: 50533 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 50559 - Posted: 20 Aug 2024, 14:10:58 UTC

We've run into a problem with the WMAgent and need to drain the queues. Fortunately we are near the end of a batch of jobs, so we should be able to apply the fix tomorrow. Take the usual countermeasures to cope with a short period of job unavailability starting in 12 hours or so.
ID: 50559 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 50562 - Posted: 21 Aug 2024, 15:31:22 UTC - in response to Message 50559.  

We've run into a problem with the WMAgent and need to drain the queues. Fortunately we are near the end of a batch of jobs, so we should be able to apply the fix tomorrow. Take the usual countermeasures to cope with a short period of job unavailability starting in 12 hours or so.

Intervention is over and jobs are available in the pool -- 18 have already been picked up by volunteer hosts.
ID: 50562 · Report as offensive     Reply Quote
Harri Liljeroos
Avatar

Send message
Joined: 28 Sep 04
Posts: 732
Credit: 49,373,095
RAC: 13,741
Message 50625 - Posted: 14 Sep 2024, 8:55:12 UTC

We have run out of CMS work.
ID: 50625 · Report as offensive     Reply Quote
Erich56

Send message
Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,735
RAC: 18,277
Message 50657 - Posted: 29 Sep 2024, 3:08:39 UTC

the queue has run dry
ID: 50657 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 307
Message 50843 - Posted: 19 Oct 2024, 7:14:55 UTC - in response to Message 50657.  

the queue has run dry
ID: 50843 · Report as offensive     Reply Quote

©2024 CERN