1) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47726)
Posted 11 days ago by ivan:
A new workflow is in the pipeline. Jobs should be available in about one hour.
2) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47718)
Posted 13 days ago by ivan:
Updated calculation: the queues should be almost drained in 24 hours from now.
3) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47713)
Posted 15 days ago by ivan:
I'll be running down the queues at the weekend in preparation for a WMAgent upgrade next week. Be prepared to set No New Tasks on Sunday or so.

Oops, I miscalculated (based on 10,000 jobs rather than 20,000) -- we'll run out of jobs Monday into Tuesday.
4) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47712)
Posted 16 days ago by ivan:
I'll be running down the queues at the weekend in preparation for a WMAgent upgrade next week. Be prepared to set No New Tasks on Sunday or so.
5) Message boards : CMS Application : CMS computation error in 30 seconds every time (Message 47693)
Posted 18 days ago by ivan:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=377315268
2023-01-15 04:45:14 (2360): Guest Log: [INFO] Requesting an idtoken from LHC@home
2023-01-15 05:22:19 (2360): Guest Log: [INFO] glidein exited with return value 0.
6) Message boards : CMS Application : CMS computation error in 30 seconds every time (Message 47665)
Posted 23 days ago by ivan:
Rough estimation: 30. But I guess it depends on CPU speed. I have 126 cores :-/

From your latest good task:
2023-01-11 20:04:50 (7092): Guest Log: [INFO] Could not find a local HTTP proxy
2023-01-11 20:04:50 (7092): Guest Log: [INFO] CVMFS and Frontier will have to use DIRECT connections
2023-01-11 20:04:50 (7092): Guest Log: [INFO] This makes the application less efficient
2023-01-11 20:04:50 (7092): Guest Log: [INFO] It also puts higher load on the project servers
2023-01-11 20:04:50 (7092): Guest Log: [INFO] Setting up a local HTTP proxy is highly recommended
2023-01-11 20:04:51 (7092): Guest Log: [INFO] Advice can be found in the project forum

Remember that you download a lot of data as well. If you had a local squid proxy you could greatly reduce your downloads and the consequent time spent. I think the instructions are in the Number Crunching forum.
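For anyone who wants to try the squid route before finding those instructions, here is a minimal squid.conf sketch; the port, network range and cache sizes are illustrative placeholders only, not project-endorsed values, and the project's own guide in the Number Crunching forum should take precedence. Pointing the BOINC client's HTTP proxy settings at this host and port is the usual way to let the VM find it.

# Minimal caching-proxy sketch for a home LAN -- illustrative values only.
# Listen on the default squid port; point the BOINC client's HTTP proxy here.
http_port 3128
# Allow only your own LAN (adjust the range) and deny everything else.
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all
# Modest in-memory cache plus a ~20 GB on-disk cache; allow large objects.
cache_mem 256 MB
maximum_object_size 1 GB
cache_dir ufs /var/spool/squid 20000 16 256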
7) Message boards : CMS Application : CMS computation error in 30 seconds every time (Message 47658)
Posted 23 days ago by ivan:
In general, yes. This runs at two levels -- WMAgent produces jobs and sends them to the HTCondor server. Your VM instances (i.e. BOINC tasks) ask the condor server for a job. If that job terminates with an error, or your VM goes out of contact with the server for too long (currently two hours), the condor server requeues it and sends it to a new VM when the queueing allows. If there are several errors for the one job (currently three, IIRC), the condor server notifies the WMAgent, which then itself requeues the job for future resubmission back to the condor server. If the job terminates without error, then the VM will ask for another job, up until 12+ hours have elapsed in total.

Unfortunately the reason is not reported back to higher script levels, hence BOINC gets a 'success'.

That's worrying, they're marked as valid in my list of tasks on the server. Does the system later notice a problem and resend those tasks? Or does it royally screw up the science?

No, if you look at the job graphs from the homepage, you will see 5-10% job failures, but these are the primary failures seen by condor. In the (unfortunately non-public) monitoring we see the ultimate failure rate is essentially zero for every 20,000-job workflow submission. We do tend to be generous and allow credit for CPU time given even if we could detect a failure -- as alluded to above -- but egregious errors will not get any credit.

I will of course be keeping an eye on the run time vs CPU time on the server list of tasks to make sure mine behave and are useful.

You should, ideally, be seeing task logs that look like mine -- https://lhcathome.cern.ch/lhcathome/results.php?userid=14095 -- tasks running for 12 hours or so with CPU time slightly less. Each task (VM instance) is therefore running 5 or 6 two-hour CMS jobs before terminating to allow BOINC to start a new task.
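As a rough illustration of the flow described above, here is a small Python sketch. It is not project code: the class and function names are invented, and only the two-hour contact limit, the three-error limit and the ~12-hour task lifetime come from the post.

# Toy model of the WMAgent -> HTCondor -> VM flow described above.
# Not project code: names and structure are invented for illustration.
from dataclasses import dataclass

MAX_CONDOR_RETRIES = 3   # errors before condor hands the job back to WMAgent
TASK_LIFETIME_H = 12     # a VM stops asking for new jobs after ~12 hours

@dataclass
class Job:
    job_id: int
    errors: int = 0

class CondorQueue:
    def __init__(self, jobs):
        self.pending = list(jobs)       # jobs waiting for a VM
        self.back_to_wmagent = []       # exhausted jobs, requeued for resubmission

    def next_job(self):
        return self.pending.pop(0) if self.pending else None

    def report_failure(self, job):
        # Also what happens when a VM stays out of contact for over two hours.
        job.errors += 1
        if job.errors >= MAX_CONDOR_RETRIES:
            self.back_to_wmagent.append(job)   # WMAgent will resubmit later
        else:
            self.pending.append(job)           # try again on another VM

def run_vm(queue, hours_per_job=2):
    """One BOINC task (VM instance): fetch jobs until ~12 hours have elapsed."""
    elapsed = 0
    while elapsed < TASK_LIFETIME_H:
        job = queue.next_job()
        if job is None:                 # nothing left -> EXIT_NO_SUB_TASKS
            break
        elapsed += hours_per_job        # pretend the job ran and succeeded
    return elapsed

if __name__ == "__main__":
    q = CondorQueue(Job(i) for i in range(20))
    print("VM ran for", run_vm(q), "hours;", len(q.pending), "jobs still queued")

With 2-hour jobs the toy VM retires after six of them, matching the "5 or 6 2-hour CMS jobs" per 12-hour task mentioned above.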
8) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47629)
Posted 30 Dec 2022 by ivan:
Anybody there to refill the CMS queue this year?

Jobs are available again now.
9) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47628)
Posted 30 Dec 2022 by ivan:
Anybody there to refill the CMS queue this year?

Yes, it's just that we've been running more jobs than usual lately, and I have been taking it easy... New batch in the pipeline.
10) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47581)
Posted 7 Dec 2022 by ivan:
A new workflow submitted last night is stuck in "staging" rather than progressing to "running", so we are running out of jobs. Probably best to set No New Tasks until it's sorted. I've submitted another workflow, but I don't think it will bypass the older one in the queue. The WMCore team have been notified.

The underlying cause is detailed in https://github.com/dmwm/WMCore/issues/11386 and the cure in https://github.com/dmwm/WMCore/pull/11387. All my machines are now running a quota of tasks.
11) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47580)
Posted 7 Dec 2022 by ivan:
A new workflow submitted last night is stuck in "staging" rather than progressing to "running", so we are running out of jobs. Probably best to set No New Tasks until it's sorted. I've submitted another workflow, but I don't think it will bypass the older one in the queue. The WMCore team have been notified.

The logjam has just been cleared and the task server has noticed that jobs are available and is sending out tasks again -- I've just got two on my first machine.
12) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47577)
Posted 6 Dec 2022 by ivan:
A new workflow submitted last night is stuck in "staging" rather than progressing to "running", so we are running out of jobs. Probably best to set No New Tasks until it's sorted. I've submitted another workflow, but I don't think it will bypass the older one in the queue. The WMCore team have been notified. More later.
13) Message boards : CMS Application : CMS Simulation error (Message 47564)
Posted 25 Nov 2022 by ivan:
From one of those responsible: Exit code 127 usually means that the executable wasn't found, which gives a big clue.

We've reverted the change that garbled our glidein script -- I'm running main and -dev jobs successfully now.
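A quick way to see why 127 points at a missing executable: POSIX shells reserve exit status 127 for "command not found". The small stand-alone Python check below (not part of the glidein) demonstrates this.

# Exit status 127 is the shell's "command not found" code (POSIX convention),
# which is why it hints that the glidein's executable was never located.
import subprocess

result = subprocess.run("this_command_does_not_exist_hopefully", shell=True,
                        capture_output=True, text=True)
print("exit status:", result.returncode)   # prints 127 on a POSIX shell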
14) Message boards : CMS Application : CMS Simulation error (Message 47563)
Posted 25 Nov 2022 by ivan:
From one of those responsible: Exit code 127 usually means that the executable wasn't found, which gives a big clue.

It seems we might have picked up some other changes when we modified the hibernation time-out. Investigations continue.
15) Message boards : CMS Application : CMS Simulation error (Message 47562)
Posted 25 Nov 2022 by ivan:
We changed a condor parameter yesterday, altering the time that condor waited for a disconnected machine to resume from one day to two hours. It is possible that this is conflicting with some other requirement, leading glidein to abort. Emails have been sent and I'm searching the glidein code to see what error 127 indicates.
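The post does not name the parameter that was changed. One standard HTCondor knob with this kind of effect is the job lease, which bounds how long a disconnected job stays claimed before it is rescheduled; the fragment below is purely a hedged sketch of that idea, an assumption rather than the project's actual configuration.

# Hypothetical submit-file fragment. The project's real setting is not shown
# in the post; job_lease_duration is just one standard HTCondor way to bound
# how long a disconnected job stays claimed before being rescheduled.
# 7200 seconds = two hours; the previous behaviour tolerated about a day.
job_lease_duration = 7200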
16) Message boards : CMS Application : CMS Simulation error (Message 47561)
Posted 25 Nov 2022 by ivan:
Yes, there's a problem. No, I haven't found it yet. If I try to look at our condor jobs from CERN I get

[lxplus780:~] > condor_q -name vocms0267.cern.ch -pool vocms0840.cern.ch
-- Failed to fetch ads from: <137.138.52.94:4080?addrs=137.138.52.94-4080+[2001-1458-d00-17--43]-4080&alias=vocms0267.cern.ch&noUDP&sock=schedd_1783_9173> : vocms0267.cern.ch
AUTHENTICATE:1003:Failed to authenticate with any method

and the pool is depleted:

[lxplus780:~] > condor_status -pool vocms0840.cern.ch
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
glidein_4670_387833785@107107-10489077-18060 LINUX X86_64 Claimed Busy 2.330 2500 0+01:00:46
glidein_4680_323803200@107107-10489077-31377 LINUX X86_64 Claimed Busy 3.070 2500 0+03:49:52
glidein_4648_45869712@107107-10511611-11990 LINUX X86_64 Claimed Busy 1.250 2500 0+01:03:09
glidein_4656_822528270@107107-10511611-15555 LINUX X86_64 Claimed Busy 1.210 2500 0+02:43:25
glidein_4653_323462880@107107-10511611-16804 LINUX X86_64 Claimed Busy 1.430 2500 0+03:35:45
glidein_4646_361662954@107107-10511611-18323 LINUX X86_64 Claimed Busy 1.210 2500 0+03:24:27
glidein_4648_563620040@107107-10511611-22062 LINUX X86_64 Claimed Busy 2.260 2500 0+02:32:14
glidein_4664_267919542@107107-10511611-30334 LINUX X86_64 Claimed Busy 1.260 2500 0+01:59:35
glidein_4656_121852692@107107-10511611-30765 LINUX X86_64 Claimed Busy 2.260 2500 0+00:11:12
glidein_4670_72564800@107107-10511679-3097 LINUX X86_64 Claimed Busy 2.270 2500 0+00:37:28
glidein_4657_476277228@107107-10511679-15034 LINUX X86_64 Claimed Busy 1.690 2500 0+00:40:13
glidein_4656_29749126@107107-10511679-18362 LINUX X86_64 Claimed Busy 2.310 2500 0+02:11:44
glidein_4645_96284106@107107-10511679-21818 LINUX X86_64 Claimed Busy 1.500 2500 0+01:00:07
glidein_4874_26963200@107107-10511679-23757 LINUX X86_64 Claimed Busy 1.420 2500 0+01:33:39
glidein_4656_572271750@107107-10511679-29938 LINUX X86_64 Claimed Busy 1.530 2500 0+03:33:02
glidein_4661_72786555@107107-10511679-31672 LINUX X86_64 Claimed Busy 0.810 2500 0+00:01:37
glidein_4651_831188082@107107-10574756-27824 LINUX X86_64 Claimed Busy 1.850 2500 0+01:08:16
glidein_4646_553968128@107107-10574756-28110 LINUX X86_64 Claimed Busy 1.500 2500 0+01:59:27
glidein_4645_266261752@107107-10574756-28244 LINUX X86_64 Claimed Busy 2.010 2500 0+01:58:33
glidein_4644_353209032@107107-10574756-29278 LINUX X86_64 Claimed Busy 1.740 2500 0+02:51:58
glidein_4649_107372800@107107-10574756-29861 LINUX X86_64 Claimed Busy 1.770 2500 0+00:35:45
glidein_4646_432356158@107107-10574756-29943 LINUX X86_64 Claimed Busy 1.910 2500 0+01:10:31
glidein_4647_591695930@107107-10574756-31949 LINUX X86_64 Claimed Busy 1.740 2500 0+01:24:21
glidein_4650_109275384@107107-10574767-7200 LINUX X86_64 Claimed Busy 1.760 2500 0+03:16:18
glidein_4654_146209624@107107-10574767-7326 LINUX X86_64 Claimed Busy 2.040 2500 0+02:10:39
glidein_4651_667607907@107107-10574767-11294 LINUX X86_64 Claimed Busy 2.230 2500 0+02:30:02
glidein_4643_676121242@107107-10574767-11526 LINUX X86_64 Claimed Busy 0.000 2500 0+00:00:00
glidein_4750_71875323@107107-10574767-22926 LINUX X86_64 Claimed Busy 1.730 2500 0+02:27:44
glidein_4653_455226972@107107-10574767-24067 LINUX X86_64 Claimed Busy 1.410 2500 0+00:22:12
glidein_4652_527975600@107107-10574767-24820 LINUX X86_64 Claimed Busy 1.440 2500 0+00:32:53
glidein_4647_415785516@107107-10574816-7681 LINUX X86_64 Claimed Busy 1.660 2500 0+01:57:45
glidein_4652_241555920@107107-10574816-12265 LINUX X86_64 Claimed Busy 1.970 2500 0+00:36:03
glidein_4650_91299585@107107-10574816-12578 LINUX X86_64 Claimed Busy 1.030 2500 0+03:04:33
glidein_4649_824674365@107107-10574816-18524 LINUX X86_64 Claimed Busy 0.240 2500 0+00:00:00
glidein_4651_914472658@107107-10574816-23179 LINUX X86_64 Claimed Busy 1.770 2500 0+01:32:32
glidein_4667_286539704@107107-10574816-29186 LINUX X86_64 Claimed Busy 2.400 2500 0+01:10:41
glidein_4645_450114845@107107-10803618-56 LINUX X86_64 Claimed Busy 1.490 2500 0+00:58:22
glidein_4647_188663238@107107-10803618-2988 LINUX X86_64 Claimed Busy 1.150 2500 0+01:27:43
glidein_4666_47581620@107107-10803618-6932 LINUX X86_64 Unclaimed Idle 0.150 2500 0+00:00:18
glidein_4648_530694856@107107-10803618-9013 LINUX X86_64 Claimed Busy 1.290 2500 0+02:44:29
glidein_4652_13214578@107107-10803618-13539 LINUX X86_64 Claimed Busy 1.330 2500 0+01:48:24
glidein_4649_228988870@107107-10803618-15298 LINUX X86_64 Claimed Busy 1.580 2500 0+01:12:38
glidein_4647_21341012@107107-10803618-23311 LINUX X86_64 Claimed Busy 1.610 2500 0+03:11:40
glidein_4513_225847896@176180-10679437-10022 LINUX X86_64 Claimed Busy 2.750 2500 0+01:23:28
glidein_4636_427199591@792560-10695588-13481 LINUX X86_64 Claimed Busy 0.000 2500 0+05:50:14

              Machines Owner Claimed Unclaimed Matched Preempting Drain
 X86_64/LINUX       45     0      44         1       0          0     0
        Total       45     0      44         1       0          0     0

I presume the error code (127) from the glidein is due to the problem with vocms0267 (our WMAgent) but on the other hand it is showing OK in WMStats. Investigating...

Hmm, that's strange. Almost all of the VMs in the pool are from the same user -- Ah! I know who it is, I'll expect an email from him soon.
17) Message boards : CMS Application : Recent WMAgent update -- 17/11/2022 (Message 47539)
Posted 17 Nov 2022 by ivan:
Queues are draining in preparation for another WMAgent upgrade.

OK, update done, jobs should be available soon...
18) Message boards : CMS Application : Recent WMAgent update -- 17/11/2022 (Message 47537)
Posted 15 Nov 2022 by ivan:
Queues are draining in preparation for another WMAgent upgrade.
19) Message boards : CMS Application : New Version 70.00 (Message 47489)
Posted 4 Nov 2022 by ivan:
No, -dev was fine. It was the translation to the main project that threw up a problem.
20) Message boards : CMS Application : New Version 70.00 (Message 47476)
Posted 3 Nov 2022 by ivan:
Each of my VMs starts with log entries like these:

Ah, no, that is an error, on two counts! Federica and I just debugged it over a Zoom meeting... Firstly, that ERROR msg should say "could not get an idtoken", and secondly, the multiple requests for an idtoken indicate that this part of the script has a problem. It seems that the version of HTCondor that we are using happily continues with the x509 credentials even if the idtoken is not valid, which is why the jobs are actually acquired and run. More work for Laurence!
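That fallback is what one would expect when several authentication methods are configured and tried in order. The fragment below is a hedged configuration sketch of the idea only; the project's real security settings are not shown in the post.

# Illustrative only: with more than one method listed, HTCondor tries them in
# order, so a client whose IDTOKEN is missing or invalid can still succeed
# with its x509/GSI credentials and keep fetching jobs.
SEC_CLIENT_AUTHENTICATION_METHODS = IDTOKENS, GSI
SEC_DEFAULT_AUTHENTICATION = REQUIRED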