1) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47726)
Posted 11 days ago by ivan
Post:
A new workflow is in the pipeline. Jobs should be available in about one hour.
2) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47718)
Posted 13 days ago by ivan
Post:
Updated calculation: the queues should be almost drained in 24 hours from now.
3) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47713)
Posted 15 days ago by ivan
Post:
I'll be running down the queues at the weekend in preparation for a WMAgent upgrade next week. Be prepared to set No New Tasks on Sunday or so.

Oops, I miscalculated (based on 10,000 jobs rather than 20,000) -- we'll run out of jobs Monday into Tuesday.
4) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47712)
Posted 16 days ago by ivan
Post:
I'll be running down the queues at the weekend in preparation for a WMAgent upgrade next week. Be prepared to set No New Tasks on Sunday or so.
5) Message boards : CMS Application : CMS computation error in 30 seconds every time (Message 47693)
Posted 18 days ago by ivan
Post:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=377315268

Looks like the P.H. LAN as a whole is now misconfigured.
The publicly available logfiles don't tell why, but there's no way to complete a CMS subtask within only 2:30 min of CPU time.
My guess would be that some internet data requested by deeper-level scripts can't be downloaded (=> timeout) and the error doesn't arrive at the BOINC level.
It would require a CERN expert to look through those deeper-level logs.

    2023-01-15 04:45:14 (2360): Guest Log: [INFO] Requesting an idtoken from LHC@home
    2023-01-15 05:22:19 (2360): Guest Log: [INFO] glidein exited with return value 0.


Looks to me like it timed out retrieving the idtoken.


@P.H.
Since you set up very unusual packet redirections, you may have forgotten to forward all required ports in both directions.
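A quick check you can run from a machine behind the same redirection rules (just an illustration, not a full diagnosis) is to confirm the project server is reachable over HTTPS:

    curl -sI https://lhcathome.cern.ch/lhcathome/ | head -n 1

If that hangs or times out instead of returning an HTTP status line, the redirection rules are the first place to look.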
6) Message boards : CMS Application : CMS computation error in 30 seconds every time (Message 47665)
Posted 23 days ago by ivan
Post:
Rough estimation:
A 7 Mbit/s upload bandwidth will be fully saturated by 50 CMS VMs running concurrently.
30. But I guess it depends on CPU speed. I have 126 cores :-/

Did CMS use to work like this? I don't recall having a bandwidth problem before.

Why can't it upload them at the end, using BOINC's normal upload queue, instead of stalling the processing?

Is there an easy way to get equal numbers of Theory and ATLAS as well? I've asked for anything and only get CMS. If I could do some of each, and the others don't have the same bandwidth requirements, I could run all cores on LHC.

From your latest good task:
2023-01-11 20:04:50 (7092): Guest Log: [INFO] Could not find a local HTTP proxy
2023-01-11 20:04:50 (7092): Guest Log: [INFO] CVMFS and Frontier will have to use DIRECT connections
2023-01-11 20:04:50 (7092): Guest Log: [INFO] This makes the application less efficient
2023-01-11 20:04:50 (7092): Guest Log: [INFO] It also puts higher load on the project servers
2023-01-11 20:04:50 (7092): Guest Log: [INFO] Setting up a local HTTP proxy is highly recommended
2023-01-11 20:04:51 (7092): Guest Log: [INFO] Advice can be found in the project forum
Remember that you download a lot of data as well. If you had a local Squid proxy, you could greatly reduce your downloads and the consequent time spent. I think the instructions are in the Number Crunching forum.
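If it helps, a minimal squid.conf along these lines is usually enough for a home LAN -- a sketch only, the 192.168.0.0/16 range and the cache sizes are assumptions you should adapt, and the full recipe in the Number Crunching forum supersedes this:

    # minimal illustrative squid.conf -- example values, adjust to your network
    acl localnet src 192.168.0.0/16               # hosts allowed to use the proxy
    http_access allow localnet
    http_access deny all
    http_port 3128
    cache_mem 256 MB                              # RAM cache for hot CVMFS/Frontier objects
    maximum_object_size 1024 MB                   # allow large objects to be cached
    cache_dir ufs /var/spool/squid 20000 16 256   # ~20 GB on-disk cache

Then point your BOINC client at it (the HTTP proxy setting in the client's options); that, as far as I recall, is how the VM finds the proxy at start-up.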
7) Message boards : CMS Application : CMS computation error in 30 seconds every time (Message 47658)
Posted 23 days ago by ivan
Post:
Unfortunately the reason is not reported back to higher script levels, hence BOINC gets a 'success'.
That's worrying, they're marked as valid in my list of tasks on the server. Does the system later notice a problem and resend those tasks?
In general, yes. This runs at two levels -- WMAgent produces jobs and sends them to the HTCondor server. Your VM instances (i.e. BOINC tasks) ask the condor server for a job. If that job terminates with an error, or your VM loses contact with the server for too long (currently two hours), the condor server requeues it and sends it to a new VM when the queueing allows. If there are several errors for the same job (currently three, IIRC), the condor server notifies the WMAgent, which then requeues the job itself for later resubmission to the condor server. If the job terminates without error, the VM will ask for another job, up until 12+ hours have elapsed in total.
Or does it royally screw up the science?
No, if you look at the job graphs on the homepage you will see 5-10% job failures, but these are just the primary failures seen by condor. In the (unfortunately non-public) monitoring we see that the ultimate failure rate is essentially zero for every 20,000-job workflow submission. We do tend to be generous and allow credit for the CPU time given even if we could detect a failure -- as alluded to above -- though egregious errors will not get any credit.
I will of course be keeping an eye on the run time vs CPU time on the server list of tasks to make sure mine behave and are useful.
You should, ideally, be seeing task logs that look like mine -- https://lhcathome.cern.ch/lhcathome/results.php?userid=14095 -- tasks running for 12 hours or so, with slightly less CPU time. Each task (VM instance) is therefore running five or six two-hour CMS jobs before terminating to allow BOINC to start a new task.
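In shell terms, the life of one VM looks roughly like this -- an illustrative sketch, not the actual glidein code, with fetch_and_run_next_job standing in for the real condor machinery:

    START=$(date +%s)
    # keep asking the condor server for ~2-hour CMS jobs until about 12 hours have passed
    while [ $(( $(date +%s) - START )) -lt $(( 12 * 3600 )) ]; do
        fetch_and_run_next_job || break    # placeholder: get a job, run it, report the result
    done
    # the VM then shuts down so BOINC can start a fresh task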
8) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47629)
Posted 30 Dec 2022 by ivan
Post:
Anybody there to refill the CMS queue this year?

Yes, it's just that we've been running more jobs than usual lately, and I have been taking it easy...
New batch in the pipeline.

Jobs are available again now.
9) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47628)
Posted 30 Dec 2022 by ivan
Post:
Anybody there to refill the CMS queue this year?

Yes, it's just that we've been running more jobs than usual lately, and I have been taking it easy...
New batch in the pipeline.
10) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47581)
Posted 7 Dec 2022 by ivan
Post:
A new workflow submitted last night is stuck in "staging" rather than progressing to "running" so we are running out of jobs. Probably best to set No New Tasks until it's sorted. I've submitted another workflow, but I don't think it will bypass the older one in the queue. WMCore team have been notified.
More later.

The logjam has just been cleared and the task server has noticed that jobs are available, and is sending out tasks again -- I've just got two on my first machine.

The underlying cause is detailed in https://github.com/dmwm/WMCore/issues/11386 and the cure in https://github.com/dmwm/WMCore/pull/11387.
All my machines now are running a quota of tasks.
11) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47580)
Posted 7 Dec 2022 by ivan
Post:
A new workflow submitted last night is stuck in "staging" rather than progressing to "running" so we are running out of jobs. Probably best to set No New Tasks until it's sorted. I've submitted another workflow, but I don't think it will bypass the older one in the queue. WMCore team have been notified.
More later.

The logjam has just been cleared and the task server has noticed that jobs are available, and is sending out tasks again -- I've just got two on my first machine.
12) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 47577)
Posted 6 Dec 2022 by ivan
Post:
A new workflow submitted last night is stuck in "staging" rather than progressing to "running" so we are running out of jobs. Probably best to set No New Tasks until it's sorted. I've submitted another workflow, but I don't think it will bypass the older one in the queue. WMCore team have been notified.
More later.
13) Message boards : CMS Application : CMS Simulation error (Message 47564)
Posted 25 Nov 2022 by ivan
Post:
From one of the people responsible:
Exit code 127 usually means that the executable wasn't found, which gives a big clue.

It seems we might have picked up some other changes when we modified the hibernation time-out. Investigations continue.

We've reverted the change that garbled our glidein script -- I'm running main and -dev jobs successfully now.
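For reference, 127 is just the shell's "command not found" status, easy to reproduce (the command name below is deliberately made up):

    $ no_such_executable; echo $?
    bash: no_such_executable: command not found
    127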
14) Message boards : CMS Application : CMS Simulation error (Message 47563)
Posted 25 Nov 2022 by ivan
Post:
From one of the people responsible:
Exit code 127 usually means that the executable wasn't found, which gives a big clue.

It seems we might have picked up some other changes when we modified the hibernation time-out. Investigations continue.
15) Message boards : CMS Application : CMS Simulation error (Message 47562)
Posted 25 Nov 2022 by ivan
Post:
We changed a condor parameter yesterday, altering the time that condor waits for a disconnected machine to resume from one day to two hours. It is possible that this is conflicting with some other requirement, leading glidein to abort. Emails have been sent and I'm searching the glidein code to see what error 127 indicates.
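For the curious: in an ordinary HTCondor submit file the reconnection window is the job lease, so a two-hour value would look something like the fragment below. This is illustrative only -- I'm not claiming this is the exact knob we changed on the glidein side:

    # hypothetical submit-file fragment; 7200 s = 2 hours
    job_lease_duration = 7200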
16) Message boards : CMS Application : CMS Simulation error (Message 47561)
Posted 25 Nov 2022 by ivan
Post:
Yes, there's a problem.
No, I haven't found it yet.
If I try to look at our condor jobs from CERN I get
[lxplus780:~] > condor_q -name vocms0267.cern.ch -pool vocms0840.cern.ch 

-- Failed to fetch ads from: <137.138.52.94:4080?addrs=137.138.52.94-4080+[2001-1458-d00-17--43]-4080&alias=vocms0267.cern.ch&noUDP&sock=schedd_1783_9173> : vocms0267.cern.ch
AUTHENTICATE:1003:Failed to authenticate with any method

and the pool is depleted:
[lxplus780:~] > condor_status -pool vocms0840.cern.ch
Name                                         OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

glidein_4670_387833785@107107-10489077-18060 LINUX      X86_64 Claimed   Busy      2.330 2500  0+01:00:46
glidein_4680_323803200@107107-10489077-31377 LINUX      X86_64 Claimed   Busy      3.070 2500  0+03:49:52
glidein_4648_45869712@107107-10511611-11990  LINUX      X86_64 Claimed   Busy      1.250 2500  0+01:03:09
glidein_4656_822528270@107107-10511611-15555 LINUX      X86_64 Claimed   Busy      1.210 2500  0+02:43:25
glidein_4653_323462880@107107-10511611-16804 LINUX      X86_64 Claimed   Busy      1.430 2500  0+03:35:45
glidein_4646_361662954@107107-10511611-18323 LINUX      X86_64 Claimed   Busy      1.210 2500  0+03:24:27
glidein_4648_563620040@107107-10511611-22062 LINUX      X86_64 Claimed   Busy      2.260 2500  0+02:32:14
glidein_4664_267919542@107107-10511611-30334 LINUX      X86_64 Claimed   Busy      1.260 2500  0+01:59:35
glidein_4656_121852692@107107-10511611-30765 LINUX      X86_64 Claimed   Busy      2.260 2500  0+00:11:12
glidein_4670_72564800@107107-10511679-3097   LINUX      X86_64 Claimed   Busy      2.270 2500  0+00:37:28
glidein_4657_476277228@107107-10511679-15034 LINUX      X86_64 Claimed   Busy      1.690 2500  0+00:40:13
glidein_4656_29749126@107107-10511679-18362  LINUX      X86_64 Claimed   Busy      2.310 2500  0+02:11:44
glidein_4645_96284106@107107-10511679-21818  LINUX      X86_64 Claimed   Busy      1.500 2500  0+01:00:07
glidein_4874_26963200@107107-10511679-23757  LINUX      X86_64 Claimed   Busy      1.420 2500  0+01:33:39
glidein_4656_572271750@107107-10511679-29938 LINUX      X86_64 Claimed   Busy      1.530 2500  0+03:33:02
glidein_4661_72786555@107107-10511679-31672  LINUX      X86_64 Claimed   Busy      0.810 2500  0+00:01:37
glidein_4651_831188082@107107-10574756-27824 LINUX      X86_64 Claimed   Busy      1.850 2500  0+01:08:16
glidein_4646_553968128@107107-10574756-28110 LINUX      X86_64 Claimed   Busy      1.500 2500  0+01:59:27
glidein_4645_266261752@107107-10574756-28244 LINUX      X86_64 Claimed   Busy      2.010 2500  0+01:58:33
glidein_4644_353209032@107107-10574756-29278 LINUX      X86_64 Claimed   Busy      1.740 2500  0+02:51:58
glidein_4649_107372800@107107-10574756-29861 LINUX      X86_64 Claimed   Busy      1.770 2500  0+00:35:45
glidein_4646_432356158@107107-10574756-29943 LINUX      X86_64 Claimed   Busy      1.910 2500  0+01:10:31
glidein_4647_591695930@107107-10574756-31949 LINUX      X86_64 Claimed   Busy      1.740 2500  0+01:24:21
glidein_4650_109275384@107107-10574767-7200  LINUX      X86_64 Claimed   Busy      1.760 2500  0+03:16:18
glidein_4654_146209624@107107-10574767-7326  LINUX      X86_64 Claimed   Busy      2.040 2500  0+02:10:39
glidein_4651_667607907@107107-10574767-11294 LINUX      X86_64 Claimed   Busy      2.230 2500  0+02:30:02
glidein_4643_676121242@107107-10574767-11526 LINUX      X86_64 Claimed   Busy      0.000 2500  0+00:00:00
glidein_4750_71875323@107107-10574767-22926  LINUX      X86_64 Claimed   Busy      1.730 2500  0+02:27:44
glidein_4653_455226972@107107-10574767-24067 LINUX      X86_64 Claimed   Busy      1.410 2500  0+00:22:12
glidein_4652_527975600@107107-10574767-24820 LINUX      X86_64 Claimed   Busy      1.440 2500  0+00:32:53
glidein_4647_415785516@107107-10574816-7681  LINUX      X86_64 Claimed   Busy      1.660 2500  0+01:57:45
glidein_4652_241555920@107107-10574816-12265 LINUX      X86_64 Claimed   Busy      1.970 2500  0+00:36:03
glidein_4650_91299585@107107-10574816-12578  LINUX      X86_64 Claimed   Busy      1.030 2500  0+03:04:33
glidein_4649_824674365@107107-10574816-18524 LINUX      X86_64 Claimed   Busy      0.240 2500  0+00:00:00
glidein_4651_914472658@107107-10574816-23179 LINUX      X86_64 Claimed   Busy      1.770 2500  0+01:32:32
glidein_4667_286539704@107107-10574816-29186 LINUX      X86_64 Claimed   Busy      2.400 2500  0+01:10:41
glidein_4645_450114845@107107-10803618-56    LINUX      X86_64 Claimed   Busy      1.490 2500  0+00:58:22
glidein_4647_188663238@107107-10803618-2988  LINUX      X86_64 Claimed   Busy      1.150 2500  0+01:27:43
glidein_4666_47581620@107107-10803618-6932   LINUX      X86_64 Unclaimed Idle      0.150 2500  0+00:00:18
glidein_4648_530694856@107107-10803618-9013  LINUX      X86_64 Claimed   Busy      1.290 2500  0+02:44:29
glidein_4652_13214578@107107-10803618-13539  LINUX      X86_64 Claimed   Busy      1.330 2500  0+01:48:24
glidein_4649_228988870@107107-10803618-15298 LINUX      X86_64 Claimed   Busy      1.580 2500  0+01:12:38
glidein_4647_21341012@107107-10803618-23311  LINUX      X86_64 Claimed   Busy      1.610 2500  0+03:11:40
glidein_4513_225847896@176180-10679437-10022 LINUX      X86_64 Claimed   Busy      2.750 2500  0+01:23:28
glidein_4636_427199591@792560-10695588-13481 LINUX      X86_64 Claimed   Busy      0.000 2500  0+05:50:14

               Machines Owner Claimed Unclaimed Matched Preempting  Drain

  X86_64/LINUX       45     0      44         1       0          0      0

         Total       45     0      44         1       0          0      0

I presume the error code (127) from the glidein is due to the problem with vocms0267 (our WMAgent), but on the other hand it is showing OK in WMStats.
Investigating...
Hmm, that's strange. Almost all of the VMs in the pool are from the same user -- Ah! I know who it is; I'll expect an email from him soon.
17) Message boards : CMS Application : Recent WMAgent update -- 17/11/2022 (Message 47539)
Posted 17 Nov 2022 by ivan
Post:
Queues are draining in preparation for another WMAgent upgrade.

OK, update done, jobs should be available soon...
18) Message boards : CMS Application : Recent WMAgent update -- 17/11/2022 (Message 47537)
Posted 15 Nov 2022 by ivan
Post:
Queues are draining in preparation for another WMAgent upgrade.
19) Message boards : CMS Application : New Version 70.00 (Message 47489)
Posted 4 Nov 2022 by ivan
Post:
No, -dev was fine. It was the translation to the main project that threw up a problem.
20) Message boards : CMS Application : New Version 70.00 (Message 47476)
Posted 3 Nov 2022 by ivan
Post:
Each of my VMs starts with log entries like these:
2022-11-01 12:50:43 (39334): Guest Log: [INFO] CMS application starting. Check log files.
2022-11-01 12:50:43 (39334): Guest Log: [INFO] Requesting an idtoken from LHC@home
2022-11-01 12:50:44 (39334): Guest Log: [INFO] Requesting an idtoken from vLHC@home-dev
2022-11-01 12:51:15 (39334): Guest Log: [INFO] Requesting an idtoken from LHC@home
2022-11-01 12:51:15 (39334): Guest Log: [INFO] Requesting an idtoken from vLHC@home-dev
2022-11-01 12:51:45 (39334): Guest Log: [INFO] Requesting an idtoken from LHC@home
2022-11-01 12:51:46 (39334): Guest Log: [INFO] Requesting an idtoken from vLHC@home-dev
2022-11-01 12:52:17 (39334): Guest Log: [INFO] Requesting an idtoken from LHC@home
2022-11-01 12:52:17 (39334): Guest Log: [INFO] Requesting an idtoken from vLHC@home-dev
2022-11-01 12:52:48 (39334): Guest Log: [INFO] Requesting an idtoken from LHC@home
2022-11-01 12:52:49 (39334): Guest Log: [INFO] Requesting an idtoken from vLHC@home-dev
2022-11-01 12:53:20 (39334): Guest Log: [INFO] Requesting an idtoken from LHC@home
2022-11-01 12:53:20 (39334): Guest Log: [INFO] Requesting an idtoken from vLHC@home-dev
2022-11-01 12:53:52 (39334): Guest Log: [DEBUG]   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2022-11-01 12:53:52 (39334): Guest Log:                                  Dload  Upload   Total   Spent    Left  Speed
2022-11-01 12:53:52 (39334): Guest Log:   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
2022-11-01 12:53:52 (39334): Guest Log: 100   221  100   221    0     0    436      0 --:--:-- --:--:-- --:--:--   437
2022-11-01 12:53:52 (39334): Guest Log: [DEBUG]   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
2022-11-01 12:53:52 (39334): Guest Log:                                  Dload  Upload   Total   Spent    Left  Speed
2022-11-01 12:53:52 (39334): Guest Log:   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
2022-11-01 12:53:52 (39334): Guest Log: 100   221  100   221    0     0    436      0 --:--:-- --:--:-- --:--:--   437
2022-11-01 12:53:52 (39334): Guest Log: [ERROR] Could not get an x509 credential

Nonetheless the CMS jobs seem to run fine.

Ah, no, that is an error, on two counts! Federica and I just debugged it over a Zoom meeting... Firstly, that ERROR msg should say "could not get an idtoken", and secondly, the multiple requests for an idtoken indicate that this part of the script has a problem.
It seems that the version of HTCondor that we are using happily continues with the x509 credentials even if the idtoken is not valid, which is why the jobs are actually acquired and run.
More work for Laurence!
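Presumably the cure is along these lines -- cap the retries and fail with the right message. A sketch in shell, illustrative only (TOKEN_URL is a stand-in, not the real endpoint):

    # illustrative retry loop, not the real vLHC script
    for attempt in 1 2 3 4 5 6; do
        IDTOKEN=$(curl -sf "$TOKEN_URL") && break   # hypothetical token endpoint
        sleep 30
    done
    if [ -z "$IDTOKEN" ]; then
        echo "[ERROR] Could not get an idtoken" >&2   # not "x509 credential"
        exit 1
    fi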

