Message boards : CMS Application : CMS Tasks Failing

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42731 - Posted: 1 Jun 2020, 11:12:29 UTC - in response to Message 42707.  

hello Ivan,
once more many thanks for your thorough reply :-)

So all we can do is to keep our fingers crossed that one day CMS may work again!

Yes, I'm afraid so. It was hard enough to co-ordinate debugging between the various disparate sub-systems in the best of times, but with the current disruptions it is even more difficult. Currently, updates are being pushed to our WMAgent server every two days, which can lead to problems like Friday night's (admittedly it _is_ the testbed server, but some stability would be nice).
Fortunately, the update pushed this morning seems to have undone the error introduced on Friday, so I can monitor the job queues again. Unfortunately, the pending queues are still not being replenished, so there are still task failures due to lack of runnable jobs. Currently we have about 60 jobs out in the field (12 of which are mine...) so there must be enough machines running on autopilot to pick up a job when it becomes available.
ID: 42731
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42733 - Posted: 1 Jun 2020, 12:33:22 UTC - in response to Message 42731.  
Last modified: 3 Jun 2020, 8:13:02 UTC

Currently we have about 60 jobs out in the field (12 of which are mine...) so there must be enough machines running on autopilot to pick up a job when it becomes available.
I caught a job of yours with jobID 7427.
ID: 42733
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42794 - Posted: 3 Jun 2020, 21:38:46 UTC - in response to Message 42731.  

OK, some good news for once. The post-production jobs on T3_CH_CMSAtHome (Laurence's VMs) had stalled. It seems they may have been affected by some IT problems at CERN last week; he updated the VM images this afternoon and now post-prod is running again.
More pertinently for the volunteers, we identified the problem that not enough jobs were being sent to the HTCondor queue -- a manual configuration tweak had been overlooked. That's been corrected and there are now several hundred jobs pending. As expected, this led to an upward spike in the number of running jobs since, for the moment at least, tasks aren't timing out and failing for want of a job from condor.
ID: 42794
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42796 - Posted: 4 Jun 2020, 12:03:19 UTC

Not normal behaviour:

06/04/20 13:31:33 (pid:11950) CCBListener: no activity from CCB server in 2387s; assuming connection is dead.
06/04/20 13:31:33 (pid:11950) CCBListener: connection to CCB server vocms0840.cern.ch failed; will try to reconnect in 60 seconds.
06/04/20 13:32:34 (pid:11950) WARNING: forward resolution of 85.156.138.137.in-addr.arpa doesn't match 137.138.156.85!
06/04/20 13:32:34 (pid:11950) CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=137.138.156.85-9618#2123061
06/04/20 13:51:33 (pid:11950) No reconnect from shadow for 1200 seconds, aborting job execution!
06/04/20 13:51:33 (pid:11950) ShutdownFast all jobs.
06/04/20 13:51:33 (pid:11950) Process exited, pid=13019, signal=9
06/04/20 13:51:33 (pid:11950) Failed to send job exit status to shadow
06/04/20 13:51:33 (pid:11950) Last process exited, now Starter is exiting
06/04/20 13:51:33 (pid:11950) **** condor_starter (condor_STARTER) pid 11950 EXITING WITH STATUS 2

jobID 8626

followed by jobID 9347
ID: 42796
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 42797 - Posted: 4 Jun 2020, 13:59:04 UTC

Four of them are running fine for me (i7-4771, Win7 64-bit, VBox 5.2.42, BOINC 7.16.7).
They take up a lot of disk space, which is not a problem on my Linux machines. But this one has only 128 GB, so I need to increase it to run more.
ID: 42797
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2443
Credit: 230,717,753
RAC: 122,403
Message 42798 - Posted: 4 Jun 2020, 14:15:12 UTC

Didn't see a typical CMS error throughout the whole day.
Just a "postponed ..." error on one of my hosts that was running VirtualBox 6.1.8.
All others are running fine with VirtualBox 6.0.22.
ID: 42798
Erich56

Joined: 18 Dec 15
Posts: 1712
Credit: 106,235,067
RAC: 73,859
Message 42799 - Posted: 6 Jun 2020, 12:26:50 UTC

The ones I have downloaded since yesterday worked out fine, on VirtualBox 6.1.2.
ID: 42799
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42807 - Posted: 7 Jun 2020, 13:13:26 UTC - in response to Message 42796.  

That looks like the behaviour I saw when I tried to run a task on my Windows 10 laptop at home, with my very limited network bandwidth (using a 4G WiFi modem in a mobile not-spot). Connection to CERN dropped out continually, and a couple of jobs just stopped partway through, but then a new job was obtained and started successfully.
ID: 42807
PaoloNasca

Joined: 11 Jul 19
Posts: 6
Credit: 1,636,611
RAC: 477
Message 42809 - Posted: 7 Jun 2020, 19:39:04 UTC

Just to share and comment on what happened to two failed CMS WUs.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=276498495
https://lhcathome.cern.ch/lhcathome/result.php?resultid=276571790

The VM was in the "Running" state for 12 minutes (from 00:14:57 to 00:26:51).
Then the VM was in the "Paused" state for 18 hours (66684 seconds, from 00:26:51 to 18:38:07).
After that, the VM was in the "Running" state again for 12 minutes.
At 18:50:23, the log shows the error: "Condor ended after 66684 seconds".

I'm really surprised.
18 hours in the "Paused" state makes no sense.
Let me explain better:
after 18 hours of inactivity, some procedures could run into time-out issues.

<core_client_version>7.16.7</core_client_version>
<![CDATA[
<message>
Funzione non corretta.
(0x1) - exit code 1 (0x1)</message>
<stderr_txt>
2020-06-07 00:14:39 (12788): Detected: vboxwrapper 26197
2020-06-07 00:14:39 (12788): Detected: BOINC client v7.7
2020-06-07 00:14:40 (12788): Detected: VirtualBox VboxManage Interface (Version: 5.2.8)
...omitted....
2020-06-07 00:14:57 (12788): VM state change detected. (old = 'PoweredOff', new = 'Running')
...omitted....
2020-06-07 00:19:02 (12788): Guest Log: [DEBUG] HTCondor ping
2020-06-07 00:19:03 (12788): Guest Log: [DEBUG] 0
2020-06-07 00:26:51 (12788): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 18:38:07 (12788): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 18:38:08 (12788): Guest Log: 00:10:40.055696 timesync vgsvcTimeSyncWorker: Radical host time change: 65 481 854 000 000ns (HostNow=1 591 547 868 663 000 000 ns HostLast=1 591 482 386 809 000 000 ns)
2020-06-07 18:38:18 (12788): Guest Log: 00:10:50.056476 timesync vgsvcTimeSyncWorker: Radical guest time change: 65 470 267 403 000ns (GuestNow=1 591 547 878 663 789 000 ns GuestLast=1 591 482 408 396 386 000 ns fSetTimeLastLoop=true )
2020-06-07 18:50:23 (12788): Guest Log: [ERROR] Condor ended after 66684 seconds.
2020-06-07 18:50:23 (12788): Guest Log: [INFO] Shutting Down.
ID: 42809
PaoloNasca

Joined: 11 Jul 19
Posts: 6
Credit: 1,636,611
RAC: 477
Message 42810 - Posted: 8 Jun 2020, 0:12:52 UTC

Only one CMS WU ended successfully.

I'd like to share a thought with Ivan and the developing team.

The WU ends successfully if the Elapsed Time is less than the Job Duration (64800 sec = 18 hours).

My conclusion is: the VM has to stay in the "Running" state continuously, without any break/suspension.
What do you think about building a VM with 2 or more CPUs?
The Elapsed Time would be inversely proportional to the CPU power.

Here below an extract from
https://lhcathome.cern.ch/lhcathome/result.php?resultid=276449837

….omitted….
2020-06-07 01:03:35 (12068): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 01:08:35 (12068): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 01:20:09 (12068): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 01:29:16 (12068): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 02:01:05 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 02:01:05 (12068): Status Report: Elapsed Time: '6000.992973'
2020-06-07 02:01:05 (12068): Status Report: CPU Time: '5208.796875'
2020-06-07 03:41:48 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 03:41:48 (12068): Status Report: Elapsed Time: '12000.992973'
2020-06-07 03:41:48 (12068): Status Report: CPU Time: '10897.078125'
2020-06-07 05:22:14 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 05:22:14 (12068): Status Report: Elapsed Time: '18000.992973'
2020-06-07 05:22:14 (12068): Status Report: CPU Time: '16861.296875'
2020-06-07 07:02:59 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 07:02:59 (12068): Status Report: Elapsed Time: '24000.992973'
2020-06-07 07:02:59 (12068): Status Report: CPU Time: '22545.656250'
2020-06-07 08:43:26 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 08:43:26 (12068): Status Report: Elapsed Time: '30001.494833'
2020-06-07 08:43:26 (12068): Status Report: CPU Time: '28544.906250'
2020-06-07 10:09:12 (12068): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 10:16:02 (12068): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 10:30:53 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 10:30:53 (12068): Status Report: Elapsed Time: '36001.494833'
2020-06-07 10:30:53 (12068): Status Report: CPU Time: '34246.296875'
2020-06-07 12:11:26 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 12:11:26 (12068): Status Report: Elapsed Time: '42001.494833'
2020-06-07 12:11:26 (12068): Status Report: CPU Time: '39958.921875'
2020-06-07 13:52:11 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 13:52:11 (12068): Status Report: Elapsed Time: '48001.494833'
2020-06-07 13:52:11 (12068): Status Report: CPU Time: '45994.109375'
2020-06-07 15:32:39 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 15:32:39 (12068): Status Report: Elapsed Time: '54001.494833'
2020-06-07 15:32:39 (12068): Status Report: CPU Time: '48189.859375'
2020-06-07 17:12:49 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 17:12:49 (12068): Status Report: Elapsed Time: '60001.494833'
2020-06-07 17:12:49 (12068): Status Report: CPU Time: '48252.187500'
2020-06-07 18:32:55 (12068): Powering off VM.
ID: 42810
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42811 - Posted: 8 Jun 2020, 8:58:31 UTC - in response to Message 42810.  

What do you think about building a VM with 2 or more CPUs?
The task will not run twice as fast.
The CMS-VM is set up so that when the VM has 2 cores, it runs 2 CMS jobs inside the VM, so it takes about as long as a single-core VM while getting through 2 jobs on 2 cores.
Did you have a look into the Console (ALT-F2) or Graphics (logfile) to see how many events were processed before the task was killed?
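To illustrate the point above: the VM runs one job per core rather than splitting a single job across cores, so adding cores raises throughput, not speed. A rough sketch (hypothetical numbers and function name, not project code; 2.75 h is just a mid-range job runtime):

```python
# Rough model of why adding cores doesn't shorten a CMS task: the VM
# runs one job per core; it never splits one job across cores.
# (Hypothetical sketch, not project code.)
import math

def wall_time_hours(n_jobs, n_cores, job_runtime_h=2.75):
    """Wall time for a batch: each core works through its own share of
    jobs, so a single job always takes the full job_runtime_h."""
    jobs_per_core = math.ceil(n_jobs / n_cores)
    return jobs_per_core * job_runtime_h

# One job on one core and one job per core on two cores take the
# same wall time:
# wall_time_hours(1, 1) == wall_time_hours(2, 2)
```

So a 2-core VM that must process 2 jobs finishes no sooner than a 1-core VM processing 1 job; only the number of jobs completed per day improves.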
ID: 42811
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2443
Credit: 230,717,753
RAC: 122,403
Message 42812 - Posted: 8 Jun 2020, 9:05:26 UTC - in response to Message 42810.  

What you see is the correct behavior.


CMS tasks are an envelope to run "jobs", also called "subtasks".
Typical runtimes per job are roughly 2.5-3 h.

CMS tasks are designed to run for 12 h.
A task hitting this limit will finish its running job and then finish itself.

The 18 h you see is a watchdog limit.
Any task hitting this limit will shut down regardless of the process state inside the VM.
This avoids tasks running forever if something inside the VM is trapped in an endless loop.
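As a rough illustration of how the two limits interact (hypothetical Python, not project code; 2.75 h is simply a mid-range value of the 2.5-3 h job runtime quoted above):

```python
# Toy model of the two CMS task limits: a soft 12 h limit after which
# no new job is started, and an 18 h watchdog that kills anything still
# alive. (Illustrative sketch, not project code.)
TASK_LIMIT_H = 12.0      # task stops starting new jobs after this
WATCHDOG_LIMIT_H = 18.0  # hard shutdown, whatever the VM is doing

def simulate_task(job_runtime_h=2.75):
    """Jobs run back to back; a job started before the 12 h mark is
    allowed to finish, and the watchdog ends anything alive at 18 h."""
    elapsed = 0.0
    jobs_done = 0
    while elapsed < TASK_LIMIT_H:
        elapsed += job_runtime_h
        jobs_done += 1
    hit_watchdog = elapsed >= WATCHDOG_LIMIT_H
    return jobs_done, elapsed, hit_watchdog
```

With 2.75 h jobs the task finishes 5 jobs in 13.75 h, comfortably inside the watchdog; the 18 h shutdown is only reached when a job hangs or the VM spends long stretches paused.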
ID: 42812
PaoloNasca

Joined: 11 Jul 19
Posts: 6
Credit: 1,636,611
RAC: 477
Message 42813 - Posted: 8 Jun 2020, 9:43:57 UTC - in response to Message 42811.  

Did you have a look into the Console (ALT-F2) or Graphics (logfile) to see how many events were processed before the task was killed?


I don’t know how to check what you asked about.
Please give me more info and I’ll be very happy and proud to help the development team.
Thanks for your time.
ID: 42813
PaoloNasca

Joined: 11 Jul 19
Posts: 6
Credit: 1,636,611
RAC: 477
Message 42814 - Posted: 8 Jun 2020, 10:02:26 UTC - in response to Message 42812.  

CMS tasks are designed to run 12h.
Tasks hitting this limit will finish the running job and then finish itself.
The 18h you see is a watchdog limit.


Do you mean the CMS tasks need a minimum CPU power?
Let me explain better.
The first CMS task that ended successfully took nearly 17 hours.
I think my PC isn’t suitable for the CMS@home project.
Your reply is really appreciated.
Thanks
ID: 42814
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42815 - Posted: 8 Jun 2020, 11:09:36 UTC - in response to Message 42813.  
Last modified: 8 Jun 2020, 11:11:13 UTC

I don’t know how to read what you asked me.
Please give me more info and I’ll be very happy and proud to help the development team.
The current CMS jobs running inside the VM have to process 10,000 events each; the next job only starts if there is time left within the first 12 hours.
To see the progress you may use BOINC Manager: when you highlight a running CMS task, you'll find a 'Show graphics' button on the left below 'Commands'.
That gives you access to a logfile inside the VM.
For the other option I mentioned (Console), you have to install the VirtualBox Extension Pack matching the VirtualBox version you are using. You can get it from virtualbox.org.

Btw: give a CMS task 100% of the CPU - no throttling.
ID: 42815
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42817 - Posted: 8 Jun 2020, 12:07:08 UTC - in response to Message 42810.  

What do you think about building a VM with 2 or more CPUs?
The Elapsed Time would be inversely proportional to the CPU power.

We've done that on CMS@Home-dev, but instead of one task using N CPUs for one job, the VM runs N jobs -- unfortunately, not all CMS jobs can run thread-safely (a particular concern with my other hat on, as a developer on the Phase-2 Level-1 Track-Trigger project). In my experience, there was enough overhead (staggering start-ups so as not to overload the internet connection, jobs finishing early and leaving their CPU idle while the other job(s) complete, etc.) that the optimum number of CPUs was N=2.
ID: 42817
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42839 - Posted: 11 Jun 2020, 22:30:53 UTC - in response to Message 42812.  

What you see is the correct behavior.


CMS tasks are an envelope to run "jobs", also called "subtasks".
Typical runtimes per job are roughly 2.5-3 h.

CMS tasks are designed to run for 12 h.
A task hitting this limit will finish its running job and then finish itself.

The 18 h you see is a watchdog limit.
Any task hitting this limit will shut down regardless of the process state inside the VM.
This avoids tasks running forever if something inside the VM is trapped in an endless loop.

There is another time-out to consider, which should only come into play if you suspend BOINC (presumably to shut down your computer -- if you do so, please make sure all BOINC programmes finish before switching off!). WMAgent kills a "running" job after 47 hours of "wall-time" (i.e. real-world elapsed time); if your task is suspended, HTCondor and WMAgent don't know that and assume its job is still running.
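A minimal sketch of why suspension is dangerous here (hypothetical function; WMAgent's bookkeeping is server-side and cannot see that the VM is paused):

```python
# Toy model of the WMAgent wall-time limit described above: hours spent
# suspended still count, because the server only sees elapsed time.
# (Hypothetical sketch, not WMAgent code.)
WMAGENT_WALL_LIMIT_H = 47  # WMAgent kills a "running" job after this

def job_killed_by_wmagent(active_hours, suspended_hours):
    """Wall-time is real-world elapsed time, so a long suspension can
    push even a short job over the limit."""
    return active_hours + suspended_hours > WMAGENT_WALL_LIMIT_H
```

A 3 h job easily survives a short pause, but the same job suspended for two days is killed server-side even though it did nothing wrong locally.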
ID: 42839
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42840 - Posted: 11 Jun 2020, 22:33:26 UTC - in response to Message 42807.  

That looks like the behaviour I saw when I tried to run a task on my Windows 10 laptop at home, with my very limited network bandwidth (using a 4G WiFi modem in a mobile not-spot). Connection to CERN dropped out continually, and a couple of jobs just stopped partway through, but then a new job was obtained and started successfully.

Is this (jobs stopping part-way through, but the task continuing) still being observed? I tried to provoke it last night but everything seemed to run tickety-boo.
ID: 42840
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42841 - Posted: 12 Jun 2020, 1:27:41 UTC - in response to Message 42840.  
Last modified: 12 Jun 2020, 1:28:34 UTC

That looks like the behaviour I saw when I tried to run a task on my Windows 10 laptop at home, with my very limited network bandwidth (using a 4G WiFi modem in a mobile not-spot). Connection to CERN dropped out continually, and a couple of jobs just stopped partway through, but then a new job was obtained and started successfully.

Is this (jobs stopping part-way through, but the task continuing) still being observed? I tried to provoke it last night but everything seemed to run tickety-boo.

Hah, it looks like I caught one tonight! I set up a script to capture the log files every five minutes, and the first job stopped at around 9,500 events before the task then started a new job successfully. It's way past my bedtime; I'll leave it until tomorrow to cross-reference the logs and try to find out what went awry.
ID: 42841
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42849 - Posted: 12 Jun 2020, 15:22:13 UTC - in response to Message 42841.  

That looks like the behaviour I saw when I tried to run a task on my Windows 10 laptop at home, with my very limited network bandwidth (using a 4G WiFi modem in a mobile not-spot). Connection to CERN dropped out continually, and a couple of jobs just stopped partway through, but then a new job was obtained and started successfully.

Is this (jobs stopping part-way through, but the task continuing) still being observed? I tried to provoke it last night but everything seemed to run tickety-boo.

Hah, it looks like I caught one tonight! I set up a script to capture the log files every five minutes, and the first job stopped at around 9,500 events before the task then started a new job successfully. It's way past my bedtime; I'll leave it until tomorrow to cross-reference the logs and try to find out what went awry.

OK, here's the analysis. Executive summary: jobs are aborted if loss of communication exceeds 20 minutes, but the task does not see this as a failure.

The logs can be seen at https://www.brunel.ac.uk/~eesridr/CMSAtHome_13_23_56/logs/. The output from the jobs is seen in finished_[1-6].log and running.log (the last job to run). The StartLog, StarterLog and MasterLog are the respective condor logs. Note that finished_1.log is smaller than the others; this is because it only generated 9,358 events instead of the requested 10,000:
Begin processing the 9358th record. Run 1, Event 30609358, LumiSection 61219 on stream 0 at 12-Jun-2020 00:10:26.443 UTC

The BOINC report on this task is at
https://lhcathome.cern.ch/lhcathome/result.php?resultid=277290914
. It shows nothing exceptional for a task which completed normally.
It can be seen in the condor logs that there are several breaks in communication back to CERN, e.g.
06/12/20 02:16:29 (pid:1519) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <137.138.52.94:4080>.
06/12/20 02:16:29 (pid:1519) IO: Failed to read packet header
06/12/20 02:16:29 (pid:1519) Lost connection to shadow, waiting 1200 secs for reconnect
06/12/20 02:17:36 (pid:1519) Accepted request to reconnect from <:0>
06/12/20 02:17:36 (pid:1519) Ignoring old shadow <137.138.52.94:4080?addrs=137.138.52.94-4080&alias=vocms0267.cern.ch&noUDP&sock=shadow_2424955_eb4b_15537>
06/12/20 02:17:36 (pid:1519) Communicating with shadow <137.138.52.94:4080?addrs=137.138.52.94 4080&alias=vocms0267.cern.ch&noUDP&sock=shadow_2424955_eb4b_15537>
06/12/20 02:17:36 (pid:1519) Recovered connection to shadow after 67 seconds

I'm not sure how many of these network problems are due to my limited and patchy network connection at home, but similar problems have been seen elsewhere; I know that there are other volunteers who also have limited networking. In the case of the short-running job, the failure occurred about 00:10:26.443 UTC, or 01:10:26 local. In StarterLog we find:
06/12/20 00:50:26 (pid:20894) IO: Failed to read packet header
06/12/20 00:50:26 (pid:20894) Lost connection to shadow, waiting 1200 secs for reconnect
06/12/20 01:10:26 (pid:20894) No reconnect from shadow for 1200 seconds, aborting job execution!
06/12/20 01:10:26 (pid:20894) ShutdownFast all jobs.
06/12/20 01:10:26 (pid:20894) Process exited, pid=28703, signal=9
06/12/20 01:10:26 (pid:20894) Failed to send job exit status to shadow
06/12/20 01:10:26 (pid:20894) Last process exited, now Starter is exiting
so this short job is caused by a failure to reconnect within 20 minutes, as surmised.
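The timestamps in the two log excerpts can be checked directly (a throwaway script written for this thread, assuming the `%m/%d/%y %H:%M:%S` date format the logs use; the real watchdog lives inside condor_starter):

```python
# Check the shadow-reconnect watchdog against the log timestamps above:
# a job is aborted if the shadow does not reconnect within 1200 s.
# (Hypothetical helper, not HTCondor code.)
from datetime import datetime

RECONNECT_WINDOW_S = 1200  # "waiting 1200 secs for reconnect"

def outage_fatal(lost_at, back_at, fmt="%m/%d/%y %H:%M:%S"):
    """True when the gap between losing the shadow and regaining contact
    reaches the reconnect window, i.e. the job would be aborted."""
    gap = datetime.strptime(back_at, fmt) - datetime.strptime(lost_at, fmt)
    return gap.total_seconds() >= RECONNECT_WINDOW_S

# The 02:16:29 outage recovered after 67 s; the 00:50:26 one ran the
# full 20 minutes before the abort at 01:10:26:
print(outage_fatal("06/12/20 02:16:29", "06/12/20 02:17:36"))  # False
print(outage_fatal("06/12/20 00:50:26", "06/12/20 01:10:26"))  # True
```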
ID: 42849


©2024 CERN