Message boards : CMS Application : CMS Tasks Failing

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42731 - Posted: 1 Jun 2020, 11:12:29 UTC - in response to Message 42707.  

hello Ivan,
once more many thanks for your thorough reply :-)

So all we can do is to keep our fingers crossed that one day CMS may work again!

Yes, I'm afraid so. It was hard enough to co-ordinate debugging between the various disparate sub-systems in the best of times, but with the current disruptions it is even more difficult. Currently, updates are being pushed to our WMAgent server every two days, which can lead to problems like Friday night's (admittedly it _is_ the testbed server, but some stability would be nice).
Fortunately, the update pushed this morning seems to have undone the error introduced on Friday, so I can monitor the job queues again. Unfortunately, the pending queues are still not being replenished, so there are still task failures due to lack of runnable jobs. Currently we have about 60 jobs out in the field (12 of which are mine...) so there must be enough machines running on autopilot to pick up a job when it becomes available.
ID: 42731
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42733 - Posted: 1 Jun 2020, 12:33:22 UTC - in response to Message 42731.  
Last modified: 3 Jun 2020, 8:13:02 UTC

Currently we have about 60 jobs out in the field (12 of which are mine...) so there must be enough machines running on autopilot to pick up a job when it becomes available.
I caught a job of yours with jobID 7427.
ID: 42733
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42794 - Posted: 3 Jun 2020, 21:38:46 UTC - in response to Message 42731.  

OK, some good news for once. The post-production jobs on T3_CH_CMSAtHome (Laurence's VMs) had stalled. It seems they may have been affected by some IT problems at CERN last week; he updated the VM images this afternoon and now post-prod is running again.
More pertinently for the volunteers, we identified the problem that not enough jobs were being sent to the HTCondor queue -- a manual configuration tweak had been overlooked. That's been corrected and there are now several hundred jobs pending. As expected, this led to an upward spike in the number of running jobs since, for the moment at least, tasks aren't timing out and failing for want of a job from condor.
ID: 42794
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42796 - Posted: 4 Jun 2020, 12:03:19 UTC

Not normal behaviour:

06/04/20 13:31:33 (pid:11950) CCBListener: no activity from CCB server in 2387s; assuming connection is dead.
06/04/20 13:31:33 (pid:11950) CCBListener: connection to CCB server vocms0840.cern.ch failed; will try to reconnect in 60 seconds.
06/04/20 13:32:34 (pid:11950) WARNING: forward resolution of 85.156.138.137.in-addr.arpa doesn't match 137.138.156.85!
06/04/20 13:32:34 (pid:11950) CCBListener: registered with CCB server vocms0840.cern.ch as ccbid 137.138.156.85:9618?addrs=137.138.156.85-9618#2123061
06/04/20 13:51:33 (pid:11950) No reconnect from shadow for 1200 seconds, aborting job execution!
06/04/20 13:51:33 (pid:11950) ShutdownFast all jobs.
06/04/20 13:51:33 (pid:11950) Process exited, pid=13019, signal=9
06/04/20 13:51:33 (pid:11950) Failed to send job exit status to shadow
06/04/20 13:51:33 (pid:11950) Last process exited, now Starter is exiting
06/04/20 13:51:33 (pid:11950) **** condor_starter (condor_STARTER) pid 11950 EXITING WITH STATUS 2

jobID 8626

followed by jobID 9347
ID: 42796
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 42797 - Posted: 4 Jun 2020, 13:59:04 UTC

Four of them are running fine for me (i7-4771, Win7 64-bit, VBox 5.2.42, BOINC 7.16.7).
They take up a lot of disk space, which is not a problem on my Linux machines. But this one has only 128 GB, so I need to increase it to run more.
ID: 42797
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2443
Credit: 230,717,753
RAC: 122,403
Message 42798 - Posted: 4 Jun 2020, 14:15:12 UTC

Didn't see a typical CMS error throughout the whole day.
Just a "postponed ..." error on one of my hosts that was running VirtualBox 6.1.8.
All others are running fine with VirtualBox 6.0.22.
ID: 42798
Erich56

Joined: 18 Dec 15
Posts: 1712
Credit: 106,235,067
RAC: 73,859
Message 42799 - Posted: 6 Jun 2020, 12:26:50 UTC

The ones I have downloaded since yesterday worked out fine, on VirtualBox 6.1.2.
ID: 42799
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42807 - Posted: 7 Jun 2020, 13:13:26 UTC - in response to Message 42796.  

That looks like the behaviour I saw when I tried to run a task on my Windows 10 laptop at home, with my very limited network bandwidth (using a 4G WiFi modem in a mobile not-spot). Connection to CERN dropped out continually, and a couple of jobs just stopped partway through, but then a new job was obtained and started successfully.
ID: 42807
PaoloNasca

Joined: 11 Jul 19
Posts: 6
Credit: 1,636,611
RAC: 477
Message 42809 - Posted: 7 Jun 2020, 19:39:04 UTC

Just to share and comment on what happened to two failed CMS WUs.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=276498495
https://lhcathome.cern.ch/lhcathome/result.php?resultid=276571790

The VM was in the "Running" state for 12 minutes (from 00:14:57 to 00:26:51).
Then the VM was in the "Paused" state for 18 hours (66684 seconds, from 00:26:51 to 18:38:07).
After that, the VM was in the "Running" state again for 12 minutes.
At 18:50:23, the log shows the error: "Condor ended after 66684 seconds".

I'm really surprised.
18 hours in the "Paused" state makes no sense.
Let me explain better:
after 18 hours of inactivity, some procedures could run into time-out issues.

<core_client_version>7.16.7</core_client_version>
<![CDATA[
<message>
Funzione non corretta.
(0x1) - exit code 1 (0x1)</message>
<stderr_txt>
2020-06-07 00:14:39 (12788): Detected: vboxwrapper 26197
2020-06-07 00:14:39 (12788): Detected: BOINC client v7.7
2020-06-07 00:14:40 (12788): Detected: VirtualBox VboxManage Interface (Version: 5.2.8)
...omitted....
2020-06-07 00:14:57 (12788): VM state change detected. (old = 'PoweredOff', new = 'Running')
...omitted....
2020-06-07 00:19:02 (12788): Guest Log: [DEBUG] HTCondor ping
2020-06-07 00:19:03 (12788): Guest Log: [DEBUG] 0
2020-06-07 00:26:51 (12788): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 18:38:07 (12788): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 18:38:08 (12788): Guest Log: 00:10:40.055696 timesync vgsvcTimeSyncWorker: Radical host time change: 65 481 854 000 000ns (HostNow=1 591 547 868 663 000 000 ns HostLast=1 591 482 386 809 000 000 ns)
2020-06-07 18:38:18 (12788): Guest Log: 00:10:50.056476 timesync vgsvcTimeSyncWorker: Radical guest time change: 65 470 267 403 000ns (GuestNow=1 591 547 878 663 789 000 ns GuestLast=1 591 482 408 396 386 000 ns fSetTimeLastLoop=true )
2020-06-07 18:50:23 (12788): Guest Log: [ERROR] Condor ended after 66684 seconds.
2020-06-07 18:50:23 (12788): Guest Log: [INFO] Shutting Down.
ID: 42809
PaoloNasca

Joined: 11 Jul 19
Posts: 6
Credit: 1,636,611
RAC: 477
Message 42810 - Posted: 8 Jun 2020, 0:12:52 UTC

Only one CMS WU ended successfully.

I'd like to share a thought with Ivan and the developing team.

The WU ends successfully if the Elapsed Time is less than the Job Duration (64800 sec = 18 hours).

My conclusion is: the VM has to stay in the "Running" state continuously, without any break/suspension.
What do you think about building a VM with 2 or more CPUs?
The Elapsed Time would be inversely proportional to the CPU power.

Here below an extract from
https://lhcathome.cern.ch/lhcathome/result.php?resultid=276449837

….omitted….
2020-06-07 01:03:35 (12068): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 01:08:35 (12068): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 01:20:09 (12068): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 01:29:16 (12068): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 02:01:05 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 02:01:05 (12068): Status Report: Elapsed Time: '6000.992973'
2020-06-07 02:01:05 (12068): Status Report: CPU Time: '5208.796875'
2020-06-07 03:41:48 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 03:41:48 (12068): Status Report: Elapsed Time: '12000.992973'
2020-06-07 03:41:48 (12068): Status Report: CPU Time: '10897.078125'
2020-06-07 05:22:14 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 05:22:14 (12068): Status Report: Elapsed Time: '18000.992973'
2020-06-07 05:22:14 (12068): Status Report: CPU Time: '16861.296875'
2020-06-07 07:02:59 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 07:02:59 (12068): Status Report: Elapsed Time: '24000.992973'
2020-06-07 07:02:59 (12068): Status Report: CPU Time: '22545.656250'
2020-06-07 08:43:26 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 08:43:26 (12068): Status Report: Elapsed Time: '30001.494833'
2020-06-07 08:43:26 (12068): Status Report: CPU Time: '28544.906250'
2020-06-07 10:09:12 (12068): VM state change detected. (old = 'Running', new = 'Paused')
2020-06-07 10:16:02 (12068): VM state change detected. (old = 'Paused', new = 'Running')
2020-06-07 10:30:53 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 10:30:53 (12068): Status Report: Elapsed Time: '36001.494833'
2020-06-07 10:30:53 (12068): Status Report: CPU Time: '34246.296875'
2020-06-07 12:11:26 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 12:11:26 (12068): Status Report: Elapsed Time: '42001.494833'
2020-06-07 12:11:26 (12068): Status Report: CPU Time: '39958.921875'
2020-06-07 13:52:11 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 13:52:11 (12068): Status Report: Elapsed Time: '48001.494833'
2020-06-07 13:52:11 (12068): Status Report: CPU Time: '45994.109375'
2020-06-07 15:32:39 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 15:32:39 (12068): Status Report: Elapsed Time: '54001.494833'
2020-06-07 15:32:39 (12068): Status Report: CPU Time: '48189.859375'
2020-06-07 17:12:49 (12068): Status Report: Job Duration: '64800.000000'
2020-06-07 17:12:49 (12068): Status Report: Elapsed Time: '60001.494833'
2020-06-07 17:12:49 (12068): Status Report: CPU Time: '48252.187500'
2020-06-07 18:32:55 (12068): Powering off VM.
ID: 42810
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42811 - Posted: 8 Jun 2020, 8:58:31 UTC - in response to Message 42810.  

What do you think about building a VM with 2 or more CPUs?
The task will not run twice as fast.
The CMS-VM is set up so that when the VM has 2 cores, it runs 2 CMS jobs inside the VM, so it takes about as long as a single-core VM while getting through 2 jobs on 2 cores.
Did you have a look into the Console (ALT-F2) or Graphics (logfile) to see how many events were processed before the task was killed?
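To illustrate the point above: the VM runs one job per core rather than splitting a single job across cores, so adding cores raises throughput, not speed. A rough sketch (hypothetical numbers and function name, not project code; 2.75 h is just a mid-range job runtime):

```python
# Rough model of why adding cores doesn't shorten a CMS task: the VM
# runs one job per core; it never splits one job across cores.
# (Hypothetical sketch, not project code.)
import math

def wall_time_hours(n_jobs, n_cores, job_runtime_h=2.75):
    """Wall time for a batch: each core works through its own share of
    jobs, so a single job always takes the full job_runtime_h."""
    jobs_per_core = math.ceil(n_jobs / n_cores)
    return jobs_per_core * job_runtime_h

# One job on one core and one job per core on two cores take the
# same wall time:
# wall_time_hours(1, 1) == wall_time_hours(2, 2)
```

So a 2-core VM that must process 2 jobs finishes no sooner than a 1-core VM processing 1 job; only the number of jobs completed per day improves.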
ID: 42811
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2443
Credit: 230,717,753
RAC: 122,403
Message 42812 - Posted: 8 Jun 2020, 9:05:26 UTC - in response to Message 42810.  

What you see is the correct behavior.


CMS tasks are an envelope to run "jobs", also called "subtasks".
Typical runtimes per job are roughly 2.5-3 h.

CMS tasks are designed to run for 12 h.
A task hitting this limit will finish its running job and then finish itself.

The 18 h you see is a watchdog limit.
Any task hitting this limit will shut down regardless of the process state inside the VM.
This avoids tasks running forever if something inside the VM is trapped in an endless loop.
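As a rough illustration of how the two limits interact (hypothetical Python, not project code; 2.75 h is simply a mid-range value of the 2.5-3 h job runtime quoted above):

```python
# Toy model of the two CMS task limits: a soft 12 h limit after which
# no new job is started, and an 18 h watchdog that kills anything still
# alive. (Illustrative sketch, not project code.)
TASK_LIMIT_H = 12.0      # task stops starting new jobs after this
WATCHDOG_LIMIT_H = 18.0  # hard shutdown, whatever the VM is doing

def simulate_task(job_runtime_h=2.75):
    """Jobs run back to back; a job started before the 12 h mark is
    allowed to finish, and the watchdog ends anything alive at 18 h."""
    elapsed = 0.0
    jobs_done = 0
    while elapsed < TASK_LIMIT_H:
        elapsed += job_runtime_h
        jobs_done += 1
    hit_watchdog = elapsed >= WATCHDOG_LIMIT_H
    return jobs_done, elapsed, hit_watchdog
```

With 2.75 h jobs the task finishes 5 jobs in 13.75 h, comfortably inside the watchdog; the 18 h shutdown is only reached when a job hangs or the VM spends long stretches paused.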
ID: 42812
PaoloNasca

Joined: 11 Jul 19
Posts: 6
Credit: 1,636,611
RAC: 477
Message 42813 - Posted: 8 Jun 2020, 9:43:57 UTC - in response to Message 42811.  

Did you have a look into the Console (ALT-F2) or Graphics (logfile) to see how many events were processed before the task was killed?


I don’t know how to check what you asked about.
Please give me more info and I’ll be very happy and proud to help the development team.
Thanks for your time.
ID: 42813
PaoloNasca

Joined: 11 Jul 19
Posts: 6
Credit: 1,636,611
RAC: 477
Message 42814 - Posted: 8 Jun 2020, 10:02:26 UTC - in response to Message 42812.  

CMS tasks are designed to run 12h.
Tasks hitting this limit will finish the running job and then finish itself.
The 18h you see is a watchdog limit.


Do you mean the CMS tasks need a minimum CPU power?
Let me explain better.
The first CMS task that ended successfully took nearly 17 hours.
I think my PC isn’t suitable for the CMS@home project.
Your reply is really appreciated.
Thanks
ID: 42814
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1307
Credit: 8,688,493
RAC: 5,089
Message 42815 - Posted: 8 Jun 2020, 11:09:36 UTC - in response to Message 42813.  
Last modified: 8 Jun 2020, 11:11:13 UTC

I don’t know how to read what you asked me.
Please give me more info and I’ll be very happy and proud to help the development team.
The current CMS jobs running inside the VM have to process 10,000 events each; the next job only starts if there is time left within the first 12 hours.
To see the progress you may use BOINC Manager: when you highlight a running CMS task, you'll find a 'Show graphics' button on the left below 'Commands'.
That gives you access to a logfile inside the VM.
For the other option I mentioned (Console), you have to install the VirtualBox Extension Pack matching the VirtualBox version you are using. You can get it from virtualbox.org.

Btw: give a CMS task 100% of the CPU - no throttling.
ID: 42815
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42817 - Posted: 8 Jun 2020, 12:07:08 UTC - in response to Message 42810.  

What do you think about building a VM with 2 or more CPUs?
The Elapsed Time would be inversely proportional to the CPU power.

We've done that on CMS@Home-dev, but instead of one task using N CPUs for one job, the VM runs N jobs -- unfortunately, not all CMS jobs can run thread-safely (a particular concern with my other hat on, as a developer on the Phase-2 Level-1 Track-Trigger project). In my experience, there was enough overhead (staggering start-ups so as not to overload the internet connection, jobs finishing early and leaving their CPU idle while the other job(s) complete, etc.) that the optimum number of CPUs was N=2.
ID: 42817
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42839 - Posted: 11 Jun 2020, 22:30:53 UTC - in response to Message 42812.  

What you see is the correct behavior.


CMS tasks are an envelope to run "jobs", also called "subtasks".
Typical runtimes per job are roughly 2.5-3 h.

CMS tasks are designed to run for 12 h.
A task hitting this limit will finish its running job and then finish itself.

The 18 h you see is a watchdog limit.
Any task hitting this limit will shut down regardless of the process state inside the VM.
This avoids tasks running forever if something inside the VM is trapped in an endless loop.

There is another time-out to consider, which should only come into play if you suspend BOINC (presumably to shut down your computer -- if you do so, please make sure all BOINC programmes finish before switching off!). WMAgent kills a "running" job after 47 hours of "wall-time" (i.e. real-world elapsed time); if your task is suspended, HTCondor and WMAgent don't know that and assume its job is still running.
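A minimal sketch of why suspension is dangerous here (hypothetical function; WMAgent's bookkeeping is server-side and cannot see that the VM is paused):

```python
# Toy model of the WMAgent wall-time limit described above: hours spent
# suspended still count, because the server only sees elapsed time.
# (Hypothetical sketch, not WMAgent code.)
WMAGENT_WALL_LIMIT_H = 47  # WMAgent kills a "running" job after this

def job_killed_by_wmagent(active_hours, suspended_hours):
    """Wall-time is real-world elapsed time, so a long suspension can
    push even a short job over the limit."""
    return active_hours + suspended_hours > WMAGENT_WALL_LIMIT_H
```

A 3 h job easily survives a short pause, but the same job suspended for two days is killed server-side even though it did nothing wrong locally.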
ID: 42839
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42840 - Posted: 11 Jun 2020, 22:33:26 UTC - in response to Message 42807.  

That looks like the behaviour I saw when I tried to run a task on my Windows 10 laptop at home, with my very limited network bandwidth (using a 4G WiFi modem in a mobile not-spot). Connection to CERN dropped out continually, and a couple of jobs just stopped partway through, but then a new job was obtained and started successfully.

Is this (jobs stopping part-way through, but the task continuing) still being observed? I tried to provoke it last night but everything seemed to run tickety-boo.
ID: 42840
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42841 - Posted: 12 Jun 2020, 1:27:41 UTC - in response to Message 42840.  
Last modified: 12 Jun 2020, 1:28:34 UTC

That looks like the behaviour I saw when I tried to run a task on my Windows 10 laptop at home, with my very limited network bandwidth (using a 4G WiFi modem in a mobile not-spot). Connection to CERN dropped out continually, and a couple of jobs just stopped partway through, but then a new job was obtained and started successfully.

Is this (jobs stopping part-way through, but the task continuing) still being observed? I tried to provoke it last night but everything seemed to run tickety-boo.

Hah, it looks like I caught one tonight! I set up a script to capture the log files every five minutes, and the first job stopped at around 9,500 events before the task then started a new job successfully. It's way past my bedtime; I'll leave it until tomorrow to cross-reference the logs and try to find out what went awry.
ID: 42841
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,335,613
RAC: 2,115
Message 42849 - Posted: 12 Jun 2020, 15:22:13 UTC - in response to Message 42841.  

That looks like the behaviour I saw when I tried to run a task on my Windows 10 laptop at home, with my very limited network bandwidth (using a 4G WiFi modem in a mobile not-spot). Connection to CERN dropped out continually, and a couple of jobs just stopped partway through, but then a new job was obtained and started successfully.

Is this (jobs stopping part-way through, but the task continuing) still being observed? I tried to provoke it last night but everything seemed to run tickety-boo.

Hah, it looks like I caught one tonight! I set up a script to capture the log files every five minutes, and the first job stopped at around 9,500 events before the task then started a new job successfully. It's way past my bedtime; I'll leave it until tomorrow to cross-reference the logs and try to find out what went awry.

OK, here's the analysis. Executive summary: jobs are aborted if loss of communication exceeds 20 minutes, but the task does not see this as a failure.

The logs can be seen at https://www.brunel.ac.uk/~eesridr/CMSAtHome_13_23_56/logs/. The output from the jobs is seen in finished_[1-6].log and running.log (the last job to run). The StartLog, StarterLog and MasterLog are the respective condor logs. Note that finished_1.log is smaller than the others; this is because it only generated 9,358 events instead of the requested 10,000:
Begin processing the 9358th record. Run 1, Event 30609358, LumiSection 61219 on stream 0 at 12-Jun-2020 00:10:26.443 UTC

The BOINC report on this task is at
https://lhcathome.cern.ch/lhcathome/result.php?resultid=277290914
. It shows nothing exceptional for a task which completed normally.
It can be seen in the condor logs that there are several breaks in communication back to CERN, e.g.
06/12/20 02:16:29 (pid:1519) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <137.138.52.94:4080>.
06/12/20 02:16:29 (pid:1519) IO: Failed to read packet header
06/12/20 02:16:29 (pid:1519) Lost connection to shadow, waiting 1200 secs for reconnect
06/12/20 02:17:36 (pid:1519) Accepted request to reconnect from <:0>
06/12/20 02:17:36 (pid:1519) Ignoring old shadow <137.138.52.94:4080?addrs=137.138.52.94-4080&alias=vocms0267.cern.ch&noUDP&sock=shadow_2424955_eb4b_15537>
06/12/20 02:17:36 (pid:1519) Communicating with shadow <137.138.52.94:4080?addrs=137.138.52.94 4080&alias=vocms0267.cern.ch&noUDP&sock=shadow_2424955_eb4b_15537>
06/12/20 02:17:36 (pid:1519) Recovered connection to shadow after 67 seconds

I'm not sure how many of these network problems are due to my limited and patchy network connection at home, but similar problems have been seen elsewhere; I know that there are other volunteers who also have limited networking. In the case of the short-running job, the failure occurred about 00:10:26.443 UTC, or 01:10:26 local. In StarterLog we find:
06/12/20 00:50:26 (pid:20894) IO: Failed to read packet header
06/12/20 00:50:26 (pid:20894) Lost connection to shadow, waiting 1200 secs for reconnect
06/12/20 01:10:26 (pid:20894) No reconnect from shadow for 1200 seconds, aborting job execution!
06/12/20 01:10:26 (pid:20894) ShutdownFast all jobs.
06/12/20 01:10:26 (pid:20894) Process exited, pid=28703, signal=9
06/12/20 01:10:26 (pid:20894) Failed to send job exit status to shadow
06/12/20 01:10:26 (pid:20894) Last process exited, now Starter is exiting
so this short job is caused by a failure to reconnect within 20 minutes, as surmised.
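The timestamps in the two log excerpts can be checked directly (a throwaway script written for this thread, assuming the `%m/%d/%y %H:%M:%S` date format the logs use; the real watchdog lives inside condor_starter):

```python
# Check the shadow-reconnect watchdog against the log timestamps above:
# a job is aborted if the shadow does not reconnect within 1200 s.
# (Hypothetical helper, not HTCondor code.)
from datetime import datetime

RECONNECT_WINDOW_S = 1200  # "waiting 1200 secs for reconnect"

def outage_fatal(lost_at, back_at, fmt="%m/%d/%y %H:%M:%S"):
    """True when the gap between losing the shadow and regaining contact
    reaches the reconnect window, i.e. the job would be aborted."""
    gap = datetime.strptime(back_at, fmt) - datetime.strptime(lost_at, fmt)
    return gap.total_seconds() >= RECONNECT_WINDOW_S

# The 02:16:29 outage recovered after 67 s; the 00:50:26 one ran the
# full 20 minutes before the abort at 01:10:26:
print(outage_fatal("06/12/20 02:16:29", "06/12/20 02:17:36"))  # False
print(outage_fatal("06/12/20 00:50:26", "06/12/20 01:10:26"))  # True
```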
ID: 42849


©2024 CERN