Message boards : CMS Application : CMS Tasks Failing


computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2451
Credit: 233,719,417
RAC: 151,947
Message 34154 - Posted: 29 Jan 2018, 21:01:30 UTC - in response to Message 34150.  

Hello Ivan, as the problems persist and every day some tasks fail with

2018-01-28 12:26:46 (7296): Guest Log: 01/28/18 12:28:29 attempt to connect to <128.142.142.167:9618> failed: Connection timed out (connect errno = 110).
2018-01-28 12:26:46 (7296): Guest Log: ERROR: failed to make connection to <128.142.142.167:9618>
2018-01-28 12:26:47 (7296): Guest Log: [ERROR] Could not ping HTCondor.
this afternoon I had 4 tasks fail in a row :-(

@ Erich56

Take a look into the error logs of your hosts.

These examples are from:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10388905&offset=0&show_names=0&state=6&appid=11
2018-01-29 16:29:41 (6808): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:41 (6808): Guest Log: [DEBUG] 1
2018-01-29 16:29:41 (6808): Guest Log: [ERROR] Could not connect to Condor server on port 9618



2018-01-29 16:29:41 (3812): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:42 (3812): Guest Log: [DEBUG] 1
2018-01-29 16:29:42 (3812): Guest Log: [ERROR] Could not connect to Condor server on port 9618



2018-01-29 16:29:43 (9328): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:43 (9328): Guest Log: [DEBUG] 1
2018-01-29 16:29:43 (9328): Guest Log: [ERROR] Could not connect to Condor server on port 9618



2018-01-29 16:29:50 (8812): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:50 (8812): Guest Log: [DEBUG] 1
2018-01-29 16:29:50 (8812): Guest Log: [ERROR] Could not connect to Condor server on port 9618



It seems that you started too many VMs concurrently and did not allow them to reach a stable phase after startup.
After starting a VM you should wait at least until a successful Condor ping, better a bit longer, before you start the next VM.
It would also help not to saturate the host at 100% CPU usage. A limit of 75-80% would probably be much more stable.
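The staggered-start advice above can be sketched as a small check: poll the Condor port (9618, as in the logs) and only start the next VM once it answers. This is a hypothetical helper, not part of BOINC or the LHC@home tooling; the host name and retry timings are assumptions taken from the log excerpts.

```python
import socket
import time

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_for_condor(host="vccondor01.cern.ch", port=9618, retries=10, delay=30):
    """Poll the Condor port; return True once it is reachable.

    Intended use: call this after starting one VM, and only start the
    next VM when it returns True.
    """
    for _ in range(retries):
        if port_open(host, port):
            return True
        time.sleep(delay)
    return False
```

This is essentially what the `nc` probe in the guest logs above does from inside the VM.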
ID: 34154
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 817
Credit: 662,874,410
RAC: 196,627
Message 34155 - Posted: 29 Jan 2018, 21:18:19 UTC - in response to Message 34133.  

I checked today; I'm at a similar %. Of those that failed, 75% are Error 152 (could not connect to Condor) or 206 (cannot ping Condor).

There were a few where the VM could be started but a heartbeat was missing.

Just for reference:

LHCb = 1.49%
ATLAS = 1.23%
Theory = 26.82% :( = seems to be mostly because there was no work in the queue, or the same 152 errors.
SixTrack = 0.19% :)
ID: 34155
Erich56

Joined: 18 Dec 15
Posts: 1723
Credit: 108,003,565
RAC: 75,091
Message 34171 - Posted: 31 Jan 2018, 8:15:02 UTC - in response to Message 34154.  

computezrmle wrote:
@ Erich56

Take a look into the error logs of your hosts.

This examples are from:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10388905&offset=0&show_names=0&state=6&appid=11
2018-01-29 16:29:41 (6808): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:41 (6808): Guest Log: [DEBUG] 1
2018-01-29 16:29:41 (6808): Guest Log: [ERROR] Could not connect to Condor server on port 9618
...
It seems that you started too many VMs concurrently and did not allow them to reach a stable phase after startup.
After starting a VM you should wait at least until a successful Condor ping, better a bit longer, before you start the next VM.
It would also help not to saturate the host at 100% CPU usage. A limit of 75-80% would probably be much more stable.

Thanks for your thoughts and hints!

The way the VMs are started is not something I can easily influence. Once one of the running tasks finishes and is uploaded, the next one starts. So sometimes there is a long time in between, but sometimes only a short time.

As for CPU usage: besides the 2 GPUGRID tasks, I have 8 CMS tasks running. So the 12-core CPU is at a maximum use of about 86%, and often at about 77% (or even less) when one or more CMS tasks are uploading interim results and using no CPU at that time.

Well, what I could do is reduce the number of concurrent CMS tasks by one or two, to see whether the Condor server problem still occurs.
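One way to limit the number of concurrent CMS tasks without touching global preferences is an `app_config.xml` in the project directory (then restart the client or re-read config files). This is only a sketch: the app name `CMS_2016` is an assumption and must match the `<name>` your own client reports, and 6 is just an example limit.

```xml
<app_config>
  <app>
    <!-- app name is an assumption; check your client's event log -->
    <name>CMS_2016</name>
    <!-- run at most this many CMS tasks at once -->
    <max_concurrent>6</max_concurrent>
  </app>
</app_config>
```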
ID: 34171
rbpeake

Joined: 17 Sep 04
Posts: 101
Credit: 31,816,883
RAC: 24,456
Message 38478 - Posted: 28 Mar 2019, 20:01:29 UTC

Do the CMS units fail if BOINC is exited and started again? I had to reboot my computer today and yesterday.
Thanks.
https://lhcathome.cern.ch/lhcathome/results.php?userid=1080&offset=0&show_names=0&state=6&appid=11
Regards,
Bob P.
ID: 38478
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2451
Credit: 233,719,417
RAC: 151,947
Message 38481 - Posted: 28 Mar 2019, 20:58:49 UTC - in response to Message 38478.  

Do the CMS units fail if BOINC is exited and started again? I had to reboot my computer today and yesterday.
Thanks.
https://lhcathome.cern.ch/lhcathome/results.php?userid=1080&offset=0&show_names=0&state=6&appid=11

The link you placed in your post points to a page that other users are not allowed to view.
Links should point to a computer page, the task list of a specific computer, or a specific task.


I guess it was this task?
https://lhcathome.cern.ch/lhcathome/result.php?resultid=220149107


There were 2 restarts within a few minutes.
The VM recovered during the 1st restart but not during the 2nd.
Most likely there was not enough time to update the heartbeat timestamp.
This is why the VM finally failed.
2019-03-28 13:50:51 (21584): Stopping VM.
.
.
.
2019-03-28 13:54:38 (21228): Successfully started VM. (PID = '7684')
2019-03-28 13:54:38 (21228): Reporting VM Process ID to BOINC.
2019-03-28 13:54:38 (21228): VM state change detected. (old = 'PoweredOff', new = 'Running')
.
.
.
2019-03-28 14:05:44 (21228): Stopping VM.
.
.
.
2019-03-28 14:09:01 (11328): Successfully started VM. (PID = '13268')
2019-03-28 14:09:01 (11328): Reporting VM Process ID to BOINC.
2019-03-28 14:09:01 (11328): VM state change detected. (old = 'PoweredOff', new = 'Running')
.
.
.
2019-03-28 14:49:04 (11328): VM Heartbeat file specified, but missing heartbeat.
.
.
.
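As a rough sketch of the failure mode described above: a VM counts as failed when its heartbeat file has not been updated for some maximum age, and a freshly restarted VM needs a grace period before the check is meaningful. The threshold and grace values here are hypothetical; the actual vboxwrapper logic is not shown in this thread.

```python
def vm_failed(last_heartbeat, vm_started, now, max_age=600, grace=300):
    """Heartbeat staleness check (all arguments in seconds).

    A VM counts as failed when its heartbeat is older than max_age,
    unless it restarted recently and is still within a grace period
    during which it may not have written a fresh heartbeat yet.
    """
    if now - vm_started < grace:
        return False  # just restarted, give it time to recover
    return (now - last_heartbeat) > max_age
```

In the log above, the second restart apparently left the old heartbeat timestamp in place long enough for the check to fire.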
ID: 38481
Erich56

Joined: 18 Dec 15
Posts: 1723
Credit: 108,003,565
RAC: 75,091
Message 38482 - Posted: 28 Mar 2019, 20:59:12 UTC

Bob, for some reason your link does not open (access denied), so I cannot see the details (I guess you do not allow other people to see your tasks in your web settings). You might wish to change this setting.

Re your question: the tasks normally do NOT fail if, after exiting BOINC, you let the VM shut down properly within about 1 or 2 minutes. Only after that should the computer be shut off and switched on again.
Try it this way, and it should work.
ID: 38482
rbpeake

Joined: 17 Sep 04
Posts: 101
Credit: 31,816,883
RAC: 24,456
Message 38483 - Posted: 28 Mar 2019, 21:27:14 UTC - in response to Message 38481.  

Thanks!
Regards,
Bob P.
ID: 38483
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2451
Credit: 233,719,417
RAC: 151,947
Message 38570 - Posted: 18 Apr 2019, 10:06:13 UTC

https://lhcathome.cern.ch/lhcathome/cms_job.php
Nothing but errors since yesterday.
I set my hosts to NNT for CMS.
ID: 38570
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1023
Credit: 6,394,342
RAC: 2,446
Message 38571 - Posted: 18 Apr 2019, 12:34:01 UTC - in response to Message 38570.  

https://lhcathome.cern.ch/lhcathome/cms_job.php
Nothing but errors since yesterday.
I set my hosts to NNT for CMS.

Oops, thanks for reporting that. I'll try to work out what's wrong.
ID: 38571
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1023
Credit: 6,394,342
RAC: 2,446
Message 38572 - Posted: 18 Apr 2019, 13:05:20 UTC - in response to Message 38571.  

Quick answer: T3_CH_Volunteer tasks and jobs seem to be running OK. The problem seems to be further down the line. https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php suggests that the post-processing steps are not being run on T3_CH_CMSAtHome, and the latest merged result file was written to EOS at 1830 CERN time last night. There have been some problems with the EOS file system in the last 24 hours, whether or not this is a consequence I cannot tell. The graphs suggest a Dashboard problem too.
I'll send a few e-mails; my University has a week off over Easter, starting yesterday -- I'm not sure what holidays CERN has at this time.
ID: 38572
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1023
Credit: 6,394,342
RAC: 2,446
Message 38573 - Posted: 18 Apr 2019, 13:39:24 UTC - in response to Message 38572.  
Last modified: 18 Apr 2019, 13:47:28 UTC

Laurence reports that the T3_CH_CMSAtHome cluster seems fine, so we have a deeper problem than that. The job graphs suggest that there are stage-out problems writing results to the DataBridge at CERN, but if so the error is not being propagated to kill BOINC tasks (all my completed tasks show no errors). I'm having some trouble seeing recent result files on DataBridge, but that seems to date back to March 23rd, so it is probably another problem completely.
[Edit] Result files are being written to earlier subdirectories on DataBridge, why I have no idea. It makes it harder to determine when the files were written. [/Edit]
ID: 38573
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 38574 - Posted: 18 Apr 2019, 14:14:01 UTC

On a more prosaic level, I have several CMS tasks waiting to run. They were downloaded about a day ago. Should I hold them, or allow them to run?
Thanks for your work. The problems have been remarkably few.
ID: 38574
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1023
Credit: 6,394,342
RAC: 2,446
Message 38579 - Posted: 18 Apr 2019, 18:33:53 UTC - in response to Message 38574.  

On a more prosaic level, I have several CMS tasks waiting to run. They were downloaded about a day ago. Should I hold them, or allow them to run?
Thanks for your work. The problems have been remarkably few.

As far as you are concerned, it probably makes no difference -- it seems that the BOINC tasks see completion and award credit. On the other hand, CMS is not seeing the results of any jobs, so from our point of view that's a negative. I'd be inclined to hold them (tho' I'm letting my own machines continue to run, purely for the purpose of gathering intelligence). I'm starting to see signs that perhaps a certificate has expired, leading to write permissions for the result and log files being denied. Given the time of year, that may not be fixed before next Tuesday at the earliest... (and I'm away at a conference Wed and Thurs, which will limit my involvement those two days as I currently only have a tablet, no laptop.)
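A quick way to test the expired-certificate hypothesis is `openssl x509 -enddate -noout -in <cert>`, which prints a `notAfter=` line; the check itself is just a date comparison, as in this sketch. The function name and usage are illustrative, not part of any CMS tooling; the date format is OpenSSL's standard `notAfter` output.

```python
from datetime import datetime, timezone

def cert_expired(not_after, now=None):
    """Parse an OpenSSL 'notAfter=' line and compare it to now.

    Expects the format printed by `openssl x509 -enddate -noout`,
    e.g. 'notAfter=Apr 18 12:00:00 2019 GMT'.
    """
    date_str = not_after.split("=", 1)[-1].strip()
    expiry = datetime.strptime(date_str, "%b %d %H:%M:%S %Y %Z")
    expiry = expiry.replace(tzinfo=timezone.utc)  # notAfter is GMT
    if now is None:
        now = datetime.now(timezone.utc)
    return now > expiry
```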
ID: 38579
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 38580 - Posted: 18 Apr 2019, 18:57:58 UTC - in response to Message 38579.  

As far as you are concerned, it probably makes no difference -- it seems that the BOINC tasks see completion and award credit. On the other hand, CMS is not seeing the results of any jobs, so from our point of view that's a negative.

I will put them on hold.
(What is "credit"?).
ID: 38580
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1023
Credit: 6,394,342
RAC: 2,446
Message 38584 - Posted: 19 Apr 2019, 11:20:11 UTC - in response to Message 38580.  


(What is "credit"?).

For you, it's 7,247,535 :-)
ID: 38584
Erich56

Joined: 18 Dec 15
Posts: 1723
Credit: 108,003,565
RAC: 75,091
Message 39154 - Posted: 19 Jun 2019, 16:39:16 UTC

This afternoon I got quite a number of failures within about 45 seconds of the start of a CMS task.

Examples are:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=233584117, or
https://lhcathome.cern.ch/lhcathome/result.php?resultid=233559195

excerpt from the stderr:
Error in storage attach (fixed disk) for VM: -2135228409
...
VBoxManage.exe: error: Medium 'D:\BOINC DATA\slots\6\vm_image.vdi' is not accessible

what's going wrong?
ID: 39154
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1023
Credit: 6,394,342
RAC: 2,446
Message 39158 - Posted: 20 Jun 2019, 8:52:52 UTC - in response to Message 39154.  

This afternoon I got quite a number of failures within about 45 seconds of the start of a CMS task.

Examples are:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=233584117, or
https://lhcathome.cern.ch/lhcathome/result.php?resultid=233559195

excerpt from the stderr:
Error in storage attach (fixed disk) for VM: -2135228409
...
VBoxManage.exe: error: Medium 'D:\BOINC DATA\slots\6\vm_image.vdi' is not accessible

what's going wrong?

A bit further down:
Another VirtualBox management application has locked the session for this VM. BOINC cannot properly monitor this VM and so this job will be aborted.
2019-06-19 06:42:11 (7116): Could not create VM
2019-06-19 06:42:11 (7116): ERROR: VM failed to start
2019-06-19 06:42:16 (7116):
NOTE: VM session lock error encountered.
BOINC will be notified that it needs to clean up the environment.
This might be a temporary problem and so this job will be rescheduled for another time.


Check that you don't have a stalled job somewhere.
ID: 39158
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1023
Credit: 6,394,342
RAC: 2,446
Message 39172 - Posted: 23 Jun 2019, 18:11:19 UTC

Hmm, something's gone wrong. A new batch of jobs that I submitted yesterday has not appeared. I've just sent in a second batch while I try to work out what the problem might be.
ID: 39172
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1023
Credit: 6,394,342
RAC: 2,446
Message 39189 - Posted: 26 Jun 2019, 10:18:54 UTC - in response to Message 39172.  
Last modified: 26 Jun 2019, 10:20:51 UTC

Hmm, something's gone wrong. A new batch of jobs that I submitted yesterday has not appeared. I've just sent in a second batch while I try to work out what the problem might be.

OK, I tried submission again today and it failed because the task name was > 50 characters! If this happened at the weekend, I didn't notice, because the final message was:
Injected 0 workflows out of 1 templates. Good job!
That's easy to overlook...
I modified the JSON script and a new submission worked. My machines are picking up jobs again.
I've also modified the submission script so that it doesn't give the above message if no jobs were submitted successfully.
Of course, this was almost certainly due to changes that were made to the WMAgent software during interventions at the end of last week.
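The two failure modes described here, a task name over the length limit being silently dropped and a success message printed even when zero workflows were injected, can be sketched as follows. The 50-character limit comes from the post; the function names are hypothetical and not the actual WMAgent submission code.

```python
MAX_TASKNAME_LEN = 50  # limit reported in this thread

def validate_taskname(name):
    """Fail loudly on task names the injection step would silently drop."""
    if len(name) > MAX_TASKNAME_LEN:
        raise ValueError(
            f"task name is {len(name)} chars, limit is {MAX_TASKNAME_LEN}: {name!r}")
    return name

def report(injected, templates):
    """Only celebrate when at least one workflow was actually injected."""
    msg = f"Injected {injected} workflows out of {templates} templates."
    if injected == 0:
        return msg + " ERROR: nothing was injected!"
    return msg + " Good job!"
```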
ID: 39189
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2451
Credit: 233,719,417
RAC: 151,947
Message 39208 - Posted: 27 Jun 2019, 18:55:48 UTC

ID: 39208



©2024 CERN