Message boards : CMS Application : CMS Tasks Failing
Send message Joined: 15 Jun 08 Posts: 2561 Credit: 256,894,612 RAC: 109,781 |
hello Ivan, as the problems persist and every day some tasks fail: this afternoon, I had 4 tasks failing in a row :-(

@Erich56: Take a look into the error logs of your hosts. These examples are from: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10388905&offset=0&show_names=0&state=6&appid=11

2018-01-29 16:29:41 (6808): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:41 (6808): Guest Log: [DEBUG] 1
2018-01-29 16:29:41 (6808): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2018-01-29 16:29:41 (3812): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:42 (3812): Guest Log: [DEBUG] 1
2018-01-29 16:29:42 (3812): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2018-01-29 16:29:43 (9328): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:43 (9328): Guest Log: [DEBUG] 1
2018-01-29 16:29:43 (9328): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2018-01-29 16:29:50 (8812): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:50 (8812): Guest Log: [DEBUG] 1
2018-01-29 16:29:50 (8812): Guest Log: [ERROR] Could not connect to Condor server on port 9618

It seems that you started too many VMs concurrently and did not allow them to reach a stable phase after startup. After starting a VM, wait at least until a successful Condor ping, better a bit longer, before you start the next one. It would also help not to saturate the host at 100% CPU usage; a limit of 75-80% would probably be much more stable. |
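The "wait for a successful Condor ping before starting the next VM" advice can be sketched in Python. The host and port come from the log excerpt above; the retry count and delay are illustrative assumptions, not values used by the actual wrapper:

```python
import socket
import time

def condor_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def wait_for_condor(host="vccondor01.cern.ch", port=9618,
                    retries=12, delay=10.0):
    """Poll the Condor port before starting the next VM; give up after retries.

    retries/delay are illustrative; tune them to your host.
    """
    for _ in range(retries):
        if condor_port_open(host, port):
            return True
        time.sleep(delay)
    return False
```

A launcher script could call `wait_for_condor()` between VM starts and skip (or postpone) the next start when it returns False.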
Send message Joined: 27 Sep 08 Posts: 853 Credit: 696,267,665 RAC: 140,503 |
I checked today and I'm at a similar percentage. Of the tasks that failed, 75% show Error 152 (Could not connect to Condor) or 206 (cannot ping Condor), plus a few where the VM could be started but a heartbeat was missing. Just for reference, failure rates per application: LHCb = 1.49%, ATLAS = 1.23%, Theory = 26.82% :( (seems mostly that there was no work in the queue, or the same 152 errors), SixTrack = 0.19% :) |
Send message Joined: 18 Dec 15 Posts: 1835 Credit: 120,674,980 RAC: 77,863 |
computezrmle wrote: @ Erich56 Thanks for your thoughts and hints! I cannot easily influence the way the VMs are started: once one of the running tasks finishes and uploads, the next one starts. So sometimes there is a long time in between, but sometimes only a short time. As for CPU usage: besides the 2 GPUGRID tasks, I have 8 CMS tasks running. So the 12-core CPU is at a maximum of about 86% usage, and often at about 77% (or even less) when one or more CMS tasks are uploading interim results and using no CPU at that time. Well, what I could do is reduce the number of concurrent CMS tasks by one or two, to see whether the Condor server problem then no longer occurs. |
Send message Joined: 17 Sep 04 Posts: 105 Credit: 32,929,051 RAC: 5,002 |
Do the CMS units fail if BOINC is exited and started again? I had to reboot my computer today and yesterday. Thanks. https://lhcathome.cern.ch/lhcathome/results.php?userid=1080&offset=0&show_names=0&state=6&appid=11 Regards, Bob P. |
Send message Joined: 15 Jun 08 Posts: 2561 Credit: 256,894,612 RAC: 109,781 |
Do the CMS units fail if BOINC is exited and started again? I had to reboot my computer today and yesterday.

The link you placed in your post points to a page that other users are not allowed to see. Links should point to a computer page, the task list of a specific computer, or a specific task. I guess it was this task? https://lhcathome.cern.ch/lhcathome/result.php?resultid=220149107

There were 2 restarts within a few minutes. The VM recovered during the 1st restart but not during the 2nd; most likely there was not enough time to update the heartbeat timestamp. This is why the VM finally failed.

2019-03-28 13:50:51 (21584): Stopping VM.
. . .
2019-03-28 13:54:38 (21228): Successfully started VM. (PID = '7684')
2019-03-28 13:54:38 (21228): Reporting VM Process ID to BOINC.
2019-03-28 13:54:38 (21228): VM state change detected. (old = 'PoweredOff', new = 'Running')
. . .
2019-03-28 14:05:44 (21228): Stopping VM.
. . .
2019-03-28 14:09:01 (11328): Successfully started VM. (PID = '13268')
2019-03-28 14:09:01 (11328): Reporting VM Process ID to BOINC.
2019-03-28 14:09:01 (11328): VM state change detected. (old = 'PoweredOff', new = 'Running')
. . .
2019-03-28 14:49:04 (11328): VM Heartbeat file specified, but missing heartbeat.
. . . |
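The failure mode described above (a heartbeat timestamp that goes stale after a quick restart) can be illustrated with a small Python check. The 600-second threshold and the idea of checking the file's mtime are assumptions for illustration; the real vboxwrapper limit and mechanism may differ:

```python
import os
import time

# Illustrative threshold; the real vboxwrapper heartbeat limit may differ.
HEARTBEAT_MAX_AGE = 600  # seconds

def heartbeat_stale(path, max_age=HEARTBEAT_MAX_AGE, now=None):
    """Return True if the heartbeat file is missing or its mtime is too old."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing heartbeat file counts as stale
    return (now - mtime) > max_age
```

If the VM is stopped and restarted before it gets a chance to touch the heartbeat file, the age check eventually trips and the task is failed, which matches the log above.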
Send message Joined: 18 Dec 15 Posts: 1835 Credit: 120,674,980 RAC: 77,863 |
Bob, for some reason your link does not open (access denied), so I cannot see the details. I guess that your web settings do not allow other people to see your tasks; you might wish to change this setting. Re your question: the tasks normally do NOT fail if, after exiting BOINC, you let the VM shut down properly within about 1 or 2 minutes. Only then should the computer be shut off and switched on again. Try it this way, and it should work. |
Send message Joined: 17 Sep 04 Posts: 105 Credit: 32,929,051 RAC: 5,002 |
Thanks! Regards, Bob P. |
Send message Joined: 15 Jun 08 Posts: 2561 Credit: 256,894,612 RAC: 109,781 |
https://lhcathome.cern.ch/lhcathome/cms_job.php Nothing but errors since yesterday. I set my hosts to NNT (No New Tasks) for CMS. |
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 8,121,338 RAC: 14,467 |
https://lhcathome.cern.ch/lhcathome/cms_job.php Oops, thanks for reporting that. I'll try to work out what's wrong. |
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 8,121,338 RAC: 14,467 |
Quick answer: T3_CH_Volunteer tasks and jobs seem to be running OK. The problem seems to be further down the line. https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php suggests that the post-processing steps are not being run on T3_CH_CMSAtHome, and the latest merged result file was written to EOS at 1830 CERN time last night. There have been some problems with the EOS file system in the last 24 hours, whether or not this is a consequence I cannot tell. The graphs suggest a Dashboard problem too. I'll send a few e-mails; my University has a week off over Easter, starting yesterday -- I'm not sure what holidays CERN has at this time. |
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 8,121,338 RAC: 14,467 |
Laurence reports that the T3_CH_CMSAtHome cluster seems fine, so we have a deeper problem than that. The job graphs suggest that there are stage-out problems writing results to the DataBridge at CERN, but if so, the error is not being propagated back to fail the BOINC tasks (all my completed tasks show no errors). I'm having some trouble seeing recent result files on DataBridge, but that seems to date back to March 23rd, so it is probably another problem entirely. [Edit] Result files are being written to earlier subdirectories on DataBridge; why, I have no idea. It makes it harder to determine when the files were written. [/Edit] |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
On a more prosaic level, I have several CMS waiting to run. They were downloaded about a day ago. Should I hold them, or allow them to run? Thanks for your work. The problems have been remarkably few. |
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 8,121,338 RAC: 14,467 |
On a more prosaic level, I have several CMS waiting to run. They were downloaded about a day ago. Should I hold them, or allow them to run? As far as you are concerned, it probably makes no difference -- it seems that the BOINC tasks see completion and award credit. On the other hand, CMS is not seeing the results of any jobs, so from our point of view that's a negative. I'd be inclined to hold them (tho' I'm letting my own machines continue to run, purely for the purpose of gathering intelligence). I'm starting to see signs that perhaps a certificate has expired, leading to write permissions for the result and log files being denied. Given the time of year, that may not be fixed before next Tuesday at the earliest... (and I'm away at a conference Wed and Thurs, which will limit my involvement those two days as I currently only have a tablet, no laptop.) |
Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
As far as you are concerned, it probably makes no difference -- it seems that the BOINC tasks see completion and award credit. On the other hand, CMS is not seeing the results of any jobs, so from our point of view that's a negative. I will put them on hold. (What is "credit"?). |
Send message Joined: 18 Dec 15 Posts: 1835 Credit: 120,674,980 RAC: 77,863 |
This afternoon, I've got quite a number of failures within about 45 seconds after the start of a CMS task. Examples: https://lhcathome.cern.ch/lhcathome/result.php?resultid=233584117 or https://lhcathome.cern.ch/lhcathome/result.php?resultid=233559195

Excerpt from the stderr:

Error in storage attach (fixed disk) for VM: -2135228409
...
VBoxManage.exe: error: Medium 'D:\BOINC DATA\slots\6\vm_image.vdi' is not accessible

What's going wrong? |
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 8,121,338 RAC: 14,467 |
This afternoon, I've got quite a number of failures within about 45 seconds after start of a CMS task.

A bit further down:

Another VirtualBox management application has locked the session for this VM. BOINC cannot properly monitor this VM and so this job will be aborted.
2019-06-19 06:42:11 (7116): Could not create VM
2019-06-19 06:42:11 (7116): ERROR: VM failed to start
2019-06-19 06:42:16 (7116): NOTE: VM session lock error encountered. BOINC will be notified that it needs to clean up the environment. This might be a temporary problem and so this job will be rescheduled for another time.

Check that you don't have a stalled job somewhere. |
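One way to look for a stalled session is to list the VMs VirtualBox still considers running and compare them against the slots you expect to be active. A small Python sketch of the parsing step, assuming the usual `"name" {uuid}` line format that `VBoxManage list runningvms` prints:

```python
import re
import subprocess

# Matches lines like: "boinc_f2a6bcd1" {12345678-1234-1234-1234-123456789abc}
VM_LINE = re.compile(r'^"(?P<name>[^"]+)" \{(?P<uuid>[0-9a-fA-F-]+)\}$')

def parse_vm_list(output):
    """Parse `VBoxManage list runningvms` output into (name, uuid) pairs."""
    vms = []
    for line in output.splitlines():
        m = VM_LINE.match(line.strip())
        if m:
            vms.append((m.group("name"), m.group("uuid")))
    return vms

def running_vms():
    """Ask VirtualBox for the VMs it currently considers running."""
    out = subprocess.run(["VBoxManage", "list", "runningvms"],
                         capture_output=True, text=True, check=True).stdout
    return parse_vm_list(out)
```

A leftover `boinc_...` entry that no active BOINC slot accounts for is a candidate for the stalled job that is holding the session lock.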
Send message Joined: 29 Aug 05 Posts: 1065 Credit: 8,121,338 RAC: 14,467 |
Hmm, something's gone wrong. A new batch of jobs that I submitted yesterday has not appeared. I've just sent in a second batch while I try to work out what the problem might be.

OK, I tried submission again today and it failed because the taskname was > 50 characters! If this happened at the weekend, I didn't notice, because the final message was: "Injected 0 workflows out of 1 templates. Good job!" That's easy to overlook... I modified the JSON script and a new submission worked; my machines are picking up jobs again. I've also modified the submission script so that it doesn't give the above message if no jobs were submitted successfully. Of course, this was almost certainly due to changes made to the WMAgent software during interventions at the end of last week. |
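The two fixes described above (reject over-long task names before submitting, and only print the success message when something was actually injected) can be sketched as follows. The 50-character limit comes from the post; the function names and message wording are illustrative, not the actual submission script:

```python
# Assumed limit, taken from the failure described above.
MAX_TASKNAME_LEN = 50

def validate_taskname(name, max_len=MAX_TASKNAME_LEN):
    """Fail loudly before submission instead of silently injecting nothing."""
    if len(name) > max_len:
        raise ValueError(
            f"taskname {name!r} is {len(name)} characters; limit is {max_len}")
    return name

def report(injected, templates):
    """Only celebrate when at least one workflow was actually injected."""
    msg = f"Injected {injected} workflows out of {templates} templates."
    if injected == 0:
        return msg + " WARNING: nothing was submitted!"
    return msg + " Good job!"
```

With a check like this, the zero-injection case produces a warning rather than a cheerful "Good job!" that is easy to overlook.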
Send message Joined: 15 Jun 08 Posts: 2561 Credit: 256,894,612 RAC: 109,781 |
Looks like my CMS tasks are (temporarily) having problems uploading subtask results. In addition, there's a huge red peak in the dashboard graphic: http://dashb-cms-jobsmry.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=T3_CH_Volunteer&sitesSort=3&start=null&end=null&timeRange=lastWeek&sortBy=0&granularity=Hourly&generic=0&series=All&type=nwcb |
©2025 CERN