Message boards : CMS Application : CMS Tasks Failing

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,517,448
RAC: 124,814
Message 34154 - Posted: 29 Jan 2018, 21:01:30 UTC - in response to Message 34150.  

Hello Ivan, as the problems persist and every day some tasks fail with

2018-01-28 12:26:46 (7296): Guest Log: 01/28/18 12:28:29 attempt to connect to <128.142.142.167:9618> failed: Connection timed out (connect errno = 110).
2018-01-28 12:26:46 (7296): Guest Log: ERROR: failed to make connection to <128.142.142.167:9618>
2018-01-28 12:26:47 (7296): Guest Log: [ERROR] Could not ping HTCondor.
This afternoon, I had 4 tasks failing in a row :-(

@ Erich56

Take a look into the error logs of your hosts.

These examples are from:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10388905&offset=0&show_names=0&state=6&appid=11
2018-01-29 16:29:41 (6808): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:41 (6808): Guest Log: [DEBUG] 1
2018-01-29 16:29:41 (6808): Guest Log: [ERROR] Could not connect to Condor server on port 9618

2018-01-29 16:29:41 (3812): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:42 (3812): Guest Log: [DEBUG] 1
2018-01-29 16:29:42 (3812): Guest Log: [ERROR] Could not connect to Condor server on port 9618

2018-01-29 16:29:43 (9328): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:43 (9328): Guest Log: [DEBUG] 1
2018-01-29 16:29:43 (9328): Guest Log: [ERROR] Could not connect to Condor server on port 9618

2018-01-29 16:29:50 (8812): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:50 (8812): Guest Log: [DEBUG] 1
2018-01-29 16:29:50 (8812): Guest Log: [ERROR] Could not connect to Condor server on port 9618

It seems that you started too many VMs concurrently and did not allow them to reach a stable phase after startup.
After starting a VM you should wait at least until it reports a successful Condor ping, preferably a bit longer, before you start the next VM.
It would also help not to saturate the host up to 100% CPU usage. A limit of 75-80% would probably be much more stable.
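
For anyone who wants to test the connection from the host side, here is a minimal Python sketch of the same kind of TCP check that the guest log shows (`nc` against port 9618). The function names `can_connect` and `wait_until_reachable` are made up for illustration; only the host name and port come from the log above. It merely tests whether a TCP connection can be opened, which is an assumption about what "reachable" means and not the check the VM itself runs.

```python
import socket
import time


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False


def wait_until_reachable(host: str, port: int, attempts: int = 5,
                         delay: float = 30.0, timeout: float = 10.0) -> bool:
    """Hypothetical helper: retry the check, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if can_connect(host, port, timeout):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Something like `wait_until_reachable("vccondor01.cern.ch", 9618)` could be run by hand before resuming the next task. A successful check from the host does not guarantee that a freshly booted VM will also get through.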
ID: 34154
Toby Broom
Volunteer moderator
Joined: 27 Sep 08
Posts: 804
Credit: 650,193,811
RAC: 247,597
Message 34155 - Posted: 29 Jan 2018, 21:18:19 UTC - in response to Message 34133.  

I checked today and I'm at a similar percentage. Of the failed tasks, 75% are Error 152 (could not connect to Condor) or Error 206 (could not ping Condor).

The rest are a few where the VM could not be started, and one with a missing heartbeat.

Just for reference:

LHCb = 1.49%
ATLAS = 1.23%
Theory = 26.82% :( (this seems to be mostly because there was no work in the queue, or the same 152 errors)
SixTrack = 0.19% :)
ID: 34155
Erich56
Joined: 18 Dec 15
Posts: 1687
Credit: 103,106,842
RAC: 127,152
Message 34171 - Posted: 31 Jan 2018, 8:15:02 UTC - in response to Message 34154.  

computezrmle wrote:
@ Erich56

Take a look into the error logs of your hosts.

These examples are from:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10388905&offset=0&show_names=0&state=6&appid=11
2018-01-29 16:29:41 (6808): Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
2018-01-29 16:29:41 (6808): Guest Log: [DEBUG] 1
2018-01-29 16:29:41 (6808): Guest Log: [ERROR] Could not connect to Condor server on port 9618
...
It seems that you started too many VMs concurrently and did not allow them to reach a stable phase after startup.
After starting a VM you should wait at least until it reports a successful Condor ping, preferably a bit longer, before you start the next VM.
It would also help not to saturate the host up to 100% CPU usage. A limit of 75-80% would probably be much more stable.

Thanks for your thoughts and hints!

I cannot easily influence the way the VMs are started. Once one of the running tasks has finished and uploaded, the next one starts. So sometimes there is a long gap between starts, and sometimes only a short one.

As for CPU usage: besides the 2 GPUGRID tasks, I have 8 CMS tasks running. So the 12-core CPU is at a maximum of about 86% usage, and often at about 77% (or even less) when one or more CMS tasks are uploading interim results and use no CPU during that time.

Well, what I could do is reduce the number of concurrent CMS tasks by one or two, to see whether the Condor server problem then stops occurring.
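
As a rough back-of-the-envelope check of what the suggested 75-80% limit would mean on this host, here is a small Python sketch. The function name `max_cms_tasks` is made up, and the model assumes that every task (CMS or GPUGRID) keeps exactly one core fully busy, which is a simplification of the real load.

```python
import math


def max_cms_tasks(total_cores: int, cpu_limit: float, other_tasks: int) -> int:
    """Largest number of CMS tasks that keeps busy cores at or below cpu_limit.

    Assumption: each task occupies one full core. Real usage varies,
    for example while interim results are being uploaded.
    """
    # Small epsilon guards against floating-point results just below a whole number.
    budget = math.floor(total_cores * cpu_limit + 1e-9)
    return max(budget - other_tasks, 0)


for limit in (0.75, 0.80):
    print(f"{limit:.0%} limit -> {max_cms_tasks(12, limit, other_tasks=2)} CMS tasks")
```

Under that assumption both limits come out at 7 CMS tasks next to the 2 GPUGRID tasks, i.e. one fewer than the 8 currently running.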
ID: 34171
rbpeake
Joined: 17 Sep 04
Posts: 99
Credit: 30,646,329
RAC: 2,238
Message 38478 - Posted: 28 Mar 2019, 20:01:29 UTC

Do the CMS units fail if BOINC is exited and started again? I had to reboot my computer today and yesterday.
Thanks.
https://lhcathome.cern.ch/lhcathome/results.php?userid=1080&offset=0&show_names=0&state=6&appid=11
Regards,
Bob P.
ID: 38478
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,517,448
RAC: 124,814
Message 38481 - Posted: 28 Mar 2019, 20:58:49 UTC - in response to Message 38478.  

rbpeake wrote:
Do the CMS units fail if BOINC is exited and started again? I had to reboot my computer today and yesterday.
Thanks.
https://lhcathome.cern.ch/lhcathome/results.php?userid=1080&offset=0&show_names=0&state=6&appid=11

The link in your post points to a page that other users are not allowed to view.
Links should point to a computer page, the task list of a specific computer, or a specific task.


I guess it was this task?
https://lhcathome.cern.ch/lhcathome/result.php?resultid=220149107


There were 2 restarts within a few minutes.
The VM recovered during the 1st restart but not during the 2nd.
Most likely not enough time to update the heartbeat timestamp.
This is why the VM finally failed.
2019-03-28 13:50:51 (21584): Stopping VM.
...
2019-03-28 13:54:38 (21228): Successfully started VM. (PID = '7684')
2019-03-28 13:54:38 (21228): Reporting VM Process ID to BOINC.
2019-03-28 13:54:38 (21228): VM state change detected. (old = 'PoweredOff', new = 'Running')
...
2019-03-28 14:05:44 (21228): Stopping VM.
...
2019-03-28 14:09:01 (11328): Successfully started VM. (PID = '13268')
2019-03-28 14:09:01 (11328): Reporting VM Process ID to BOINC.
2019-03-28 14:09:01 (11328): VM state change detected. (old = 'PoweredOff', new = 'Running')
...
2019-03-28 14:49:04 (11328): VM Heartbeat file specified, but missing heartbeat.
...
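
To read such a log without doing the time arithmetic by hand, something along these lines could help. It is only a sketch: the function names and event labels are made up, and it recognises just the three message types quoted above. It reports how long the VM ran after each start before the next stop or heartbeat failure.

```python
from datetime import datetime

LOG = """\
2019-03-28 13:50:51 (21584): Stopping VM.
2019-03-28 13:54:38 (21228): Successfully started VM. (PID = '7684')
2019-03-28 14:05:44 (21228): Stopping VM.
2019-03-28 14:09:01 (11328): Successfully started VM. (PID = '13268')
2019-03-28 14:49:04 (11328): VM Heartbeat file specified, but missing heartbeat.
"""

# Marker text -> event label (labels are hypothetical, chosen for this sketch).
EVENTS = {
    "Stopping VM.": "stop",
    "Successfully started VM.": "start",
    "missing heartbeat": "heartbeat_missing",
}


def parse_events(log_text):
    """Return (timestamp, event) pairs for the lines we recognise."""
    events = []
    for line in log_text.splitlines():
        for marker, name in EVENTS.items():
            if marker in line:
                stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
                events.append((stamp, name))
                break
    return events


def uptimes(events):
    """Seconds between each VM start and the next stop or heartbeat failure."""
    result = []
    started = None
    for stamp, name in events:
        if name == "start":
            started = stamp
        elif started is not None:
            result.append((name, int((stamp - started).total_seconds())))
            started = None
    return result


print(uptimes(parse_events(LOG)))
# -> [('stop', 666), ('heartbeat_missing', 2403)]
```

So the VM ran for about 11 minutes between the two restarts, and the missing heartbeat was reported about 40 minutes after the second start.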
ID: 38481
Erich56
Joined: 18 Dec 15
Posts: 1687
Credit: 103,106,842
RAC: 127,152
Message 38482 - Posted: 28 Mar 2019, 20:59:12 UTC

Bob, for some reason your link does not open (access denied), so I cannot see the details (I guess that you do not allow, in your web settings, other people to see your tasks). You might wish to change this setting.

Re your question: the tasks normally do NOT fail if, after exiting BOINC, you give the VM about 1 or 2 minutes to shut down properly. Only then should the computer be shut down and switched on again.
Try it this way, and it should work.
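
If you would rather not guess at the 1 or 2 minutes, a small script can wait until VirtualBox reports no running VMs. This is only a sketch under stated assumptions: `VBoxManage` must be on the PATH and must be run as the same user that owns the BOINC VMs, and the helper names (`running_vms`, `list_running_vms`, `wait_for_vms_to_stop`) are made up. The 120-second default simply mirrors the "1 or 2 minutes" above.

```python
import re
import subprocess
import time

# `VBoxManage list runningvms` prints one line per VM in the form: "name" {uuid}
VM_LINE = re.compile(r'^"(?P<name>.*)" \{(?P<uuid>[0-9a-fA-F-]+)\}$')


def running_vms(output: str) -> list:
    """Extract VM names from the output of `VBoxManage list runningvms`."""
    names = []
    for line in output.splitlines():
        match = VM_LINE.match(line.strip())
        if match:
            names.append(match.group("name"))
    return names


def list_running_vms() -> str:
    """Assumes VBoxManage is on PATH and sees the VMs started by BOINC."""
    completed = subprocess.run(
        ["VBoxManage", "list", "runningvms"],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


def wait_for_vms_to_stop(lister=list_running_vms, timeout=120.0, poll=5.0) -> bool:
    """Poll until no VM is running; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if not running_vms(lister()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

Run `wait_for_vms_to_stop()` after exiting BOINC and only reboot once it returns True. The `lister` parameter exists so that the polling logic can be tried out without VirtualBox installed.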
ID: 38482
rbpeake
Joined: 17 Sep 04
Posts: 99
Credit: 30,646,329
RAC: 2,238
Message 38483 - Posted: 28 Mar 2019, 21:27:14 UTC - in response to Message 38481.  

Thanks!
Regards,
Bob P.
ID: 38483
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,517,448
RAC: 124,814
Message 38570 - Posted: 18 Apr 2019, 10:06:13 UTC

https://lhcathome.cern.ch/lhcathome/cms_job.php
Nothing but errors since yesterday.
I set my hosts to NNT (no new tasks) for CMS.
ID: 38570
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1005
Credit: 6,269,877
RAC: 404
Message 38571 - Posted: 18 Apr 2019, 12:34:01 UTC - in response to Message 38570.  

computezrmle wrote:
https://lhcathome.cern.ch/lhcathome/cms_job.php
Nothing but errors since yesterday.
I set my hosts to NNT for CMS.

Oops, thanks for reporting that. I'll try to work out what's wrong.
ID: 38571
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1005
Credit: 6,269,877
RAC: 404
Message 38572 - Posted: 18 Apr 2019, 13:05:20 UTC - in response to Message 38571.  

Quick answer: T3_CH_Volunteer tasks and jobs seem to be running OK. The problem seems to be further down the line. https://lhcathomedev.cern.ch/lhcathome-dev/cms_job.php suggests that the post-processing steps are not being run on T3_CH_CMSAtHome, and the latest merged result file was written to EOS at 18:30 CERN time last night. There have been some problems with the EOS file system in the last 24 hours; whether or not this is a consequence, I cannot tell. The graphs suggest a Dashboard problem too.
I'll send a few e-mails; my University has a week off over Easter, starting yesterday -- I'm not sure what holidays CERN has at this time.
ID: 38572
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1005
Credit: 6,269,877
RAC: 404
Message 38573 - Posted: 18 Apr 2019, 13:39:24 UTC - in response to Message 38572.  
Last modified: 18 Apr 2019, 13:47:28 UTC

Laurence reports that the T3_CH_CMSAtHome cluster seems fine, so we have a deeper problem than that. The jobs graphs suggest that there are stage-out problems writing results to the DataBridge at CERN, but if so, the error is not being propagated far enough to kill BOINC tasks (all my completed tasks show no errors). I'm having some trouble seeing recent result files on DataBridge, but that seems to date back to March 23rd, so it is probably a different problem altogether.
[Edit] Result files are being written to earlier subdirectories on DataBridge; why, I have no idea. It makes it harder to determine when the files were written. [/Edit]
ID: 38573
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 38574 - Posted: 18 Apr 2019, 14:14:01 UTC

On a more prosaic level, I have several CMS tasks waiting to run. They were downloaded about a day ago. Should I hold them, or allow them to run?
Thanks for your work. The problems have been remarkably few.
ID: 38574
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1005
Credit: 6,269,877
RAC: 404
Message 38579 - Posted: 18 Apr 2019, 18:33:53 UTC - in response to Message 38574.  

Jim1348 wrote:
On a more prosaic level, I have several CMS tasks waiting to run. They were downloaded about a day ago. Should I hold them, or allow them to run?
Thanks for your work. The problems have been remarkably few.

As far as you are concerned, it probably makes no difference -- it seems that the BOINC tasks see completion and award credit. On the other hand, CMS is not seeing the results of any jobs, so from our point of view that's a negative. I'd be inclined to hold them (tho' I'm letting my own machines continue to run, purely for the purpose of gathering intelligence). I'm starting to see signs that perhaps a certificate has expired, leading to write permissions for the result and log files being denied. Given the time of year, that may not be fixed before next Tuesday at the earliest... (and I'm away at a conference Wed and Thurs, which will limit my involvement those two days as I currently only have a tablet, no laptop.)
ID: 38579
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 38580 - Posted: 18 Apr 2019, 18:57:58 UTC - in response to Message 38579.  

ivan wrote:
As far as you are concerned, it probably makes no difference -- it seems that the BOINC tasks see completion and award credit. On the other hand, CMS is not seeing the results of any jobs, so from our point of view that's a negative.

I will put them on hold.
(What is "credit"?).
ID: 38580
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1005
Credit: 6,269,877
RAC: 404
Message 38584 - Posted: 19 Apr 2019, 11:20:11 UTC - in response to Message 38580.  


Jim1348 wrote:
(What is "credit"?).

For you, it's 7,247,535 :-)
ID: 38584
Erich56
Joined: 18 Dec 15
Posts: 1687
Credit: 103,106,842
RAC: 127,152
Message 39154 - Posted: 19 Jun 2019, 16:39:16 UTC

This afternoon, I got quite a number of failures within about 45 seconds of a CMS task starting.

Examples are:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=233584117, or
https://lhcathome.cern.ch/lhcathome/result.php?resultid=233559195

excerpt from the stderr:
Error in storage attach (fixed disk) for VM: -2135228409
...
VBoxManage.exe: error: Medium 'D:\BOINC DATA\slots\6\vm_image.vdi' is not accessible

what's going wrong?
ID: 39154
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1005
Credit: 6,269,877
RAC: 404
Message 39158 - Posted: 20 Jun 2019, 8:52:52 UTC - in response to Message 39154.  

Erich56 wrote:
This afternoon, I got quite a number of failures within about 45 seconds of a CMS task starting.

Examples are:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=233584117, or
https://lhcathome.cern.ch/lhcathome/result.php?resultid=233559195

excerpt from the stderr:
Error in storage attach (fixed disk) for VM: -2135228409
...
VBoxManage.exe: error: Medium 'D:\BOINC DATA\slots\6\vm_image.vdi' is not accessible

what's going wrong?

A bit further down:
Another VirtualBox management application has locked the session for this VM. BOINC cannot properly monitor this VM and so this job will be aborted.
2019-06-19 06:42:11 (7116): Could not create VM
2019-06-19 06:42:11 (7116): ERROR: VM failed to start
2019-06-19 06:42:16 (7116):
NOTE: VM session lock error encountered.
BOINC will be notified that it needs to clean up the environment.
This might be a temporary problem and so this job will be rescheduled for another time.


Check that you don't have a stalled job somewhere.
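
Since the decisive message is often further down the stderr than the first error, a quick scan of the whole output can save some scrolling. The sketch below just looks for message fragments that have been quoted in this thread; the labels and the function name `classify` are made up, and the list is by no means complete.

```python
# (marker text, label) pairs; markers are fragments of messages quoted in this
# thread, labels are hypothetical names chosen for this sketch.
SIGNATURES = [
    ("Could not connect to Condor server", "condor_connect"),
    ("Could not ping HTCondor", "condor_ping"),
    ("missing heartbeat", "heartbeat"),
    ("VM session lock error", "session_lock"),
    ("is not accessible", "disk_not_accessible"),
]


def classify(stderr_text):
    """Return the labels whose marker appears in stderr, in first-seen order."""
    found = []
    for line in stderr_text.splitlines():
        for marker, label in SIGNATURES:
            if marker in line and label not in found:
                found.append(label)
    return found


excerpt = (
    "VBoxManage.exe: error: Medium 'D:\\BOINC DATA\\slots\\6\\vm_image.vdi' is not accessible\n"
    "2019-06-19 06:42:11 (7116): ERROR: VM failed to start\n"
    "NOTE: VM session lock error encountered.\n"
)
print(classify(excerpt))
# -> ['disk_not_accessible', 'session_lock']
```

Seeing both labels together, as in this task, points at the session lock rather than at a damaged disk image.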
ID: 39158
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1005
Credit: 6,269,877
RAC: 404
Message 39172 - Posted: 23 Jun 2019, 18:11:19 UTC

Hmm, something's gone wrong. A new batch of jobs that I submitted yesterday has not appeared. I've just sent in a second batch while I try to work out what the problem might be.
ID: 39172
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1005
Credit: 6,269,877
RAC: 404
Message 39189 - Posted: 26 Jun 2019, 10:18:54 UTC - in response to Message 39172.  
Last modified: 26 Jun 2019, 10:20:51 UTC

ivan wrote:
Hmm, something's gone wrong. A new batch of jobs that I submitted yesterday has not appeared. I've just sent in a second batch while I try to work out what the problem might be.

OK, I tried submission again today and it failed because the task name was longer than 50 characters! If this also happened at the weekend, I didn't notice, because the final message was:
Injected 0 workflows out of 1 templates. Good job!
That's easy to overlook...
I modified the JSON script and a new submission worked. My machines are picking up jobs again.
I've also modified the submission script so that it doesn't give the above message if no jobs were submitted successfully.
Of course, this was almost certainly due to changes that were made to the WMAgent software during interventions at the end of last week.
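
For illustration only, the two changes could look roughly like this in Python. This is not the actual submission script: the function names, the constant and the wording of the new messages are made up, and only the 50-character limit and the original "Injected ... Good job!" line come from the description above.

```python
# Limit taken from the post; where exactly it is enforced is an assumption.
MAX_TASK_NAME_LENGTH = 50


def check_task_name(name: str) -> None:
    """Fail early, before submission, if the task name is too long."""
    if len(name) > MAX_TASK_NAME_LENGTH:
        raise ValueError(
            f"task name is {len(name)} characters, limit is {MAX_TASK_NAME_LENGTH}"
        )


def summary_message(injected: int, templates: int) -> str:
    """Only congratulate when every template was actually injected."""
    msg = f"Injected {injected} workflows out of {templates} templates."
    if injected == 0:
        return msg + " Nothing was submitted, check the errors above."
    if injected < templates:
        return msg + " Some templates failed."
    return msg + " Good job!"
```

Checking the name up front turns a silent non-submission into an immediate error, and the summary line no longer reads as a success when nothing went through.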
ID: 39189
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,517,448
RAC: 124,814
Message 39208 - Posted: 27 Jun 2019, 18:55:48 UTC

ID: 39208


©2024 CERN