Message boards : CMS Application : Please check your task times and your IPv6 connectivity

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 43528 - Posted: 25 Oct 2020, 16:39:41 UTC

Since the last WMAgent update we have had a large number of job failures, as reported on the job graphs, and the overall number of tasks running is higher than normal. However, the overall number of failures reported by WMStats is not unusual.
There is, though, a high number of jobs reporting success with exit code 0 which are re-runs after an initial failure. Most of the job failures I see are code 8002 -- fatal exception. I can only look at a small number of these, but they all failed because of an inability to connect to the conditions-database (Frontier) servers via IPv6:
An exception of category 'StdException' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
A std::exception was thrown.
Can not get data (Additional Information: [frontier.c:1159]: No more servers/proxies. Last error was: Request 131 on chan 1 failed at Sun Oct 25 10:00:01 2020: -9 [fn-socket.c:85]: network error on connect to 2606:4700:3032::681c:94c: Network is unreachable) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )

Now, many of these failures are coming from the same users -- and indeed machines. Their task reports are characterised by only lasting 30-60 minutes and only consuming a few minutes of CPU; nevertheless, they are flagged as valid.
Properly running CMS tasks should run for 12-18 hours and calculate several of our jobs in that time.
My assumption is that the 8002 failures are resubmitted to another machine and run properly the second time, getting a zero exit code (success).
I managed to contact one of the users who was getting many 8002 errors, and he reported that his IPv6 connection had inadvertently been turned off at his router. However, his jobs are still running short with little CPU usage, and about 50% of them now report an error while computing.
I'm still not sure exactly where the problem lies, but I would ask that you check your CMS task reports and note if many are running for less than 12 hours, with very low CPU usage. If you find that, please check that your router and firewall are letting through IPv6 requests on the ports listed in our FAQ.
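If you want a quick way to test this yourself, here is a rough Python sketch of an IPv6-only reachability probe. It is not an official CMS tool, and the probe host in the comment is an assumption -- substitute the Frontier hosts and ports listed in our FAQ.

```python
# Rough sketch of an IPv6 reachability test (not an official CMS tool).
# It attempts a TCP connection over IPv6 only; substitute the Frontier
# hosts/ports from the FAQ for a realistic check.
import socket

def ipv6_reachable(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host succeeds over IPv6."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6,
                                   socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # name does not resolve to any IPv6 address
    for _family, _type, _proto, _name, sockaddr in infos:
        try:
            with socket.create_connection((sockaddr[0], port),
                                          timeout=timeout):
                return True  # connected: the IPv6 path works
        except OSError:
            continue  # try the next address, if any
    return False

# Example: ipv6_reachable("ipv6.google.com") should be True on a host
# with working IPv6, and False if the router blocks or lacks IPv6.
```

A False result for a host you know publishes AAAA records points at the router or firewall rather than the application.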
Thanks.
ID: 43528
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,130,430
RAC: 104,897
Message 43529 - Posted: 25 Oct 2020, 17:54:25 UTC

Hello Ivan,
I saw this task from last night. It stopped at the change from CEST to CET:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=288254714

IPv6 was not found for a long time.
ID: 43529
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 43530 - Posted: 25 Oct 2020, 19:01:14 UTC - in response to Message 43529.  

Hello Ivan,
I saw this task from last night. It stopped at the change from CEST to CET:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=288254714

IPv6 was not found for a long time.

I got a bunch of these as well. All CMS and Theory tasks that were running failed when Daylight Saving Time ended and the computer moved the clock back one hour. The failure was 'VM Heartbeat file specified, but missing heartbeat.' Atlas tasks seem to have survived.
ID: 43530
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 43531 - Posted: 26 Oct 2020, 10:23:58 UTC - in response to Message 43530.  

Interesting. I would have hoped that such checks relied on the UNIX epoch rather than local time.
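For illustration, here is a minimal sketch of why an epoch-based check is immune to the clock change. This is my sketch, not the actual vboxwrapper heartbeat code, and the 1200 s timeout is an assumed value.

```python
# Sketch: a heartbeat-staleness check based on UNIX epoch seconds.
# Epoch time is timezone-free, so a CEST->CET switch cannot change the
# measured age; a check that parses local wall-clock timestamps can
# jump by 3600 s. The timeout value is illustrative, not vboxwrapper's.
import time

HEARTBEAT_TIMEOUT = 1200  # assumed: seconds of silence before the VM is presumed dead

def heartbeat_stale(last_heartbeat_epoch, now_epoch=None):
    """True if the last heartbeat is older than HEARTBEAT_TIMEOUT."""
    if now_epoch is None:
        now_epoch = time.time()
    return (now_epoch - last_heartbeat_epoch) > HEARTBEAT_TIMEOUT

# A heartbeat written 600 s ago is fine:
fresh = heartbeat_stale(1_000_000, now_epoch=1_000_600)          # False
# Mis-adding a DST hour (3600 s) would falsely flag the VM as dead:
skewed = heartbeat_stale(1_000_000, now_epoch=1_000_600 + 3600)  # True
```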
ID: 43531
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,130,430
RAC: 104,897
Message 43532 - Posted: 26 Oct 2020, 11:14:13 UTC - in response to Message 43531.  

Interesting. I would have hoped that such checks relied on the UNIX epoch rather than local time.

We are hoping this was the last change in the EU from CEST to CET.
No one will need it in the future.
ID: 43532
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 43533 - Posted: 26 Oct 2020, 11:53:51 UTC - in response to Message 43532.  
Last modified: 26 Oct 2020, 15:46:33 UTC

We are hoping this was the last change in the EU from CEST to CET.

Unfortunately, European politicians cannot decide what the standard time will be in the future (CET or CEST),

so "Joe Bloggs" and "Fred Bloggs" have to keep switching their watches for the time being.

Btw: I removed the heartbeat mechanism check from my CMS-job xml-file, so I cannot be surprised anymore by a killed VM.
ID: 43533
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 43534 - Posted: 26 Oct 2020, 19:22:43 UTC

My Italian colleague has checked some of the condor logs and finds that the vast majority (98%) of jobs with 8002 errors come from the two machines I have already identified, so this is not a general problem, and not really "ours" except that it makes our performance plots look bad.
ID: 43534
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,502,974
RAC: 4,007
Message 43535 - Posted: 26 Oct 2020, 22:54:15 UTC - in response to Message 43534.  
Last modified: 26 Oct 2020, 22:57:44 UTC

We have had that time change problem here many times over the years.
(Fall back to Pacific Standard Time: November 1, 2020 for me)
(Sunday, October 25, 2020 for the Geneva time change)
ID: 43535
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,948,605
RAC: 137,172
Message 43543 - Posted: 1 Nov 2020, 18:27:11 UTC - in response to Message 43534.  

CMS queue is dry again.

It looks like computers suffering from those or other problems quickly eat up all available subtasks in the queues.
It also looks like the automatic host banishment implemented in BOINC is too slow to cover that.

The question is whether it is possible to manually exclude the most affected hosts from getting tasks until their volunteers contact the project admins or the forum.
ID: 43543
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 43575 - Posted: 6 Nov 2020, 14:28:05 UTC - in response to Message 43543.  

CMS queue is dry again.

It looks like computers suffering from those or other problems quickly eat up all available subtasks in the queues.
It also looks like the automatic host banishment implemented in BOINC is too slow to cover that.

The question is whether it is possible to manually exclude the most affected hosts from getting tasks until their volunteers contact the project admins or the forum.

It shouldn't have been, at least according to my monitoring. (I did have a monitoring problem over the weekend; it turned out that my browser queue was corrupted -- clearing all history data eventually fixed it.)
I'm just waiting on Federica to analyse the latest condor logs to see if our "rogue" volunteer is still the main problem with jobs being flagged as failures in the graphs. Unfortunately I don't have BOINC admin rights at LHC@Home (I used to when CMS@Home was a separate entity). If we confirm that our Finnish friend is still the main culprit, I can ask Nils to email him directly (he's not answered my private message); failing that, I can then ask for him to be banned. Unfortunately he has a 64-processor computer and somehow manages to get over 150 tasks queued at any one time...
ID: 43575
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 43576 - Posted: 6 Nov 2020, 14:43:50 UTC - in response to Message 43575.  

To get 150 CMS tasks is easy: all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount regardless of your cache-size settings.
ID: 43576
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 43578 - Posted: 6 Nov 2020, 16:06:24 UTC - in response to Message 43576.  

To get 150 CMS tasks is easy: all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount regardless of your cache-size settings.

Ah, thanks; I wasn't aware of that loophole.
ID: 43578
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 43581 - Posted: 6 Nov 2020, 18:59:42 UTC - in response to Message 43578.  

To get 150 CMS tasks is easy: all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount regardless of your cache-size settings.

Ah, thanks; I wasn't aware of that loophole.

I haven't tested all the settings, but every time I enable CMS I get swamped with a lot of tasks.
ID: 43581
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 43583 - Posted: 7 Nov 2020, 9:50:33 UTC - in response to Message 43578.  

To get 150 CMS tasks is easy: all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount regardless of your cache-size settings.

Ah, thanks; I wasn't aware of that loophole.
This is explainable BOINC behaviour.

On the 1st of November you had several short-running tasks (700 seconds) one after another, exiting because of NO_SUB_TASKS.
For BOINC they were OK, and BOINC calculates that your machine is able to run that application within a certain amount of time.
When you have a big cache buffer to fill (days) and that application becomes available again, BOINC expects you to complete each task in about 700 seconds and requests a lot of tasks to fill your cache.
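The arithmetic can be sketched like this. This is a back-of-the-envelope illustration only; the real BOINC client also weights resource shares, running averages, and per-project limits.

```python
# Back-of-the-envelope version of the effect described above: if a run
# of 700 s NO_SUB_TASKS exits drags the per-task runtime estimate down
# to ~700 s, the client thinks it needs hundreds of tasks to fill a
# multi-day cache. Illustrative only, not the real scheduler logic.
def tasks_requested(cache_days, est_runtime_s, ncpus=1):
    """Tasks 'needed' to fill the cache at the estimated per-task runtime."""
    cache_seconds = cache_days * 24 * 3600
    return int(cache_seconds * ncpus // est_runtime_s)

print(tasks_requested(2.0, 700))    # 246 tasks for a 2-day cache, 1 core
print(tasks_requested(2.0, 43200))  # 4 tasks at a realistic 12 h estimate
```

The two calls show how a collapsed runtime estimate inflates the request by two orders of magnitude for the same cache setting.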
ID: 43583
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 43584 - Posted: 7 Nov 2020, 15:57:03 UTC - in response to Message 43583.  

I thought that tasks that error out were omitted from the APR calculation?
ID: 43584
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 43587 - Posted: 8 Nov 2020, 20:08:54 UTC - in response to Message 43584.  
Last modified: 8 Nov 2020, 20:12:34 UTC

My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I will only tick the CMS box when down to the last few CMS tasks, if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine - no correlation with job failures. Set to 0.5 days + 1.5 days. "accept work from other applications?" unticked. It might be trying to fill the local queue wrt the CMS deadline, but since letting it do this can force the other sub-projects off the machine, I've never let it get that far.

My pet theory is that this is some side-effect of how Atlas and Theory artificially limit their task numbers based on core counts (such that I can't get two days' worth of work between them). That, and the usual chaos one gets from trying to micro-manage the BOINC client (see above) rather than just letting it get on with it.

p.s. This discussion should probably be in this thread?
ID: 43587
Jim1348

Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 43594 - Posted: 12 Nov 2020, 19:10:12 UTC - in response to Message 43587.  

My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I will only tick the CMS box when down to the last few CMS tasks, if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine - no correlation with job failures. Set to 0.5 days + 1.5 days. "accept work from other applications?" unticked. It might be trying to fill the local queue wrt the CMS deadline, but since letting it do this can force the other sub-projects off the machine, I've never let it get that far.

That just happened to me on a newly-attached machine, with 0.5 + 0.5 days buffer setting.
It ran fine for a few days, and then went bonkers. It stopped after filling the buffer with 7 days of work, and I have set NNW.
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10671473

I have noticed it on various projects (including WCG) ever since the recent versions of BOINC came out. They made a change to the scheduler (always a dangerous thing to do), and no one knows what is happening.
ID: 43594
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 43595 - Posted: 12 Nov 2020, 20:51:25 UTC - in response to Message 43594.  

I have noticed it on various projects (including WCG) ever since the recent versions of BOINC came out. They made a change to the scheduler (always a dangerous thing to do), and no one knows what is happening.
The same happened to me on PrimeGrid on 23 Oct, using BOINC version 7.16.11.
Quoting myself from PrimeGrid's message board:
I'm responsible for 171 aborted tasks.
In spite of having the cache buffer set to 0.04 + 0.01 additional days, tasks kept on flowing to one of my machines. No idea why.
Of the roughly 750 tasks, I crunched the certificate ones (about 375) and aborted 171 of the normal long runners (more aborts to come tomorrow).
ID: 43595
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 43613 - Posted: 15 Nov 2020, 19:54:58 UTC - in response to Message 43587.  

My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I will only tick the CMS box when down to the last few CMS tasks, if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine - no correlation with job failures. Set to 0.5 days + 1.5 days. "accept work from other applications?" unticked. It might be trying to fill the local queue wrt the CMS deadline, but since letting it do this can force the other sub-projects off the machine, I've never let it get that far.

My pet theory is that this is some side-effect of how Atlas and Theory artificially limit their task numbers based on core counts (such that I can't get two days' worth of work between them). That, and the usual chaos one gets from trying to micro-manage the BOINC client (see above) rather than just letting it get on with it.

p.s. This discussion should probably be in this thread?

This is all a bit strange to me. As far as I recall, I've never got more CMS tasks than the physical number of CPUs in my machines -- I have one 4-core machine set to work preferences of Max 6 tasks, Max 6 CPUs; it only ever gets 4 tasks. The 6-core and 40-core "work" servers get no more than 6 tasks at any one time.
Perhaps something's changed since I stopped running them intensively during lockdown(s).
ID: 43613
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 43615 - Posted: 15 Nov 2020, 21:11:29 UTC - in response to Message 43613.  
Last modified: 15 Nov 2020, 21:13:18 UTC

My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I will only tick the CMS box when down to the last few CMS tasks, if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine - no correlation with job failures. Set to 0.5 days + 1.5 days. "accept work from other applications?" unticked. It might be trying to fill the local queue wrt the CMS deadline, but since letting it do this can force the other sub-projects off the machine, I've never let it get that far.

My pet theory is that this is some side-effect of how Atlas and Theory artificially limit their task numbers based on core counts (such that I can't get two days' worth of work between them). That, and the usual chaos one gets from trying to micro-manage the BOINC client (see above) rather than just letting it get on with it.

p.s. This discussion should probably be in this thread?

This is all a bit strange to me. As far as I recall I've never got more CMS tasks than the physical number of CPUs in my machines -- I have one 4-core machine set to work preferences of Max 6 tasks, Max 6 CPUs; it only ever gets 4 tasks. 6-core and 40 core "work" servers get no more than 6 tasks at any one time.
Perhaps something's changed since I stopped running them intensively during lockdown(s).

Could you try how many you get if you don't limit the number of tasks? For those of us with more than 8 cores, the maximum of 8 tasks is not enough to keep our machines fully loaded. Atlas and Theory limit my tasks to 8+8 even with the 'No limit' setting for the number of tasks, but CMS seems different.
ID: 43615



©2024 CERN