Thread 'Please check your task times and your IPv6 connectivity'

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 43528 - Posted: 25 Oct 2020, 16:39:41 UTC Since the last WMAgent update we have had a large number of job failures, as reported on the job graphs, and the overall number of tasks running is higher than normal. However, the overall number of failures reported by WMStats is not unusual. There are, though, a high number of jobs reporting as success with exit code 0 which are re-runs after an initial failure. Most of the job failures I see are code 8002 -- fatal exception. I can only look at a small number of these, but they all failed because of an inability to connect to the conditions database (frontier) servers via IPv6:
An exception of category 'StdException' occurred while
[0] Constructing the EventProcessor
[1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
A std::exception was thrown.
Can not get data (Additional Information: [frontier.c:1159]: No more servers/proxies. Last error was: Request 131 on chan 1 failed at Sun Oct 25 10:00:01 2020: -9 [fn-socket.c:85]: network error on connect to 2606:4700:3032::681c:94c: Network is unreachable) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )
Now, many of these failures are coming from the same users -- and indeed the same machines. Their task reports are characterised by lasting only 30-60 minutes and consuming only a few minutes of CPU; nevertheless, they are flagged as valid. Properly running CMS tasks should run for 12-18 hours and calculate several of our jobs in that time.
My assumption is that the 8002 failures are resubmitted to another machine and run properly the second time, getting a zero exit code (success). I managed to contact one of the users who was getting many 8002 errors, and he reported that his IPv6 connection had inadvertently been turned off at his router. However, his jobs are still running short with little CPU usage, but about 50% of them now report an error while computing. I'm still not sure exactly where the problem lies, but I would ask that you check your CMS task reports and note if many are running for less than 12 hours, with very low CPU usage. If you find that, please check that your router and firewall are letting through IPv6 requests on the ports listed in our FAQ. Thanks. ID: 43528 · Reply Quote
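
The two checks requested in the post above can be sketched in Python. This is only an illustration under stated assumptions: the 12-hour threshold comes from the post, but the 10-minute CPU cut-off is a guess at what "a few minutes" means, and the function names are made up for this example. The host and port to probe are not given in the thread (the post only points to the project FAQ), so they are left as parameters.

```python
import socket

# Threshold from the post: properly running CMS tasks last 12-18 hours.
MIN_HEALTHY_RUNTIME_S = 12 * 3600
# Assumption: "a few minutes of CPU" is read here as under 10 minutes.
LOW_CPU_S = 10 * 60


def looks_suspicious(run_time_s: float, cpu_time_s: float) -> bool:
    """Return True for a task that ran short AND used very little CPU."""
    return run_time_s < MIN_HEALTHY_RUNTIME_S and cpu_time_s < LOW_CPU_S


def ipv6_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Try a TCP connection to host:port using IPv6 addresses only."""
    try:
        # Restricting the lookup to AF_INET6 means IPv4 is never tried.
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except OSError:
        # Covers name-resolution failures (socket.gaierror is an OSError).
        return False
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            # "Network is unreachable", refused, timed out: try the next address.
            continue
    return False
```

A task reported at 45 minutes of run time and 3 minutes of CPU would be flagged by `looks_suspicious(45 * 60, 3 * 60)`. A short task is not necessarily an IPv6 failure, so a flagged task is a prompt to look at the task report, not a diagnosis. `ipv6_reachable` would be called with a server name and port taken from the FAQ.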

maeax Send message Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 17,509	Message 43529 - Posted: 25 Oct 2020, 17:54:25 UTC Hello Ivan, saw this task from last night. Stopped at changing from CEST to CET: https://lhcathome.cern.ch/lhcathome/result.php?resultid=288254714 IPv6 not found so long. ID: 43529 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 806 Credit: 66,047,456 RAC: 27,780	Message 43530 - Posted: 25 Oct 2020, 19:01:14 UTC - in response to Message 43529. Hello Ivan, saw this task from last night. Stopped at changing from CEST to CET: https://lhcathome.cern.ch/lhcathome/result.php?resultid=288254714 IPv6 not found so long. I got a bunch of these as well. All CMS and Theory tasks that were running failed when Daylight Saving Time ended and the computer moved its clock one hour back. The failure was 'VM Heartbeat file specified, but missing heartbeat.' Atlas tasks seem to have survived. ID: 43530 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 43531 - Posted: 26 Oct 2020, 10:23:58 UTC - in response to Message 43530. Interesting. I would have hoped that such checks relied on the UNIX epoch rather than local time. ID: 43531 · Reply Quote
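
The difference between the two approaches can be shown with a small Python sketch. It does not reproduce how the VM wrapper actually checks its heartbeat file (the thread does not say); it only illustrates why an age computed from epoch seconds is unaffected by the clock going back, while an age computed from naive local wall-clock times is not. The function names and the example times are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets stand in for the two sides of the change on 25 Oct 2020.
CEST = timezone(timedelta(hours=2))
CET = timezone(timedelta(hours=1))


def age_from_epoch(written: datetime, checked: datetime) -> float:
    """Heartbeat age in seconds, computed from UNIX epoch timestamps."""
    return checked.timestamp() - written.timestamp()


def age_from_wall_clock(written: datetime, checked: datetime) -> float:
    """Heartbeat age computed from local clock readings with no offset."""
    naive_written = written.replace(tzinfo=None)
    naive_checked = checked.replace(tzinfo=None)
    return (naive_checked - naive_written).total_seconds()


# Heartbeat written 30 s before the change, checked 30 s after it.
written = datetime(2020, 10, 25, 2, 59, 30, tzinfo=CEST)
checked = datetime(2020, 10, 25, 2, 0, 30, tzinfo=CET)

print(age_from_epoch(written, checked))       # 60.0
print(age_from_wall_clock(written, checked))  # -3540.0
```

Only 60 seconds really elapsed, but the wall-clock version reports an age of minus 59 minutes, which any threshold comparison would then misjudge.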

maeax Send message Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 17,509	Message 43532 - Posted: 26 Oct 2020, 11:14:13 UTC - in response to Message 43531. Interesting. I would have hoped that such checks relied on the UNIX epoch rather than local time. We are hoping this was the last change in the EU from CEST to CET. No one needs it in the future. ID: 43532 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 43533 - Posted: 26 Oct 2020, 11:53:51 UTC - in response to Message 43532. Last modified: 26 Oct 2020, 15:46:33 UTC We are hoping this was the last change in the EU from CEST to CET. Unfortunately, European politicians cannot decide what will be the standard time in the future (CET or CEST), so "Joe Bloggs" and "Fred Bloggs" have to switch their watches for the time being. Btw: I removed the heartbeat mechanism check from my CMS-job xml-file, so I cannot be surprised anymore by a killed VM. ID: 43533 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 43534 - Posted: 26 Oct 2020, 19:22:43 UTC My Italian colleague has checked some of the condor logs and finds that the vast majority (98%) of jobs with 8002 errors come from the two machines I have already identified, so this is not a general problem, and not really "ours" except that it makes our performance plots look bad. ID: 43534 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1312 Credit: 97,694,594 RAC: 106,766	Message 43535 - Posted: 26 Oct 2020, 22:54:15 UTC - in response to Message 43534. Last modified: 26 Oct 2020, 22:57:44 UTC We have had that time change problem here many times over the years. (Fall back to Pacific Standard Time: November 1, 2020 for me) (Sunday, October 25, 2020 for the Geneva time change) ID: 43535 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2755 Credit: 304,271,457 RAC: 116,232	Message 43543 - Posted: 1 Nov 2020, 18:27:11 UTC - in response to Message 43534. CMS queue is dry again. It looks like computers suffering from those or other problems quickly eat up all available subtasks in the queues. It also looks like the automatic host banishment implemented in BOINC is too slow to cover that. The question is whether it is possible to manually exclude the most affected hosts from getting tasks until their volunteers contact the project admins or the forum. ID: 43543 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 43575 - Posted: 6 Nov 2020, 14:28:05 UTC - in response to Message 43543. CMS queue is dry again. It looks like computers suffering from those or other problems quickly eat up all available subtasks in the queues. It also looks like the automatic host banishment implemented in BOINC is too slow to cover that. The question is whether it is possible to manually exclude the most affected hosts from getting tasks until their volunteers contact the project admins or the forum. It shouldn't have been, at least from my monitoring. (I did have a monitoring problem over the weekend, turned out that my browser queue was corrupted -- clearing all history data eventually cleared it.) I'm just waiting on Federica to analyse the latest condor logs to see if our "rogue" volunteer is still the main problem with jobs being flagged as failures in the graphs. Unfortunately I don't have BOINC admin rights at LHC@Home (I used to when CMS@Home was a separate entity). If we confirm that our Finnish friend is still the main culprit, I can ask Nils to email him directly (he's not answered my private message); failing that, I can then ask for him to be banned. Unfortunately he has a 64-processor computer and somehow manages to get over 150 tasks queued at any one time... ID: 43575 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 806 Credit: 66,047,456 RAC: 27,780	Message 43576 - Posted: 6 Nov 2020, 14:43:50 UTC - in response to Message 43575. To get 150 CMS tasks is easy; all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount despite your cache size settings. ID: 43576 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 43578 - Posted: 6 Nov 2020, 16:06:24 UTC - in response to Message 43576. To get 150 CMS tasks is easy; all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount despite your cache size settings. Ah, thanks; I wasn't aware of that loophole. ID: 43578 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 806 Credit: 66,047,456 RAC: 27,780	Message 43581 - Posted: 6 Nov 2020, 18:59:42 UTC - in response to Message 43578. To get 150 CMS tasks is easy; all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount despite your cache size settings. Ah, thanks; I wasn't aware of that loophole. I haven't tested all the settings, but every time I enable CMS I get swamped with a lot of tasks. ID: 43581 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 43583 - Posted: 7 Nov 2020, 9:50:33 UTC - in response to Message 43578. To get 150 CMS tasks is easy; all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount despite your cache size settings. Ah, thanks; I wasn't aware of that loophole. This is explainable BOINC behaviour. On the 1st of November you had several short-running tasks (700 seconds) one after another, exiting because of NO_SUB_TASKS. For BOINC they were OK, and BOINC calculates that your machine is able to run that application within a certain amount of time. When you have a big cache buffer to fill (days) and that application is available again, BOINC expects that you can do that application in about 700 seconds and needs a lot of tasks to fill your cache. ID: 43583 · Reply Quote
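
The arithmetic behind this explanation can be sketched as follows. This is a deliberate simplification, not BOINC's actual work-fetch logic, which involves more factors and server-side limits; the function name and the example buffer and core counts are illustrative assumptions. Only the 700-second figure comes from the post.

```python
import math

SECONDS_PER_DAY = 86_400


def tasks_to_fill_cache(buffer_days: float, n_cpus: int, est_runtime_s: float) -> int:
    """Rough count of tasks needed to keep n_cpus busy for buffer_days,
    given the client's current per-task runtime estimate."""
    cpu_seconds_wanted = buffer_days * SECONDS_PER_DAY * n_cpus
    return math.ceil(cpu_seconds_wanted / est_runtime_s)


# Estimate skewed to 700 s by short tasks, 2-day buffer, 4 cores.
print(tasks_to_fill_cache(2.0, 4, 700))        # 988
# Same buffer with a realistic 12-hour estimate.
print(tasks_to_fill_cache(2.0, 4, 12 * 3600))  # 16
```

Under these assumptions the skewed estimate asks for roughly sixty times as many tasks as a realistic one, which is the order of magnitude of the swamping described in this thread.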

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 806 Credit: 66,047,456 RAC: 27,780	Message 43584 - Posted: 7 Nov 2020, 15:57:03 UTC - in response to Message 43583. I thought that tasks that error out are omitted from APR calculation? ID: 43584 · Reply Quote

Henry Nebrensky Send message Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 0	Message 43587 - Posted: 8 Nov 2020, 20:08:54 UTC - in response to Message 43584. Last modified: 8 Nov 2020, 20:12:34 UTC My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I will only tick the CMS box when down to the last few CMS tasks if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine - no correlation with job failures. Set to 0.5 days + 1.5 days. "accept work from other applications?" unticked. It might be trying to fill the local queue wrt the CMS deadline, but since letting it do this can force the other sub-projects off the machine I've never let it get that far. My pet theory is that this is some side-effect of how Atlas and Theory artificially limit their task numbers based on core numbers (such that I can't get two days' worth of work between them). That, and the usual chaos one gets from trying to micro-manage the BOINC client (see above) rather than just letting it get on with it. p.s. This discussion should probably be in this thread? ID: 43587 · Reply Quote

Jim1348 Send message Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 43594 - Posted: 12 Nov 2020, 19:10:12 UTC - in response to Message 43587. My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I will only tick the CMS box when down to the last few CMS tasks if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine - no correlation with job failures. Set to 0.5 days + 1.5 days. "accept work from other applications?" unticked. It might be trying to fill the local queue wrt the CMS deadline, but since letting it do this can force the other sub-projects off the machine I've never let it get that far. That just happened to me on a newly-attached machine, with 0.5 + 0.5 days buffer setting. It ran fine for a few days, and then went bonkers. It stopped after filling the buffer with 7 days of work, and I have set NNW. https://lhcathome.cern.ch/lhcathome/results.php?hostid=10671473 I have noticed it on various projects (including WCG) ever since the recent versions of BOINC came out. They made a change to the scheduler (always a dangerous thing to do), and no one knows what is happening. ID: 43594 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 43595 - Posted: 12 Nov 2020, 20:51:25 UTC - in response to Message 43594. I have noticed it on various projects (including WCG) ever since the recent versions of BOINC came out. They made a change to the scheduler (always a dangerous thing to do), and no one knows what is happening. The same happened to me on PrimeGrid on 23 Oct. using BOINC version 7.16.11. Quoting myself from PrimeGrid's message board: I'm responsible for 171 aborted tasks. In spite of having the cache buffer set to 0.04 + 0.01 additional, tasks kept on flowing to one of my machines. No idea why. Of the roughly 750 tasks, I crunched the certificate ones (about 375) and aborted 171 of the normal long runners (more aborts to come tomorrow). ID: 43595 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1159 Credit: 11,871,672 RAC: 7,410	Message 43613 - Posted: 15 Nov 2020, 19:54:58 UTC - in response to Message 43587. My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I will only tick the CMS box when down to the last few CMS tasks if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine - no correlation with job failures. Set to 0.5 days + 1.5 days. "accept work from other applications?" unticked. It might be trying to fill the local queue wrt the CMS deadline, but since letting it do this can force the other sub-projects off the machine I've never let it get that far. My pet theory is that this is some side-effect of how Atlas and Theory artificially limit their task numbers based on core numbers (such that I can't get two days' worth of work between them). That, and the usual chaos one gets from trying to micro-manage the BOINC client (see above) rather than just letting it get on with it. p.s. This discussion should probably be in this thread? This is all a bit strange to me. As far as I recall I've never got more CMS tasks than the physical number of CPUs in my machines -- I have one 4-core machine set to work preferences of Max 6 tasks, Max 6 CPUs; it only ever gets 4 tasks. 6-core and 40-core "work" servers get no more than 6 tasks at any one time. Perhaps something's changed since I stopped running them intensively during lockdown(s). ID: 43613 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 806 Credit: 66,047,456 RAC: 27,780	Message 43615 - Posted: 15 Nov 2020, 21:11:29 UTC - in response to Message 43613. Last modified: 15 Nov 2020, 21:13:18 UTC My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I will only tick the CMS box when down to the last few CMS tasks if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine - no correlation with job failures. Set to 0.5 days + 1.5 days. "accept work from other applications?" unticked. It might be trying to fill the local queue wrt the CMS deadline, but since letting it do this can force the other sub-projects off the machine I've never let it get that far. My pet theory is that this is some side-effect of how Atlas and Theory artificially limit their task numbers based on core numbers (such that I can't get two days' worth of work between them). That, and the usual chaos one gets from trying to micro-manage the BOINC client (see above) rather than just letting it get on with it. p.s. This discussion should probably be in this thread? This is all a bit strange to me. As far as I recall I've never got more CMS tasks than the physical number of CPUs in my machines -- I have one 4-core machine set to work preferences of Max 6 tasks, Max 6 CPUs; it only ever gets 4 tasks. 6-core and 40-core "work" servers get no more than 6 tasks at any one time. Perhaps something's changed since I stopped running them intensively during lockdown(s). Could you check how many you get if you don't limit the number of tasks? For those of us who have more than 8 cores, the maximum of 8 tasks is not enough to keep our machines fully loaded. Atlas and Theory limit my tasks to 8+8 even with the 'No limit' setting for number of tasks, but CMS seems different. ID: 43615 · Reply Quote