Message boards : CMS Application : Please check your task times and your IPv6 connectivity
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298
Since the last WMAgent update we have had a large number of job failures, as reported on the job graphs, and the overall number of tasks running is higher than normal. However, the overall number of failures reported by WMStats is not unusual. There are, though, a high number of jobs reporting success with exit code 0 which are re-runs after an initial failure. Most of the job failures I see are code 8002 -- fatal exception. I can only look at a small number of these, but they all failed because of an inability to connect to the conditions database (Frontier) servers via IPv6:

An exception of category 'StdException' occurred while
  [0] Constructing the EventProcessor
  [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
Exception Message:
A std::exception was thrown.
Can not get data (Additional Information: [frontier.c:1159]: No more servers/proxies. Last error was: Request 131 on chan 1 failed at Sun Oct 25 10:00:01 2020: -9 [fn-socket.c:85]: network error on connect to 2606:4700:3032::681c:94c: Network is unreachable) ( CORAL : "coral::FrontierAccess::Statement::execute" from "CORAL/RelationalPlugins/frontier" )

Now, many of these failures are coming from the same users -- and indeed the same machines. Their task reports are characterised by lasting only 30-60 minutes and consuming only a few minutes of CPU; nevertheless, they are flagged as valid. Properly running CMS tasks should run for 12-18 hours and calculate several of our jobs in that time. My assumption is that the 8002 failures are resubmitted to another machine and run properly the second time, getting a zero exit code (success).

I managed to contact one of the users who was getting many 8002 errors and he reported that his IPv6 connection had inadvertently been turned off at his router. However, his jobs are still running short with little CPU usage, but about 50% of them now report an error while computing.

I'm still not sure exactly where the problem lies, but I would ask that you check your CMS task reports and note whether many are running for less than 12 hours with very low CPU usage. If you find that, please check that your router and firewall are letting through IPv6 requests on the ports listed in our FAQ. Thanks.
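A rough way to self-check IPv6 reachability before digging into router or firewall settings is sketched below in Python. This is only an illustrative sketch, not a tool the project distributes; the host and port are placeholders, and the endpoints that actually matter are the Frontier/proxy hosts and ports listed in the FAQ.

```python
# Minimal IPv6 reachability sketch -- illustrative only, not an official CMS@Home tool.
# Replace HOST/PORT with the Frontier/proxy endpoints and ports listed in the LHC@home FAQ;
# the values below are placeholders.
import socket

HOST = "lhcathome.cern.ch"   # placeholder host
PORT = 443                   # placeholder port

def ipv6_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection over IPv6 to (host, port) succeeds."""
    try:
        # Ask the resolver for IPv6 (AAAA) addresses only.
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        print(f"No IPv6 address found for {host}: {exc}")
        return False
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                print(f"IPv6 connect to {sockaddr[0]} port {port} succeeded")
                return True
        except OSError as exc:
            # "Network is unreachable" here matches the frontier error quoted above.
            print(f"IPv6 connect to {sockaddr[0]} port {port} failed: {exc}")
    return False

if __name__ == "__main__":
    ipv6_reachable(HOST, PORT)
```

A host that prints "Network is unreachable" for every address is in the same state as the failing jobs above, whichever endpoint is used for the test.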
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456
Hello Ivan, I saw this task from last night. It stopped at the change from CEST to CET: https://lhcathome.cern.ch/lhcathome/result.php?resultid=288254714 No IPv6 found for so long.
Joined: 28 Sep 04 Posts: 732 Credit: 49,367,266 RAC: 17,281
Hello Ivan, I got a bunch of these as well. All CMS and Theory tasks that were running failed when Daylight Saving Time ended and the computer moved its clock back one hour. The failure was 'VM Heartbeat file specified, but missing heartbeat.' Atlas tasks seem to have survived.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 456
Interesting. I would have hoped that such checks relied on the UNIX epoch rather than local time. We are hoping this was the last change in the EU from CEST to CET. No one needs it in the future.
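As an illustration of that point, a heartbeat check keyed to epoch seconds is unaffected by a DST shift, whereas anything comparing formatted local times can suddenly see the heartbeat as an hour old. The sketch below is purely illustrative Python with a hypothetical file name and threshold; it is not the vboxwrapper's actual implementation.

```python
# Illustrative sketch of an epoch-based heartbeat check -- NOT the actual
# vboxwrapper code. File name and threshold are hypothetical.
import os
import time

HEARTBEAT_FILE = "heartbeat"   # hypothetical file the guest VM touches periodically
MAX_AGE_SECONDS = 1200         # hypothetical staleness threshold

def heartbeat_ok(path: str = HEARTBEAT_FILE, max_age: float = MAX_AGE_SECONDS) -> bool:
    """True if the heartbeat file exists and was touched within max_age seconds.

    Both time.time() and os.path.getmtime() return UNIX epoch seconds, which do
    not jump when the local clock switches between CEST and CET.
    """
    try:
        age = time.time() - os.path.getmtime(path)
    except FileNotFoundError:
        return False
    return age <= max_age
```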
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266
> We are hoping this was the last change in the EU from CEST to CET.

Unfortunately, European politicians cannot decide what the standard time will be in the future (CET or CEST), so "Joe Bloggs" and "Fred Bloggs" have to switch their watches for the time being. Btw: I removed the heartbeat mechanism check from my CMS-job xml-file, so I cannot be surprised any more by a killed VM.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298
Joined: 24 Oct 04 Posts: 1176 Credit: 54,887,670 RAC: 5,761
We have had that time-change problem here many times over the years. (The fall back to Pacific Standard Time is November 1, 2020 for me; the Geneva time change was Sunday, October 25, 2020.)
Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609
The CMS queue is dry again. It looks like computers suffering from those or other problems quickly eat up all available subtasks in the queues. It also looks like the automatic host banishment implemented in BOINC is too slow to cover that. The question is whether it is possible to manually exclude the most affected hosts from getting tasks until their volunteers contact the project admins or the forum.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298
> The CMS queue is dry again.

It shouldn't have been, at least from my monitoring. (I did have a monitoring problem over the weekend; it turned out that my browser queue was corrupted -- clearing all the history data eventually cleared it.) I'm just waiting on Federica to analyse the latest condor logs to see if our "rogue" volunteer is still the main reason for jobs being flagged as failures in the graphs. Unfortunately I don't have BOINC admin rights at LHC@Home (I used to when CMS@Home was a separate entity). If we confirm that our Finnish friend is still the main culprit, I can ask Nils to email him directly (he hasn't answered my private message); failing that, I can then ask for him to be banned. Unfortunately he has a 64-processor computer and somehow manages to get over 150 tasks queued at any one time...
Joined: 28 Sep 04 Posts: 732 Credit: 49,367,266 RAC: 17,281
To get 150 CMS tasks is easy: all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount regardless of your cache size settings.
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298
Joined: 28 Sep 04 Posts: 732 Credit: 49,367,266 RAC: 17,281
> To get 150 CMS tasks is easy: all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount regardless of your cache size settings.

I haven't tested all the settings, but every time I enable CMS I get swamped with a lot of tasks.
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266
> To get 150 CMS tasks is easy: all you have to do is enable 'If no work for selected applications is available, accept work from other applications?' and within an hour you get that amount regardless of your cache size settings.

This is explainable BOINC behaviour. On the 1st of November you had several short-running tasks (about 700 seconds each), one after another, exiting because of NO_SUB_TASKS. For BOINC they were OK, and from them BOINC calculates that your machine is able to run that application within a certain amount of time. When you have a big cache buffer (days) to fill and that application is available again, BOINC expects you to finish each task in about 700 seconds and therefore needs a lot of tasks to fill your cache.
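To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python of the effect being described. It is only illustrative: the real BOINC client uses a more elaborate runtime estimate (APR, duration correction factor, server-side limits), and the numbers are made up.

```python
# Back-of-the-envelope sketch of the work-fetch effect described above.
# Illustrative only -- the real BOINC client uses a more elaborate runtime
# estimate (APR, duration correction factor, per-project and server limits).

def tasks_requested(cache_days: float, est_task_seconds: float, ncpus: int) -> int:
    """Rough number of tasks needed to fill cache_days of work on ncpus cores,
    if each task is expected to take est_task_seconds of wall time."""
    cache_seconds = cache_days * 24 * 3600
    return int(ncpus * cache_seconds / est_task_seconds)

# A CMS task estimated at its real ~12-18 h runtime versus one mis-estimated
# at ~700 s after a run of NO_SUB_TASKS exits:
print(tasks_requested(2.0, 12 * 3600, 4))  # 16 tasks -- sane
print(tasks_requested(2.0, 700, 4))        # 987 tasks -- the "swamped" case
```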
Joined: 28 Sep 04 Posts: 732 Credit: 49,367,266 RAC: 17,281
I thought that tasks that error out were omitted from the APR calculation?
Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 2
My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I only tick the CMS box when down to the last few CMS tasks, and only if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine -- no correlation with job failures. My cache is set to 0.5 days + 1.5 days, with "accept work from other applications?" unticked. It might be trying to fill the local queue with respect to the CMS deadline, but since letting it do this can force the other sub-projects off the machine, I've never let it get that far. My pet theory is that this is some side-effect of how Atlas and Theory artificially limit their task numbers based on core count (such that I can't get two days' worth of work between them). That, and the usual chaos one gets from trying to micro-manage the BOINC client (see above) rather than just letting it get on with it. p.s. This discussion should probably be in this thread?
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0
> My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I only tick the CMS box when down to the last few CMS tasks, and only if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine -- no correlation with job failures. My cache is set to 0.5 days + 1.5 days, with "accept work from other applications?" unticked. It might be trying to fill the local queue with respect to the CMS deadline, but since letting it do this can force the other sub-projects off the machine, I've never let it get that far.

That just happened to me on a newly-attached machine, with a 0.5 + 0.5 days buffer setting. It ran fine for a few days, and then went bonkers. It stopped after filling the buffer with 7 days of work, and I have set NNW (no new work). https://lhcathome.cern.ch/lhcathome/results.php?hostid=10671473 I have noticed it on various projects (including WCG) ever since the recent versions of BOINC came out. They made a change to the scheduler (always a dangerous thing to do), and no one knows what is happening.
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,266
> I have noticed it on various projects (including WCG) ever since the recent versions of BOINC came out. They made a change to the scheduler (always a dangerous thing to do), and no one knows what is happening.

The same happened to me on PrimeGrid on 23 Oct., using BOINC version 7.16.11. Quoting myself from PrimeGrid's message board: "I'm responsible for 171 aborted tasks."
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 298
> My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I only tick the CMS box when down to the last few CMS tasks, and only if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine -- no correlation with job failures. My cache is set to 0.5 days + 1.5 days, with "accept work from other applications?" unticked. It might be trying to fill the local queue with respect to the CMS deadline, but since letting it do this can force the other sub-projects off the machine, I've never let it get that far.

This is all a bit strange to me. As far as I recall, I've never got more CMS tasks than the physical number of CPUs in my machines -- I have one 4-core machine set to work preferences of Max 6 tasks, Max 6 CPUs; it only ever gets 4 tasks. My 6-core and 40-core "work" servers get no more than 6 tasks at any one time. Perhaps something's changed since I stopped running them intensively during lockdown(s).
Joined: 28 Sep 04 Posts: 732 Credit: 49,367,266 RAC: 17,281
> My 4-core box will readily get swamped by 100+ CMS tasks, to the extent that I only tick the CMS box when down to the last few CMS tasks, and only if I'm around to watch what happens and untick it in a hurry! This happens even when CMS is running fine -- no correlation with job failures. My cache is set to 0.5 days + 1.5 days, with "accept work from other applications?" unticked. It might be trying to fill the local queue with respect to the CMS deadline, but since letting it do this can force the other sub-projects off the machine, I've never let it get that far.

Could you try how many you get if you don't limit the number of tasks? For those of us with more than 8 cores, the 8-task maximum is not enough to keep our machines fully loaded. Atlas and Theory limit my tasks to 8+8 even with the 'No limit' setting for number of tasks, but CMS seems different.