Thread 'ATLAS jobs failing after longer suspension'

Author	Message
Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,123,480 RAC: 1,291	Message 32746 - Posted: 10 Oct 2017, 6:49:13 UTC After suspending a job without leaving the task in memory, e.g. because I want to shut down the machine overnight, the ATLAS job does not survive once the task is resumed. BOINC treats the task as valid, but no good ATLAS result is uploaded. Error code this time for Panda 92: description: pilot, 1008: General pilot error, consult batch log exe, 92: Unknown exeerrorcode error code 92 Please make longer suspensions possible with valid ATLAS results. ID: 32746 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 2,365	Message 32748 - Posted: 10 Oct 2017, 7:28:52 UTC - in response to Message 32746. Close and exit BOINC Manager and wait until the Atlas VM is saved in VirtualBox. If you then put the computer into standby, what do you see after it wakes up? ID: 32748 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,123,480 RAC: 1,291	Message 32749 - Posted: 10 Oct 2017, 7:40:09 UTC - in response to Message 32748. Last modified: 10 Oct 2017, 7:49:58 UTC My sequence of actions:
- suspend the task in BOINC (leave app in memory off)
- wait until the VM is properly saved
- stop the BOINC client
- shut down the machine
It's not a BOINC or wrapper problem; it seems ATLAS doesn't accept longer suspensions. Example task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158819174 ID: 32749 · Reply Quote
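The "wait until the VM is properly saved" step in the sequence above can be checked rather than eyeballed. The following is a minimal sketch, not part of BOINC or vboxwrapper: it assumes `VBoxManage` is on the PATH, that the VM name is known (the name used in the usage note is only a placeholder), and that a VM reporting `saved` or `poweroff` is safe to leave. The function names and the timeout values are hypothetical.

```python
import re
import subprocess
import time
from typing import Optional

# States in which no VM process is left running (assumption for this sketch).
SETTLED_STATES = {"saved", "poweroff"}


def vm_state(showvminfo_output: str) -> Optional[str]:
    """Extract the VMState value from `VBoxManage showvminfo <vm> --machinereadable` output."""
    match = re.search(r'^VMState="([^"]+)"', showvminfo_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def query_vm_state(vm_name: str) -> Optional[str]:
    """Ask VirtualBox for the current state of one VM."""
    result = subprocess.run(
        ["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
        capture_output=True, text=True, check=True,
    )
    return vm_state(result.stdout)


def wait_until_saved(vm_name: str, poll_seconds: float = 5.0,
                     timeout_seconds: float = 600.0) -> bool:
    """Poll until the VM is settled; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if query_vm_state(vm_name) in SETTLED_STATES:
            return True
        time.sleep(poll_seconds)
    return False
```

Calling `wait_until_saved("boinc_xxxxxxxx")` after suspending the task, and stopping the client only once it returns `True`, would cover the second step; suspending the task and stopping the BOINC client remain manual steps in this sketch.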

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 2,365	Message 32752 - Posted: 10 Oct 2017, 11:16:09 UTC The only difference for me is Leave in memory -> on. I had a Windows update this morning. After the reboot the Atlas task ran without problems. This update is also coming to the other two PCs. Will take a look. ID: 32752 · Reply Quote

Yeti Volunteer moderator Send message Joined: 2 Sep 04 Posts: 468 Credit: 224,935,712 RAC: 402	Message 32756 - Posted: 10 Oct 2017, 12:37:14 UTC Hm, maybe the suspend happened in the initial phase of the WU, while it was downloading more details from outside. In this phase, Atlas WUs are sensitive to pausing / suspending. Once the initial downloads are finished, Atlas WUs can be suspended for quite a long time without problems. Supporting BOINC, a great concept ! ID: 32756 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,123,480 RAC: 1,291	Message 32757 - Posted: 10 Oct 2017, 13:01:04 UTC - in response to Message 32756. Yeti wrote: Hm, maybe the suspend happened in the initial phase of the WU, while it was downloading more details from outside. In this phase, Atlas WUs are sensitive to pausing / suspending.
No, that was not the case:
2017-10-09 19:35:36 (5252): Guest Log: Starting ATLAS job.
2017-10-09 22:49:04 (5252): Successfully stopped VM.
@maeax: The title of my thread says longer suspension. Just rebooting is normally only a short interruption of the task. Leaving the app in memory and then stopping BOINC has the same effect: the task is saved to disk. ID: 32757 · Reply Quote

tullio Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 32763 - Posted: 10 Oct 2017, 17:48:10 UTC I am running Atlas tasks 24/7 on my Windows 10 PC, but I set NNT (No New Tasks) in order to control a batch of tasks. Today I had two running, and when they completed I ran a Windows update with an LHCb task running. Then I downloaded two more Atlas tasks; one of them is running on 2 cores, the other is waiting to start. About 75% of completed and validated Atlas tasks produce HITS files. The LHCb task uses 0 CPU and is doing absolutely nothing. CMS tasks fail. Tullio ID: 32763 · Reply Quote

Jonathan Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0	Message 32782 - Posted: 11 Oct 2017, 11:06:41 UTC - in response to Message 32763. Tullio's CMS task 158795678 <core_client_version>7.8.2</core_client_version> <![CDATA[ <message> The filename or extension is too long. (0xce) - exit code 206 (0xce) </message> <stderr_txt> 2017-10-08 15:26:09 (3056): vboxwrapper (7.7.26196): starting 2017-10-08 15:26:09 (3056): Feature: Checkpoint interval offset (69 seconds) 2017-10-08 15:26:09 (3056): Detected: VirtualBox COM Interface (Version: 5.1.22) 2017-10-08 15:26:09 (3056): Detected: Minimum checkpoint interval (600.000000 seconds) 2017-10-08 15:26:09 (3056): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds) 2017-10-08 15:26:09 (3056): Successfully copied 'init_data.xml' to the shared directory. 2017-10-08 15:26:09 (3056): Create VM. (boinc_cb05b53bc5486a6d, slot#0) 2017-10-08 15:26:09 (3056): Setting Memory Size for VM. (2048MB) 2017-10-08 15:26:09 (3056): Setting CPU Count for VM. (1) 2017-10-08 15:26:09 (3056): Setting Chipset Options for VM. 2017-10-08 15:26:09 (3056): Setting Boot Options for VM. 2017-10-08 15:26:09 (3056): Enabling VM Network Access. 2017-10-08 15:26:09 (3056): Setting Network Configuration for NAT. 2017-10-08 15:26:09 (3056): Disabling USB Support for VM. 2017-10-08 15:26:09 (3056): Disabling COM Port Support for VM. 2017-10-08 15:26:09 (3056): Disabling LPT Port Support for VM. 2017-10-08 15:26:09 (3056): Disabling Audio Support for VM. 2017-10-08 15:26:09 (3056): Disabling Clipboard Support for VM. 2017-10-08 15:26:09 (3056): Disabling Drag and Drop Support for VM. 2017-10-08 15:26:09 (3056): Adding storage controller(s) to VM. 2017-10-08 15:26:09 (3056): Adding virtual disk drive to VM. (vm_image.vdi) 2017-10-08 15:26:09 (3056): Adding VirtualBox Guest Additions to VM. 2017-10-08 15:26:09 (3056): Adding network bandwidth throttle group to VM. 
(Defaulting to 1024GB) 2017-10-08 15:26:09 (3056): forwarding host port 51955 to guest port 80 2017-10-08 15:26:09 (3056): Enabling remote desktop for VM. 2017-10-08 15:26:09 (3056): Enabling shared directory for VM. 2017-10-08 15:26:10 (3056): Starting VM. (boinc_cb05b53bc5486a6d, slot#0) 2017-10-08 15:26:23 (3056): Guest Log: BIOS: VirtualBox 5.1.22 2017-10-08 15:26:23 (3056): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63 2017-10-08 15:26:23 (3056): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032 2017-10-08 15:26:23 (3056): Guest Log: BIOS: Booting from Hard Disk... 2017-10-08 15:26:23 (3056): Guest Log: BIOS: KBD: unsupported int 16h function 03 2017-10-08 15:26:23 (3056): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000 2017-10-08 15:26:23 (3056): Successfully started VM. (PID = '10160') 2017-10-08 15:26:23 (3056): Reporting VM Process ID to BOINC. 2017-10-08 15:26:33 (3056): VM state change detected. (old = 'poweroff', new = 'running') 2017-10-08 15:26:43 (3056): Guest Log: vboxguest: misc device minor 56, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000) 2017-10-08 15:26:43 (3056): Detected: Web Application Enabled (http://localhost:51955) 2017-10-08 15:26:43 (3056): Detected: Remote Desktop Enabled (localhost:51956) 2017-10-08 15:26:53 (3056): Preference change detected 2017-10-08 15:26:53 (3056): Setting CPU throttle for VM. (100%) 2017-10-08 15:26:53 (3056): Setting checkpoint interval to 600 seconds. 
(Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds)) 2017-10-08 15:27:33 (3056): Guest Log: VBoxService 4.3.28 r100309 (verbosity: 0) linux.amd64 (May 13 2015 17:11:31) release log 2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000091 main Log opened 2017-10-08T13:27:22.351168000Z 2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000339 main OS Product: Linux 2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000394 main OS Release: 4.1.34-22.cernvm.x86_64 2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000435 main OS Version: #1 SMP Mon Oct 24 14:29:58 CEST 2016 2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000476 main OS Service Pack: #1 SMP Mon Oct 24 14:29:58 CEST 2016 2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000518 main Executable: /usr/sbin/VBoxService 2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000519 main Process ID: 2703 2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000520 main Package type: LINUX_64BITS_GENERIC 2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.001796 main 4.3.28 r100309 started. Verbose level = 0 2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.052918 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared" 2017-10-08 15:28:34 (3056): Guest Log: [INFO] Mounting the shared directory 2017-10-08 15:28:34 (3056): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded! 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] 0 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Connection to lhchomeproxy.cern.ch 3125 port [tcp/a13-an] succeeded! 
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] 0 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Testing VCCS connection to vccs.cern.ch on port 443 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded! 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] 0 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Testing connection to Condor server on port 9618 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Connection to vccondor01.cern.ch 9618 port [tcp/condor] succeeded! 2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] 0 2017-10-08 15:28:44 (3056): Guest Log: [DEBUG] Probing CVMFS ... 2017-10-08 15:28:44 (3056): Guest Log: Probing /cvmfs/grid.cern.ch... OK 2017-10-08 15:29:04 (3056): Guest Log: Probing /cvmfs/cms.cern.ch... OK 2017-10-08 15:29:04 (3056): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE 2017-10-08 15:29:04 (3056): Guest Log: 2.2.0.0 3409 1 19744 4925 14 1 1255772 10240001 2 65024 0 20 95 20792 0 http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch http://128.142.33.31:3125 1 2017-10-08 15:29:34 (3056): VM state change detected. (old = 'running', new = 'paused') 2017-10-09 00:52:45 (3056): VM state change detected. (old = 'paused', new = 'running') 2017-10-09 00:53:15 (3056): Guest Log: [INFO] Reading volunteer information 2017-10-09 00:53:15 (3056): Guest Log: [INFO] Volunteer: tullio (96166) Host: 10407309 2017-10-09 00:53:15 (3056): Guest Log: [INFO] VMID: 6d0ae20b-f23e-4d5d-b5ca-600a8fb1d26c 2017-10-09 00:53:15 (3056): Guest Log: [INFO] Requesting an X509 credential from LHC@home 2017-10-09 00:53:15 (3056): Guest Log: [INFO] Running the fast benchmark. 2017-10-09 00:54:05 (3056): Guest Log: [INFO] Machine performance 10.82 HEPSEC06 2017-10-09 00:54:05 (3056): Guest Log: [INFO] CMS application starting. Check log files. 
2017-10-09 00:54:05 (3056): Guest Log: [DEBUG] HTCondor ping 2017-10-09 00:54:15 (3056): Guest Log: [DEBUG] 0 2017-10-09 02:30:15 (3056): Status Report: Job Duration: '64800.000000' 2017-10-09 02:30:15 (3056): Status Report: Elapsed Time: '6000.496927' 2017-10-09 02:30:15 (3056): Status Report: CPU Time: '230.703125' 2017-10-09 02:48:37 (3056): VM state change detected. (old = 'running', new = 'paused') 2017-10-09 04:29:04 (3056): VM state change detected. (old = 'paused', new = 'running') 2017-10-09 04:29:24 (3056): Guest Log: [ERROR] Condor exited after 12916s without running a job. 2017-10-09 04:29:24 (3056): Guest Log: [INFO] Shutting Down. 2017-10-09 04:29:24 (3056): VM Completion File Detected. 2017-10-09 04:29:24 (3056): VM Completion Message: Condor exited after 12916s without running a job. . 2017-10-09 04:29:24 (3056): Powering off VM. 2017-10-09 04:29:25 (3056): Successfully stopped VM. 2017-10-09 04:29:30 (3056): Deregistering VM. (boinc_cb05b53bc5486a6d, slot#0) 2017-10-09 04:29:30 (3056): Removing virtual disk drive(s) from VM. 2017-10-09 04:29:30 (3056): Removing network bandwidth throttle group from VM. 2017-10-09 04:29:30 (3056): Removing storage controller(s) from VM. 2017-10-09 04:29:30 (3056): Removing VM from VirtualBox. 04:29:36 (3056): called boinc_finish(206) </stderr_txt> ]]> ID: 32782 · Reply Quote

tullio Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 32798 - Posted: 11 Oct 2017, 17:12:37 UTC Atlas tasks run on 2 cores. All other LHC tasks fail save SixTrack, which are rarely provided. Why? Tullio ID: 32798 · Reply Quote

Jonathan Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 0	Message 32799 - Posted: 11 Oct 2017, 20:05:10 UTC - in response to Message 32798. You should start a new thread in the proper job forum with all your details, custom setup files, etc. ID: 32799 · Reply Quote

tullio Send message Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0	Message 32828 - Posted: 14 Oct 2017, 9:48:51 UTC - in response to Message 32799. Last modified: 14 Oct 2017, 9:50:11 UTC I am running stock apps with no custom setup files, and this in all my BOINC projects: SETI, Einstein, climateprediction.net, LHC. I am running single-core ATLAS on a Linux box and two-core Atlas on the Windows 10 PC. I run SETI and Einstein GPU tasks on another Linux box with a GTX 750Ti GPU board with no errors. Only LHC@home tasks, save Atlas and SixTrack, error out on my fastest machine, the Windows 10 PC with 22 GB RAM. Tullio ID: 32828 · Reply Quote

HerveUAE Send message Joined: 18 Dec 16 Posts: 123 Credit: 37,495,365 RAC: 0	Message 32836 - Posted: 15 Oct 2017, 18:16:10 UTC - in response to Message 32828. Last modified: 15 Oct 2017, 18:16:21 UTC Same for me, and I have given up trying to run CMS, LHCb and Theory tasks on my machines. But yes, those are matters for other discussion threads. We are the product of random evolution. ID: 32836 · Reply Quote

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 2,365	Message 32848 - Posted: 17 Oct 2017, 1:50:13 UTC Had a problem because of a Windows Update. This task was suspended for six hours.
2017-10-16 16:09:20 (8052): VM state change detected. (old = 'running', new = 'paused')
2017-10-17 02:38:31 (8052): VM state change detected. (old = 'paused', new = 'running')
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=76552885
After a restart it finished having done only a little work and got cobblestones. ID: 32848 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,123,480 RAC: 1,291	Message 32852 - Posted: 18 Oct 2017, 7:07:22 UTC Another 2 tasks were killed early after being resumed from their night's sleep. Both VMs were properly saved and had done about 30 events. After the 2 tasks were resumed, 40 seconds apart, both VMs stopped after about 3 minutes.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=159256150
https://lhcathome.cern.ch/lhcathome/result.php?resultid=159262082
For Theory this has been working fine for several years now, but for ATLAS (and CMS) suspending and resuming does not work the way BOINC designed it. A resumed Theory job in the VM carries on from where it was saved. A task can be suspended for several reasons, so this should work well for the science done inside a VM too. Some suspend possibilities:
- User's wish
- User's host busy with non-BOINC tasks
- Another project has a higher resource share
- Other tasks start running in high priority because of a possible deadline miss
- etc.
Please make suspending work as well for ATLAS as it does for Theory. ID: 32852 · Reply Quote

David Cameron Project administrator Project developer Project scientist Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0	Message 32854 - Posted: 19 Oct 2017, 9:36:50 UTC - in response to Message 32852. This is due to the "pilot killed looping job" errors reported on the other thread. I found out that in ATLAS code there is a periodic check that files used by the WU are updated, and if they are not it thinks the process is stuck and terminates it. When a WU is resumed after a long (more than 12 hours) suspension this check is run and since nothing was updated since the suspension the process is killed. I've now increased this time limit to one week so new WU submitted should be ok to suspend overnight or even for a few days. ID: 32854 · Reply Quote
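The check described above can be illustrated with a small sketch. It is not the ATLAS pilot code: the function name, the idea of passing the files' last-update times as plain timestamps, and the example dates are all assumptions made for illustration. It also assumes the guest clock catches up with real time after a resume. Under those assumptions it shows why a wall-clock staleness check cannot tell a suspended VM from a stuck process, and why raising the limit helps.

```python
from datetime import datetime, timedelta


def looks_stuck(last_updates, now, limit):
    """Return True if no watched file was updated within `limit` before `now`.

    last_updates: iterable of datetime, one per watched file (hypothetical input;
    a real check would read file modification times).
    """
    return now - max(last_updates) > limit


# Hypothetical overnight suspension: files were last touched just before the VM was saved.
saved_at = datetime(2017, 10, 17, 18, 0, 0)
resumed_at = saved_at + timedelta(hours=13)
last_updates = [saved_at - timedelta(minutes=5), saved_at - timedelta(seconds=30)]

OLD_LIMIT = timedelta(hours=12)  # limit mentioned in the post
NEW_LIMIT = timedelta(weeks=1)   # limit after the change described in the post

print(looks_stuck(last_updates, resumed_at, OLD_LIMIT))  # True: job would be killed
print(looks_stuck(last_updates, resumed_at, NEW_LIMIT))  # False: job survives
```

The first check after the resume sees files that are 13 hours old even though the job did nothing wrong, so only the size of the limit decides whether the job is terminated.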

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1561 Credit: 10,123,480 RAC: 1,291	Message 32858 - Posted: 19 Oct 2017, 14:30:09 UTC - in response to Message 32854. David Cameron wrote: When a WU is resumed after a long (more than 12 hours) suspension this check is run and since nothing was updated since the suspension the process is killed.
My nights are always short, less than 12 hours:
2017-10-17 22:42:10 (12768): Stopping VM.
2017-10-17 22:42:43 (12768): Successfully stopped VM.
2017-10-18 07:20:06 (6028): vboxwrapper (7.7.26196): starting
2017-10-17 22:43:07 (8516): Stopping VM.
2017-10-17 22:43:34 (8516): Successfully stopped VM.
2017-10-18 07:20:47 (5476): vboxwrapper (7.7.26196): starting
This afternoon I tested a suspension of 1 hour and 22 minutes (Guest host hibernated) and those 2 tasks survived:
LHC@home 19 Oct 14:01:41 task DJWLDmQRgOrnSu7Ccp2YYBZmABFKDmABFKDmOONKDmABFKDm8m4Dgn_1 suspended by user
LHC@home 19 Oct 14:01:41 task r7pKDmpQLPrnDDn7oo6G73TpABFKDmABFKDmNMMKDmABFKDmVyT6Co_0 suspended by user
LHC@home 19 Oct 15:23:19 task DJWLDmQRgOrnSu7Ccp2YYBZmABFKDmABFKDmOONKDmABFKDm8m4Dgn_1 resumed by user
LHC@home 19 Oct 15:23:39 task r7pKDmpQLPrnDDn7oo6G73TpABFKDmABFKDmNMMKDmABFKDmVyT6Co_0 resumed by user
ID: 32858 · Reply Quote
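The length of each night's suspension can be computed from the vboxwrapper log lines quoted above instead of being estimated. This is a minimal sketch, assuming the log of a single task is passed in as text and that the line formats match the ones shown in this thread; the function and pattern names are hypothetical.

```python
import re
from datetime import datetime

STAMP = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
STOPPED = re.compile(STAMP + r" \(\d+\): Successfully stopped VM\.")
STARTED = re.compile(STAMP + r" \(\d+\): vboxwrapper \([\d.]+\): starting")


def parse_stamp(stamp):
    """Convert a vboxwrapper timestamp string into a datetime."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")


def suspension_seconds(task_log):
    """Seconds between the first 'Successfully stopped VM.' and the next wrapper start.

    Returns None if the log does not contain both events in that order.
    """
    stopped = STOPPED.search(task_log)
    if stopped is None:
        return None
    started = STARTED.search(task_log, stopped.end())
    if started is None:
        return None
    delta = parse_stamp(started.group(1)) - parse_stamp(stopped.group(1))
    return int(delta.total_seconds())


first_task = (
    "2017-10-17 22:42:10 (12768): Stopping VM.\n"
    "2017-10-17 22:42:43 (12768): Successfully stopped VM.\n"
    "2017-10-18 07:20:06 (6028): vboxwrapper (7.7.26196): starting\n"
)
gap = suspension_seconds(first_task)
print(gap, gap < 12 * 3600)  # 31043 True
```

For the first task this gives 31043 seconds, about 8 hours 37 minutes, which is below the 12-hour figure quoted from David Cameron's post.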

maeax Send message Joined: 2 May 07 Posts: 2305 Credit: 179,727,092 RAC: 2,365	Message 34914 - Posted: 8 Apr 2018, 15:38:58 UTC - in response to Message 32854. Last modified: 8 Apr 2018, 15:39:48 UTC David Cameron wrote: This is due to the "pilot killed looping job" errors reported on the other thread. I found out that in ATLAS code there is a periodic check that files used by the WU are updated, and if they are not it thinks the process is stuck and terminates it. When a WU is resumed after a long (more than 12 hours) suspension this check is run and since nothing was updated since the suspension the process is killed. I've now increased this time limit to one week so new WU submitted should be ok to suspend overnight or even for a few days.
I have a PC that crashed several times under Windows 10 Pro (x64) today. When the task starts again, Atlas shows the last processed collisions in the console, BUT after a few seconds it begins again from the first collision. This happens every time the PC crashes! Is it possible to save the progress of the running WU every 15 minutes or so? ID: 34914 · Reply Quote