Message boards : ATLAS application : ATLAS jobs failing after longer suspension
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1371
Credit: 9,129,837
RAC: 3,792
Message 32746 - Posted: 10 Oct 2017, 6:49:13 UTC

After suspending a job without leaving the task in memory, e.g. because I want to shut down the machine overnight,
the ATLAS job does not survive when the task is resumed.
BOINC treats the task as valid, but no good ATLAS result is uploaded.

Error code this time for Panda 92: description: pilot, 1008: General pilot error, consult batch log exe, 92: Unknown exeerrorcode error code 92

Please make a longer suspension possible with valid ATLAS results.
ID: 32746
maeax

Joined: 2 May 07
Posts: 2189
Credit: 172,926,055
RAC: 40,772
Message 32748 - Posted: 10 Oct 2017, 7:28:52 UTC - in response to Message 32746.  

Close and exit BOINC Manager and wait until the ATLAS VM is saved in VirtualBox.
When you then put the computer into standby, what result do you get after it wakes up?
ID: 32748
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1371
Credit: 9,129,837
RAC: 3,792
Message 32749 - Posted: 10 Oct 2017, 7:40:09 UTC - in response to Message 32748.  
Last modified: 10 Oct 2017, 7:49:58 UTC

My sequence of actions:

- suspend the task in BOINC (leave app in memory off)
- wait until VM is properly saved
- stop BOINC client
- shut down the machine
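
For anyone who wants to script the steps above, here is a minimal Python sketch. It assumes `boinccmd` and `VBoxManage` are on the PATH and that the VM reports `VMState="saved"` once its state is on disk. The project URL, task name and VM name are placeholders you have to supply, and the function names are made up for illustration. It does not change the "leave app in memory" preference, does not shut the machine down, and is not part of BOINC or vboxwrapper.

```python
import subprocess
import time


def parse_vm_state(machinereadable: str) -> str:
    """Extract the VMState value from `VBoxManage showvminfo --machinereadable` output."""
    for line in machinereadable.splitlines():
        # "VMStateChangeTime=..." does not match this prefix, only "VMState=..." does.
        if line.startswith("VMState="):
            return line.split("=", 1)[1].strip().strip('"')
    return "unknown"


def wait_until_saved(vm_name: str, timeout_s: int = 300, poll_s: int = 5) -> bool:
    """Poll VirtualBox until the VM state is 'saved' (assumption), or give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
            capture_output=True, text=True, check=True,
        )
        if parse_vm_state(result.stdout) == "saved":
            return True
        time.sleep(poll_s)
    return False


def suspend_and_stop(project_url: str, task_name: str, vm_name: str) -> None:
    """Suspend one task, wait for the VM to be saved, then stop the BOINC client."""
    subprocess.run(
        ["boinccmd", "--task", project_url, task_name, "suspend"], check=True
    )
    if not wait_until_saved(vm_name):
        raise RuntimeError("VM was not saved in time; not stopping the client")
    subprocess.run(["boinccmd", "--quit"], check=True)
```

Calling `suspend_and_stop(...)` with your own values would cover the first three steps; shutting down the machine is left to you.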

It's not a BOINC or wrapper problem, but it seems ATLAS doesn't accept longer suspensions.

Example task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=158819174
ID: 32749
maeax

Joined: 2 May 07
Posts: 2189
Credit: 172,926,055
RAC: 40,772
Message 32752 - Posted: 10 Oct 2017, 11:16:09 UTC

The only difference for me is Leave in memory -> on.
I had a Windows update this morning.
After the reboot the ATLAS task ran without problems.
This update is also coming for the other two PCs. I will take a look.
ID: 32752
Yeti
Volunteer moderator

Joined: 2 Sep 04
Posts: 455
Credit: 198,363,188
RAC: 77,437
Message 32756 - Posted: 10 Oct 2017, 12:37:14 UTC

Hm, maybe the suspend happened in the initial phase of the WU, while it was downloading more details from outside.

In this phase, ATLAS WUs are sensitive to pausing / suspending.

Once the initial downloads are finished, ATLAS WUs can be suspended for quite a long time without problems.


Supporting BOINC, a great concept !
ID: 32756
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1371
Credit: 9,129,837
RAC: 3,792
Message 32757 - Posted: 10 Oct 2017, 13:01:04 UTC - in response to Message 32756.  

Yeti wrote:
Hm, maybe the suspend happened in the initial phase of the WU, while it was downloading more details from outside.

In this phase, ATLAS WUs are sensitive to pausing / suspending.

No, that was not the case:

2017-10-09 19:35:36 (5252): Guest Log: Starting ATLAS job.
2017-10-09 22:49:04 (5252): Successfully stopped VM.


@maeax:
The title of my thread says longer suspension. Just rebooting is normally only a short interruption of the task.
Leaving the task in memory and then stopping BOINC has the same effect: the task is saved to disk.
ID: 32757
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 32763 - Posted: 10 Oct 2017, 17:48:10 UTC

I am running ATLAS tasks 24/7 on my Windows 10 PC, but I set NNT (no new tasks) in order to control a batch of tasks. Today I had two running, and when they completed I ran a Windows update with an LHCb task running. Then I downloaded two more ATLAS tasks; one of them is running on 2 cores, the other is waiting to start. About 75% of completed and validated ATLAS tasks produce HITS files. The LHCb task uses 0 CPU and is doing absolutely nothing. CMS tasks fail.
Tullio
ID: 32763
Jonathan

Joined: 25 Sep 17
Posts: 99
Credit: 3,425,566
RAC: 7
Message 32782 - Posted: 11 Oct 2017, 11:06:41 UTC - in response to Message 32763.  

Tullio's CMS task 158795678

<core_client_version>7.8.2</core_client_version>
<![CDATA[
<message>
The filename or extension is too long.
(0xce) - exit code 206 (0xce)
</message>
<stderr_txt>
2017-10-08 15:26:09 (3056): vboxwrapper (7.7.26196): starting
2017-10-08 15:26:09 (3056): Feature: Checkpoint interval offset (69 seconds)
2017-10-08 15:26:09 (3056): Detected: VirtualBox COM Interface (Version: 5.1.22)
2017-10-08 15:26:09 (3056): Detected: Minimum checkpoint interval (600.000000 seconds)
2017-10-08 15:26:09 (3056): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
2017-10-08 15:26:09 (3056): Successfully copied 'init_data.xml' to the shared directory.
2017-10-08 15:26:09 (3056): Create VM. (boinc_cb05b53bc5486a6d, slot#0)
2017-10-08 15:26:09 (3056): Setting Memory Size for VM. (2048MB)
2017-10-08 15:26:09 (3056): Setting CPU Count for VM. (1)
2017-10-08 15:26:09 (3056): Setting Chipset Options for VM.
2017-10-08 15:26:09 (3056): Setting Boot Options for VM.
2017-10-08 15:26:09 (3056): Enabling VM Network Access.
2017-10-08 15:26:09 (3056): Setting Network Configuration for NAT.
2017-10-08 15:26:09 (3056): Disabling USB Support for VM.
2017-10-08 15:26:09 (3056): Disabling COM Port Support for VM.
2017-10-08 15:26:09 (3056): Disabling LPT Port Support for VM.
2017-10-08 15:26:09 (3056): Disabling Audio Support for VM.
2017-10-08 15:26:09 (3056): Disabling Clipboard Support for VM.
2017-10-08 15:26:09 (3056): Disabling Drag and Drop Support for VM.
2017-10-08 15:26:09 (3056): Adding storage controller(s) to VM.
2017-10-08 15:26:09 (3056): Adding virtual disk drive to VM. (vm_image.vdi)
2017-10-08 15:26:09 (3056): Adding VirtualBox Guest Additions to VM.
2017-10-08 15:26:09 (3056): Adding network bandwidth throttle group to VM. (Defaulting to 1024GB)
2017-10-08 15:26:09 (3056): forwarding host port 51955 to guest port 80
2017-10-08 15:26:09 (3056): Enabling remote desktop for VM.
2017-10-08 15:26:09 (3056): Enabling shared directory for VM.
2017-10-08 15:26:10 (3056): Starting VM. (boinc_cb05b53bc5486a6d, slot#0)
2017-10-08 15:26:23 (3056): Guest Log: BIOS: VirtualBox 5.1.22
2017-10-08 15:26:23 (3056): Guest Log: BIOS: ata0-0: PCHS=16383/16/63 LCHS=1024/255/63
2017-10-08 15:26:23 (3056): Guest Log: BIOS: Boot : bseqnr=1, bootseq=0032
2017-10-08 15:26:23 (3056): Guest Log: BIOS: Booting from Hard Disk...
2017-10-08 15:26:23 (3056): Guest Log: BIOS: KBD: unsupported int 16h function 03
2017-10-08 15:26:23 (3056): Guest Log: BIOS: AX=0305 BX=0000 CX=0000 DX=0000
2017-10-08 15:26:23 (3056): Successfully started VM. (PID = '10160')
2017-10-08 15:26:23 (3056): Reporting VM Process ID to BOINC.
2017-10-08 15:26:33 (3056): VM state change detected. (old = 'poweroff', new = 'running')
2017-10-08 15:26:43 (3056): Guest Log: vboxguest: misc device minor 56, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
2017-10-08 15:26:43 (3056): Detected: Web Application Enabled (http://localhost:51955)
2017-10-08 15:26:43 (3056): Detected: Remote Desktop Enabled (localhost:51956)
2017-10-08 15:26:53 (3056): Preference change detected
2017-10-08 15:26:53 (3056): Setting CPU throttle for VM. (100%)
2017-10-08 15:26:53 (3056): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 60 seconds) or (Vbox_job.xml: 600 seconds))
2017-10-08 15:27:33 (3056): Guest Log: VBoxService 4.3.28 r100309 (verbosity: 0) linux.amd64 (May 13 2015 17:11:31) release log
2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000091 main Log opened 2017-10-08T13:27:22.351168000Z
2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000339 main OS Product: Linux
2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000394 main OS Release: 4.1.34-22.cernvm.x86_64
2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000435 main OS Version: #1 SMP Mon Oct 24 14:29:58 CEST 2016
2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000476 main OS Service Pack: #1 SMP Mon Oct 24 14:29:58 CEST 2016
2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000518 main Executable: /usr/sbin/VBoxService
2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000519 main Process ID: 2703
2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.000520 main Package type: LINUX_64BITS_GENERIC
2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.001796 main 4.3.28 r100309 started. Verbose level = 0
2017-10-08 15:27:33 (3056): Guest Log: 00:00:00.052918 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared"
2017-10-08 15:28:34 (3056): Guest Log: [INFO] Mounting the shared directory
2017-10-08 15:28:34 (3056): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded!
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] 0
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Connection to lhchomeproxy.cern.ch 3125 port [tcp/a13-an] succeeded!
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] 0
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Testing VCCS connection to vccs.cern.ch on port 443
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Connection to vccs.cern.ch 443 port [tcp/https] succeeded!
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] 0
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] Connection to vccondor01.cern.ch 9618 port [tcp/condor] succeeded!
2017-10-08 15:28:34 (3056): Guest Log: [DEBUG] 0
2017-10-08 15:28:44 (3056): Guest Log: [DEBUG] Probing CVMFS ...
2017-10-08 15:28:44 (3056): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2017-10-08 15:29:04 (3056): Guest Log: Probing /cvmfs/cms.cern.ch... OK
2017-10-08 15:29:04 (3056): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2017-10-08 15:29:04 (3056): Guest Log: 2.2.0.0 3409 1 19744 4925 14 1 1255772 10240001 2 65024 0 20 95 20792 0 http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch http://128.142.33.31:3125 1
2017-10-08 15:29:34 (3056): VM state change detected. (old = 'running', new = 'paused')
2017-10-09 00:52:45 (3056): VM state change detected. (old = 'paused', new = 'running')
2017-10-09 00:53:15 (3056): Guest Log: [INFO] Reading volunteer information
2017-10-09 00:53:15 (3056): Guest Log: [INFO] Volunteer: tullio (96166) Host: 10407309
2017-10-09 00:53:15 (3056): Guest Log: [INFO] VMID: 6d0ae20b-f23e-4d5d-b5ca-600a8fb1d26c
2017-10-09 00:53:15 (3056): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2017-10-09 00:53:15 (3056): Guest Log: [INFO] Running the fast benchmark.
2017-10-09 00:54:05 (3056): Guest Log: [INFO] Machine performance 10.82 HEPSEC06
2017-10-09 00:54:05 (3056): Guest Log: [INFO] CMS application starting. Check log files.
2017-10-09 00:54:05 (3056): Guest Log: [DEBUG] HTCondor ping
2017-10-09 00:54:15 (3056): Guest Log: [DEBUG] 0
2017-10-09 02:30:15 (3056): Status Report: Job Duration: '64800.000000'
2017-10-09 02:30:15 (3056): Status Report: Elapsed Time: '6000.496927'
2017-10-09 02:30:15 (3056): Status Report: CPU Time: '230.703125'
2017-10-09 02:48:37 (3056): VM state change detected. (old = 'running', new = 'paused')
2017-10-09 04:29:04 (3056): VM state change detected. (old = 'paused', new = 'running')
2017-10-09 04:29:24 (3056): Guest Log: [ERROR] Condor exited after 12916s without running a job.
2017-10-09 04:29:24 (3056): Guest Log: [INFO] Shutting Down.
2017-10-09 04:29:24 (3056): VM Completion File Detected.
2017-10-09 04:29:24 (3056): VM Completion Message: Condor exited after 12916s without running a job.
.
2017-10-09 04:29:24 (3056): Powering off VM.
2017-10-09 04:29:25 (3056): Successfully stopped VM.
2017-10-09 04:29:30 (3056): Deregistering VM. (boinc_cb05b53bc5486a6d, slot#0)
2017-10-09 04:29:30 (3056): Removing virtual disk drive(s) from VM.
2017-10-09 04:29:30 (3056): Removing network bandwidth throttle group from VM.
2017-10-09 04:29:30 (3056): Removing storage controller(s) from VM.
2017-10-09 04:29:30 (3056): Removing VM from VirtualBox.
04:29:36 (3056): called boinc_finish(206)

</stderr_txt>
]]>
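
The pause lengths in a vboxwrapper log like the one above can be pulled out with a few lines of Python. This is only a sketch: it assumes the "VM state change detected" lines keep exactly the format shown in this log, and the function name is made up here.

```python
import re
from datetime import datetime, timedelta
from typing import List

# Matches e.g.
# 2017-10-08 15:29:34 (3056): VM state change detected. (old = 'running', new = 'paused')
STATE_CHANGE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"VM state change detected\. \(old = '(\w+)', new = '(\w+)'\)"
)


def pause_intervals(log_text: str) -> List[timedelta]:
    """Return the length of each paused -> running interval found in the log."""
    intervals = []
    paused_at = None
    for line in log_text.splitlines():
        match = STATE_CHANGE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        new_state = match.group(3)
        if new_state == "paused":
            # Remember when the VM was paused.
            paused_at = stamp
        elif new_state == "running" and paused_at is not None:
            # VM resumed: record how long it was paused.
            intervals.append(stamp - paused_at)
            paused_at = None
    return intervals
```

Fed the stderr text above, this should report two pauses of 9:23:11 and 1:40:27.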
ID: 32782
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 32798 - Posted: 11 Oct 2017, 17:12:37 UTC

ATLAS tasks run on 2 cores. All other LHC tasks fail except SixTrack, which is rarely available. Why?
Tullio
ID: 32798
Jonathan

Joined: 25 Sep 17
Posts: 99
Credit: 3,425,566
RAC: 7
Message 32799 - Posted: 11 Oct 2017, 20:05:10 UTC - in response to Message 32798.  

You should start a new thread in the proper job forum with all your details, custom setup files, etc.
ID: 32799
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 32828 - Posted: 14 Oct 2017, 9:48:51 UTC - in response to Message 32799.  
Last modified: 14 Oct 2017, 9:50:11 UTC

I am running stock apps with no custom setup files, and this is the case in all my BOINC projects: SETI, Einstein, climateprediction.net, LHC. I am running single-core ATLAS on a Linux box and two-core ATLAS on the Windows 10 PC. I run SETI and Einstein GPU tasks on another Linux box with a GTX 750 Ti GPU board with no errors. Only LHC@home tasks, except ATLAS and SixTrack, error out on my fastest machine, the Windows 10 PC with 22 GB RAM.
Tullio
ID: 32828
HerveUAE

Joined: 18 Dec 16
Posts: 123
Credit: 37,495,365
RAC: 0
Message 32836 - Posted: 15 Oct 2017, 18:16:10 UTC - in response to Message 32828.  
Last modified: 15 Oct 2017, 18:16:21 UTC

Same for me, and I have given up trying to run CMS, LHCb and Theory tasks on my machines. But yes, those are matters for other discussion threads.
We are the product of random evolution.
ID: 32836
maeax

Joined: 2 May 07
Posts: 2189
Credit: 172,926,055
RAC: 40,772
Message 32848 - Posted: 17 Oct 2017, 1:50:13 UTC

Had a problem because of a Windows Update.

This task was suspended for six hours.

2017-10-16 16:09:20 (8052): VM state change detected. (old = 'running', new = 'paused')
2017-10-17 02:38:31 (8052): VM state change detected. (old = 'paused', new = 'running')

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=76552885

After a restart it finished with a small amount of work and got cobblestones.
ID: 32848
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1371
Credit: 9,129,837
RAC: 3,792
Message 32852 - Posted: 18 Oct 2017, 7:07:22 UTC

Another 2 tasks were killed early after being resumed from their night's sleep. Both VMs were properly saved and had done about 30 events.
After the 2 tasks were resumed, 40 seconds apart, both VMs stopped after about 3 minutes.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=159256150
https://lhcathome.cern.ch/lhcathome/result.php?resultid=159262082

For Theory this has been working fine for several years now, but for ATLAS (and CMS) suspending and resuming does not work the way BOINC designed it.
A resumed Theory job in the VM continues from where it was saved.

A task can be suspended for several reasons, so suspension should work well for the science done inside a VM too.

Some reasons for a suspend:
- User's wish
- User's host busy with non-BOINC tasks
- Another project has a higher resource share
- Other tasks start running in high priority because of a possible deadline miss
- etc.

Please make suspend work as well for ATLAS as it does for Theory.
ID: 32852
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 32854 - Posted: 19 Oct 2017, 9:36:50 UTC - in response to Message 32852.  

This is due to the "pilot killed looping job" errors reported on the other thread. I found out that in ATLAS code there is a periodic check that files used by the WU are updated, and if they are not it thinks the process is stuck and terminates it. When a WU is resumed after a long (more than 12 hours) suspension this check is run and since nothing was updated since the suspension the process is killed.

I've now increased this time limit to one week, so newly submitted WUs should be OK to suspend overnight or even for a few days.
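
To illustrate why a suspended VM trips that check, here is a rough Python sketch of a staleness test. It is not the actual ATLAS pilot code; the function name and the timestamps are invented, and only the 12-hour and one-week limits come from the description above.

```python
from datetime import datetime, timedelta


def looks_stuck(last_update: datetime, now: datetime, limit: timedelta) -> bool:
    """Treat the job as stuck if its files were not updated within `limit`."""
    return now - last_update > limit


# Hypothetical timeline: files last written just before the suspend,
# check runs right after the resume 13 hours later.
suspended_at = datetime(2017, 10, 17, 22, 0, 0)
resumed_at = suspended_at + timedelta(hours=13)

old_limit = timedelta(hours=12)  # limit before the change
new_limit = timedelta(weeks=1)   # limit after the change

print(looks_stuck(suspended_at, resumed_at, old_limit))  # job would be killed
print(looks_stuck(suspended_at, resumed_at, new_limit))  # job would survive
```

The point is that the check compares wall-clock time, which keeps advancing while the VM is saved, so a healthy but suspended job is indistinguishable from a stuck one until the limit is raised.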
ID: 32854
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1371
Credit: 9,129,837
RAC: 3,792
Message 32858 - Posted: 19 Oct 2017, 14:30:09 UTC - in response to Message 32854.  

David Cameron wrote:
When a WU is resumed after a long (more than 12 hours) suspension this check is run and since nothing was updated since the suspension the process is killed.

My nights are always short, less than 12 hours:

2017-10-17 22:42:10 (12768): Stopping VM.
2017-10-17 22:42:43 (12768): Successfully stopped VM.
2017-10-18 07:20:06 (6028): vboxwrapper (7.7.26196): starting


2017-10-17 22:43:07 (8516): Stopping VM.
2017-10-17 22:43:34 (8516): Successfully stopped VM.
2017-10-18 07:20:47 (5476): vboxwrapper (7.7.26196): starting


This afternoon I tested a suspension of 1 hour and 22 minutes (Guest host hibernated) and those 2 tasks survived:

LHC@home 19 Oct 14:01:41 task DJWLDmQRgOrnSu7Ccp2YYBZmABFKDmABFKDmOONKDmABFKDm8m4Dgn_1 suspended by user
LHC@home 19 Oct 14:01:41 task r7pKDmpQLPrnDDn7oo6G73TpABFKDmABFKDmNMMKDmABFKDmVyT6Co_0 suspended by user
LHC@home 19 Oct 15:23:19 task DJWLDmQRgOrnSu7Ccp2YYBZmABFKDmABFKDmOONKDmABFKDm8m4Dgn_1 resumed by user
LHC@home 19 Oct 15:23:39 task r7pKDmpQLPrnDDn7oo6G73TpABFKDmABFKDmNMMKDmABFKDmVyT6Co_0 resumed by user
ID: 32858
maeax

Joined: 2 May 07
Posts: 2189
Credit: 172,926,055
RAC: 40,772
Message 34914 - Posted: 8 Apr 2018, 15:38:58 UTC - in response to Message 32854.  
Last modified: 8 Apr 2018, 15:39:48 UTC

David Cameron wrote:
This is due to the "pilot killed looping job" errors reported on the other thread. I found out that in ATLAS code there is a periodic check that files used by the WU are updated, and if they are not it thinks the process is stuck and terminates it. When a WU is resumed after a long (more than 12 hours) suspension this check is run and since nothing was updated since the suspension the process is killed.

I've now increased this time limit to one week so new WU submitted should be ok to suspend overnight or even for a few days.


I have a PC that crashed several times in Windows 10 Pro (x64) today.
When ATLAS starts again, the console shows the last collisions it had worked on, BUT
after a few seconds it begins from the start with the first collision. This happened every time the PC crashed!
Is it possible to save the running WU's progress every 15 minutes or so?
ID: 34914
