Bad WUs?
Author | Message |
---|---|
Joined: 27 Sep 04 Posts: 104 Credit: 8,076,141 RAC: 2,074 |
The last two ATLAS WUs that I have downloaded appear to be faulty. One ran for 3 days before I noticed that, while I could access the VM, ALT-F2 had no effect at all. I aborted that one. The latest one has now been running for about an hour and exhibits the same behavior. Also, while BOINC reports the WU as running, Resource Monitor shows VBox using no CPU at all. I have rebooted and upgraded to the latest VBox version with no change. TIA for any ideas or suggestions. |
Joined: 15 Jun 08 Posts: 2520 Credit: 252,438,644 RAC: 135,361 |
Unfortunately this task has no error log that can be checked: https://lhcathome.cern.ch/lhcathome/result.php?resultid=323061097 Be so kind as to post stderr.txt from the currently idle task here before you cancel that task. |
Joined: 27 Sep 04 Posts: 104 Credit: 8,076,141 RAC: 2,074 |
I had already aborted the second one. It doesn't surprise me that there is no log file; the WUs never got any CPU time. But a third, previously downloaded and queued, has started and is responding normally. |
Joined: 15 Jun 08 Posts: 2520 Credit: 252,438,644 RAC: 135,361 |
This time the task wrote an error log: https://lhcathome.cern.ch/lhcathome/result.php?resultid=323117065
The following lines show what caused it to fail. The task could not contact the relevant CVMFS repositories:
2021-08-03 11:29:51 (20508): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2021-08-03 11:29:51 (20508): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2021-08-03 11:29:52 (20508): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!
It's very unlikely that all of them failed at the very same moment, since:
- each repository has a couple of fail-over servers spread around the world
- other users didn't have similar problems.
It's more likely that there was an issue with your internet access or within your LAN. |
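To check a local stderr.txt for the same symptom, the probe lines quoted above can be scanned with a short script. This is only a minimal sketch: the function name `failed_cvmfs_probes` is made up for illustration, and it assumes that a successful probe is reported as `OK` in place of `Failed!`, which the log excerpt in this thread does not show.

```python
import re

# Matches the "Guest Log: Probing <repo>... <result>" lines quoted in the post above.
PROBE_RE = re.compile(r"Guest Log: Probing (/cvmfs/\S+?)\.\.\. (\S+)")


def failed_cvmfs_probes(stderr_text):
    """Return the CVMFS repositories whose probe result is not 'OK'.

    Assumption: a successful probe prints 'OK'; anything else counts as a failure.
    """
    failed = []
    for match in PROBE_RE.finditer(stderr_text):
        repo, result = match.groups()
        if result.rstrip("!").lower() != "ok":
            failed.append(repo)
    return failed


if __name__ == "__main__":
    # Hypothetical usage: run the script from inside the task's slot directory.
    with open("stderr.txt", encoding="utf-8", errors="replace") as handle:
        for repo in failed_cvmfs_probes(handle.read()):
            print("probe failed:", repo)
```

If every repository shows up in the list at once, that fits the explanation above that the cause is more likely local networking than the repositories themselves.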
Joined: 2 May 07 Posts: 2230 Credit: 173,850,078 RAC: 17,550 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=323709669
2021-08-14 20:44:43 (16612): Status Report: Elapsed Time: '6000.000000'
2021-08-14 20:44:43 (16612): Status Report: CPU Time: '35.015625'
2021-08-14 22:24:55 (16612): Status Report: Elapsed Time: '12000.000000'
2021-08-14 22:24:55 (16612): Status Report: CPU Time: '47.625000'
2021-08-15 00:05:04 (16612): Status Report: Elapsed Time: '18000.000000'
2021-08-15 00:05:04 (16612): Status Report: CPU Time: '60.656250'
2021-08-15 01:45:12 (16612): Status Report: Elapsed Time: '24000.000000'
2021-08-15 01:45:12 (16612): Status Report: CPU Time: '73.750000'
2021-08-15 03:25:20 (16612): Status Report: Elapsed Time: '30000.000000'
2021-08-15 03:25:20 (16612): Status Report: CPU Time: '86.875000'
2021-08-15 05:05:29 (16612): Status Report: Elapsed Time: '36000.807667'
2021-08-15 05:05:29 (16612): Status Report: CPU Time: '99.953125'
2021-08-15 06:45:37 (16612): Status Report: Elapsed Time: '42000.807667'
2021-08-15 06:45:37 (16612): Status Report: CPU Time: '113.000000'
2021-08-15 08:25:46 (16612): Status Report: Elapsed Time: '48000.807667'
2021-08-15 08:25:46 (16612): Status Report: CPU Time: '125.218750'
2021-08-15 10:05:55 (16612): Status Report: Elapsed Time: '54000.807667'
2021-08-15 10:05:55 (16612): Status Report: CPU Time: '137.500000'
2021-08-15 11:46:04 (16612): Status Report: Elapsed Time: '60000.807667'
2021-08-15 11:46:04 (16612): Status Report: CPU Time: '150.109375'
2021-08-15 12:01:01 (16612): Powering off VM.
Is it possible to stop such a task from the System?? |
Joined: 14 Jan 10 Posts: 1413 Credit: 9,434,983 RAC: 9,630 |
Is it possible to stop such a task from the System??
What do you mean by 'from the System': by the server, by your OS, or a clean stop instead of an abort by you? A clean stop can be done by hand or by script. If you use the following script, adjust the BOINC path (D:\Boinc1) to your setup.
@echo off
rem Ask for the number of the slot directory that holds the stuck task.
set "slotdir="
set /p "slotdir=In which slot-directory is the endless ATLAS-task running you want to stop gracefully? "
set "boincpath=D:\Boinc1\slots\%slotdir%\shared"
rem Create an empty atlas_done file in the slot's shared folder.
copy /y NUL "%boincpath%\atlas_done" >NUL
exit |
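For hosts where a batch file is inconvenient, the same step can be sketched in Python. This is a hedged equivalent of the batch script in the post above, not an official tool: the function name `request_graceful_stop` is invented here, and the only behaviour taken from the thread is that an empty file named `atlas_done` is created in the `shared` folder of the task's slot directory.

```python
from pathlib import Path


def request_graceful_stop(boinc_data_dir, slot):
    """Create an empty atlas_done file in <boinc_data_dir>/slots/<slot>/shared.

    Mirrors the batch script from the thread; the directory layout is the one
    used in that script.
    """
    shared = Path(boinc_data_dir) / "slots" / str(slot) / "shared"
    if not shared.is_dir():
        # Refuse to create directories: a missing folder usually means a wrong slot number.
        raise FileNotFoundError(f"no shared directory at {shared}")
    marker = shared / "atlas_done"
    marker.write_bytes(b"")  # like "copy /y NUL": create or truncate to an empty file
    return marker


if __name__ == "__main__":
    # D:\Boinc1 is the example path from the post; adjust it to your installation.
    slot = input("Slot directory of the endless ATLAS task: ").strip()
    print("created", request_graceful_stop(r"D:\Boinc1", slot))
```

Failing on a missing `shared` folder is a deliberate choice in this sketch, so that a mistyped slot number does not silently create a stray file elsewhere.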
Joined: 2 May 07 Posts: 2230 Credit: 173,850,078 RAC: 17,550 |
What do you mean with: 'from the System':
The task started without getting any work from ATLAS. After 16 hours I stopped it manually. I have no idea how this task could run without any ATLAS collisions. It's the first task showing this problem for me. The server could stop such a task when it sees no work for hours. This is the number of successful tasks before this one: number of completed tasks 1,786 |
Joined: 2 May 07 Posts: 2230 Credit: 173,850,078 RAC: 17,550 |
The wingman (Agile Boincers) finished this task today, so it is not a bad WU, but this Windows task still ran for a long time without doing any work. |
Joined: 26 Oct 18 Posts: 96 Credit: 4,188,598 RAC: 0 |
https://lhcathome.cern.ch/lhcathome/result.php?resultid=323711966 That one ran for 100 hours on my host, but after some point it wasn't actually doing anything. Now it looks like it was never doing anything, but I saw it start up normally and begin crunching normally. I had noticed some time ago that it was gone, but wanted to see if a miracle recovery would happen. |
Joined: 2 May 07 Posts: 2230 Credit: 173,850,078 RAC: 17,550 |
VirtualBox under Windows (6.1.12 and 6.1.26): the vm_image.vdi entries are showing a yellow triangle (CMS and ATLAS!). It seems to be a networking problem that occurs when a task is not disconnected correctly. After this, new tasks have problems running well. https://lhcathome.cern.ch/lhcathome/result.php?resultid=323867357
This is the current content of stderr.txt. This task also seems to have no end.
2021-08-18 01:01:29 (16452): Guest Log: Checking CVMFS...
2021-08-18 01:01:36 (16452): Guest Log: CVMFS is ok
2021-08-18 01:01:36 (16452): Guest Log: Mounting shared directory
2021-08-18 02:41:25 (16452): Status Report: Elapsed Time: '6000.000000'
2021-08-18 02:41:25 (16452): Status Report: CPU Time: '6000.937500'
2021-08-18 04:21:40 (16452): Status Report: Elapsed Time: '12000.000000'
2021-08-18 04:21:40 (16452): Status Report: CPU Time: '12002.625000'
2021-08-18 06:01:57 (16452): Status Report: Elapsed Time: '18000.000000'
2021-08-18 06:01:57 (16452): Status Report: CPU Time: '17995.812500'
2021-08-18 07:42:07 (16452): Status Report: Elapsed Time: '24000.000000'
2021-08-18 07:42:07 (16452): Status Report: CPU Time: '23998.875000'
2021-08-18 09:22:17 (16452): Status Report: Elapsed Time: '30000.000000'
2021-08-18 09:22:17 (16452): Status Report: CPU Time: '30002.203125'
2021-08-18 11:02:26 (16452): Status Report: Elapsed Time: '36000.000000'
2021-08-18 11:02:26 (16452): Status Report: CPU Time: '36005.656250'
2021-08-18 12:42:35 (16452): Status Report: Elapsed Time: '42000.000000'
2021-08-18 12:42:35 (16452): Status Report: CPU Time: '42009.125000'
2021-08-18 14:22:44 (16452): Status Report: Elapsed Time: '48000.000000'
2021-08-18 14:22:44 (16452): Status Report: CPU Time: '48012.171875'
2021-08-18 16:02:56 (16452): Status Report: Elapsed Time: '54000.000000'
2021-08-18 16:02:56 (16452): Status Report: CPU Time: '54014.734375'
2021-08-18 17:43:12 (16452): Status Report: Elapsed Time: '60000.000000'
2021-08-18 17:43:12 (16452): Status Report: CPU Time: '60015.000000'
2021-08-18 19:23:29 (16452): Status Report: Elapsed Time: '66000.000000'
2021-08-18 19:23:29 (16452): Status Report: CPU Time: '66015.500000'
2021-08-18 21:03:45 (16452): Status Report: Elapsed Time: '72000.000000'
2021-08-18 21:03:45 (16452): Status Report: CPU Time: '72014.781250'
2021-08-18 22:44:02 (16452): Status Report: Elapsed Time: '78000.000000'
2021-08-18 22:44:02 (16452): Status Report: CPU Time: '78013.890625'
2021-08-19 00:24:24 (16452): Status Report: Elapsed Time: '84000.000000'
2021-08-19 00:24:24 (16452): Status Report: CPU Time: '84003.593750'
2021-08-19 02:04:44 (16452): Status Report: Elapsed Time: '90000.000000'
2021-08-19 02:04:44 (16452): Status Report: CPU Time: '89999.718750' |
Joined: 15 Jun 08 Posts: 2520 Credit: 252,438,644 RAC: 135,361 |
So far this task looks fine. Elapsed Time and CPU Time are very close together:
2021-08-19 02:04:44 (16452): Status Report: Elapsed Time: '90000.000000'
2021-08-19 02:04:44 (16452): Status Report: CPU Time: '89999.718750'
Take a look at ATLAS Monitoring on console 2. It will show how many events have been processed and how many are still open. If the task is running normally, it writes to its internal logfiles, which ATLAS Monitoring then uses to calculate some numbers.
Regarding the vm_image.vdi entries showing the yellow triangle: this is a problem of VirtualBox's device manager. It just tells you that the device manager did not remove the pointer to a vdi file that has already been deleted. It is definitely not a network issue. A communication problem or a timeout between VirtualBox, vboxwrapper and BOINC would be more likely. |
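The comparison of Elapsed Time and CPU Time can also be done mechanically on a saved stderr.txt. The following is a rough sketch under stated assumptions: the function name `latest_cpu_ratio` is hypothetical, the log format is taken from the Status Report lines quoted in this thread, and for multi-core tasks the CPU time may legitimately exceed the elapsed time.

```python
import re

# Matches the vboxwrapper "Status Report" lines quoted in this thread.
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")


def latest_cpu_ratio(stderr_text):
    """Return CPU time divided by elapsed time from the last status report.

    Returns None if no complete report with a non-zero elapsed time is found.
    """
    elapsed = cpu = None
    for label, value in STATUS_RE.findall(stderr_text):
        if label == "Elapsed Time":
            elapsed = float(value)
        else:
            cpu = float(value)
    if not elapsed or cpu is None:
        return None
    return cpu / elapsed


if __name__ == "__main__":
    with open("stderr.txt", encoding="utf-8", errors="replace") as handle:
        ratio = latest_cpu_ratio(handle.read())
    print("no status report found" if ratio is None else f"CPU/elapsed = {ratio:.3f}")
```

A ratio close to 1 matches the task described as looking fine above, while the earlier task that reported about 150 seconds of CPU time after 60000 seconds elapsed would give a ratio near zero. Where to draw the line between the two is a judgment call that the thread does not settle.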
Joined: 2 May 07 Posts: 2230 Credit: 173,850,078 RAC: 17,550 |
This is not possible because RDP shows a kernel panic. I am currently waiting for a second ATLAS task to finish (console F2 says 158 of 200 collisions are done); after that I will restart BOINC and this faulty ATLAS task. These are the current last lines of vbox.log:
24:51:57.510201 VRDP: Connection closed: 1
24:51:57.510328 VBVA: VRDP acceleration has been disabled.
31:11:25.437415 VRDP: New connection:
31:11:25.437592 VRDP: Connection opened (IPv6): 2
31:11:25.437765 VRDP: Negotiating security method with the client.
31:11:25.438705 VRDP: failed to access the server certificate file '': VERR_FILE_NOT_FOUND
31:11:25.438805 VRDP: Connection closed: 2
31:11:27.329274 VRDP: New connection:
31:11:27.329450 VRDP: Connection opened (IPv6): 3
31:11:27.329666 VRDP: Negotiating security method with the client.
31:11:27.341875 VRDP: Methods 0x0000001b
31:11:27.341941 VRDP: Channel: [rdpdr] [1004]. Accepted.
31:11:27.341957 VRDP: Channel: [rdpsnd] [1005]. Accepted.
31:11:27.341973 VRDP: Channel: [cliprdr] [1006]. Accepted.
31:11:27.341990 VRDP: Channel: [drdynvc] [1007]. Accepted.
31:11:27.342005 VRDP: Unsupported SEC_TAG: 0xC006/8. Skipping.
31:11:27.342020 VRDP: Unsupported SEC_TAG: 0xC00A/8. Skipping.
31:11:27.408733 VRDP: Client seems to be MSFT.
31:11:27.408777 VRDP: Logon: PCRYZEN9 (::1) build 19041. User: [] Domain: [] Screen: 0
31:11:27.408948 AUTH: User: []. Domain: []. Authentication type: [Null]
31:11:27.408963 AUTH: Access granted.
31:11:27.409828 VRDP: Enabling upstream audio.
31:11:27.409958 VBVA: VRDP acceleration has been requested.
31:11:27.414514 VMMDev: SetVideoModeHint: Got a video mode hint (1920x1080x32)@(0x0),(1;0) at 0
31:11:27.414652 VRDP: SunFlsh disabled.
31:11:27.455692 VRDP: SCARD enabled for 4 |
Joined: 2 May 07 Posts: 2230 Credit: 173,850,078 RAC: 17,550 |
a restart of Boinc and this faulty Atlas will be done.
After this, the ATLAS task finished correctly. So this one was not faulty either. I will stop running ATLAS on Windows for a while. |
Joined: 27 Sep 08 Posts: 831 Credit: 688,793,162 RAC: 131,759 |
I have some WUs that are long runners, and Alt-F2 does not do anything even after a reboot. Are these legitimate VMs to which no one remembered to add the code, or junk? |
Joined: 27 Sep 08 Posts: 831 Credit: 688,793,162 RAC: 131,759 |
Actually, a few minutes after the reboot they were doing something. I'm not sure what happened in the last 24 hours, though. |
Joined: 28 Sep 04 Posts: 722 Credit: 48,417,130 RAC: 27,413 |
The latest tasks I have been running take 25...50 hours when running on single core. The consoles seem to work normally. I haven't tried rebooting though. |
Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
They are taking about 6 hours on two cores (or processors?) of my Intel i5 9400F. It is a 3-core CPU with six processors. The OS is Windows 10. Tullio |
Joined: 28 Sep 04 Posts: 722 Credit: 48,417,130 RAC: 27,413 |
My latest ones (downloaded about 3 days ago) are taking about 16...18 hours on a single core, so they are now shorter. Unfortunately the credit has dropped even more. Per runtime second it is now about half of what it was last week. |
Joined: 2 May 07 Posts: 2230 Credit: 173,850,078 RAC: 17,550 |
This WU stopped running after 10 minutes on all computers: https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=172206360 |
Joined: 2 May 07 Posts: 2230 Credit: 173,850,078 RAC: 17,550 |
[2021-10-26 08:27:39] 2021-10-26 06:27:39,440 [wrapper] apfmon messages muted
[2021-10-26 08:27:39] *** Error codes and diagnostics ***
[2021-10-26 08:27:39] "exeErrorCode": 39,
[2021-10-26 08:27:39] "exeErrorDiag": "CVMFS DBRelease setup file /cvmfs/atlas.cern.ch/repo/sw/database/DBRelease/current/setup.py was not readable",
[2021-10-26 08:27:39] "pilotErrorCode": 1305,
[2021-10-26 08:27:39] "pilotErrorDiag": "Failed to execute payload",
[2021-10-26 08:27:39] *** Listing of results directory ***
[2021-10-26 08:27:39] insgesamt 261744
Atlas native-VM CentOS7.
https://lhcathome.cern.ch/lhcathome/results.php?userid=75468&offset=0&show_names=0&state=5&appid= |
©2024 CERN