Bad WUs?

keputnam · Joined: 27 Sep 04 · Posts: 102 · Credit: 7,295,359 · RAC: 7,016
Message 45174 - Posted: 3 Aug 2021, 19:44:25 UTC - Last modified: 3 Aug 2021, 19:45:40 UTC

The last two ATLAS WUs that I downloaded appear to be faulty.

One ran for 3 days before I noticed that, while I could access the VM, ALT-F2 had no effect at all. I aborted that one.

The latest one has now been running for about an hour and exhibits the same behavior. Also, while BOINC reports the WU as running, Resource Monitor shows VBox using no CPU at all.

I have rebooted and upgraded to the latest VBox version, with no change.

TIA for any ideas or suggestions.

computezrmle · Volunteer moderator · Volunteer developer · Volunteer tester · Help desk expert · Joined: 15 Jun 08 · Posts: 2413 · Credit: 226,535,347 · RAC: 131,647
Message 45175 - Posted: 3 Aug 2021, 20:31:27 UTC - in response to Message 45174

Unfortunately this task has no error log that can be checked:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=323061097

Be so kind as to post stderr.txt from the currently idle task here before you cancel that task.

keputnam · Joined: 27 Sep 04 · Posts: 102 · Credit: 7,295,359 · RAC: 7,016
Message 45176 - Posted: 3 Aug 2021, 22:56:16 UTC

I already aborted the second one.

It doesn't surprise me that there is no log file; the WUs never got any CPU time.

But a third one, previously downloaded and queued, has started and is responding normally.

computezrmle · Volunteer moderator · Volunteer developer · Volunteer tester · Help desk expert · Joined: 15 Jun 08 · Posts: 2413 · Credit: 226,535,347 · RAC: 131,647
Message 45178 - Posted: 4 Aug 2021, 7:05:38 UTC - in response to Message 45176

This time the task wrote an error log:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=323117065

The following lines show what causes it to fail. The task could not contact relevant CVMFS repositories:

2021-08-03 11:29:51 (20508): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2021-08-03 11:29:51 (20508): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2021-08-03 11:29:52 (20508): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!

It's very unlikely that all of them fail at the very same moment since:
- each repository has a couple of fail-over servers spread around the world
- other users didn't have similar problems.

It's more likely that there was an issue regarding your internet access or an issue within your LAN.
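For readers who want to check their own stderr.txt for the same symptom, here is a minimal Python sketch. It only assumes the "Guest Log: Probing ... Failed!" line format quoted in the post; the function name `failed_cvmfs_probes` is made up for illustration, and other probe outcomes are deliberately not interpreted.

```python
import re

# Pattern derived from the three log lines quoted in the post.
# Lines with any other probe outcome are simply not matched.
PROBE_FAILED = re.compile(r"Guest Log: Probing (/cvmfs/\S+?)\.\.\. Failed!")


def failed_cvmfs_probes(stderr_text):
    """Return the CVMFS repositories whose probe line ends in 'Failed!'."""
    return PROBE_FAILED.findall(stderr_text)


sample = """\
2021-08-03 11:29:51 (20508): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2021-08-03 11:29:51 (20508): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2021-08-03 11:29:52 (20508): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!
"""

for repo in failed_cvmfs_probes(sample):
    print("probe failed:", repo)
```

If every repository shows up in the result at once, that fits the reasoning in the post: a problem on the volunteer's side of the connection is more plausible than all fail-over servers being down together.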

maeax · Joined: 2 May 07 · Posts: 2101 · Credit: 159,819,191 · RAC: 123,837
Message 45205 - Posted: 15 Aug 2021, 10:10:36 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=323709669

2021-08-14 20:44:43 (16612): Status Report: Elapsed Time: '6000.000000'
2021-08-14 20:44:43 (16612): Status Report: CPU Time: '35.015625'
2021-08-14 22:24:55 (16612): Status Report: Elapsed Time: '12000.000000'
2021-08-14 22:24:55 (16612): Status Report: CPU Time: '47.625000'
2021-08-15 00:05:04 (16612): Status Report: Elapsed Time: '18000.000000'
2021-08-15 00:05:04 (16612): Status Report: CPU Time: '60.656250'
2021-08-15 01:45:12 (16612): Status Report: Elapsed Time: '24000.000000'
2021-08-15 01:45:12 (16612): Status Report: CPU Time: '73.750000'
2021-08-15 03:25:20 (16612): Status Report: Elapsed Time: '30000.000000'
2021-08-15 03:25:20 (16612): Status Report: CPU Time: '86.875000'
2021-08-15 05:05:29 (16612): Status Report: Elapsed Time: '36000.807667'
2021-08-15 05:05:29 (16612): Status Report: CPU Time: '99.953125'
2021-08-15 06:45:37 (16612): Status Report: Elapsed Time: '42000.807667'
2021-08-15 06:45:37 (16612): Status Report: CPU Time: '113.000000'
2021-08-15 08:25:46 (16612): Status Report: Elapsed Time: '48000.807667'
2021-08-15 08:25:46 (16612): Status Report: CPU Time: '125.218750'
2021-08-15 10:05:55 (16612): Status Report: Elapsed Time: '54000.807667'
2021-08-15 10:05:55 (16612): Status Report: CPU Time: '137.500000'
2021-08-15 11:46:04 (16612): Status Report: Elapsed Time: '60000.807667'
2021-08-15 11:46:04 (16612): Status Report: CPU Time: '150.109375'
2021-08-15 12:01:01 (16612): Powering off VM.

Is it possible to stop such a task from the System??

Crystal Pellet · Volunteer moderator · Volunteer tester · Joined: 14 Jan 10 · Posts: 1280 · Credit: 8,496,817 · RAC: 2,374
Message 45206 - Posted: 15 Aug 2021, 13:46:14 UTC - in response to Message 45205

> Is it possible to stop such a task from the System??

What do you mean by "from the System": by the server, by your OS, or a clean stop instead of an abort by you?
A clean stop can be done by hand or by script.
If you use the following script, you have to adjust the boincpath folder (D:\Boinc1) to your needs.

@echo off
rem Ask for the number of the slot directory that holds the stuck ATLAS task.
set "slotdir="
set /p "slotdir=In which slot-directory is the endless ATLAS-task running you want to stop gracefully? "
rem Adjust D:\Boinc1 to your own BOINC data directory.
set "boincpath=D:\Boinc1\slots\%slotdir%\shared"
rem Create an empty atlas_done file in the task's shared folder.
copy /y NUL "%boincpath%\atlas_done" >NUL
exit
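The same idea can be sketched in Python for anyone who prefers that to a batch file. This is only an illustration of the approach in the post: the `slots/<n>/shared` layout and the `atlas_done` file name are taken from the batch script, while the function name `request_graceful_stop` and its parameters are hypothetical. The demo runs against a throw-away directory, not a real BOINC installation.

```python
import tempfile
from pathlib import Path


def request_graceful_stop(boinc_dir, slot):
    """Create an empty atlas_done file in <boinc_dir>/slots/<slot>/shared.

    The directory layout and the file name are copied from the batch
    script in the post; nothing else about the task is checked here.
    """
    shared = Path(boinc_dir) / "slots" / str(slot) / "shared"
    if not shared.is_dir():
        raise FileNotFoundError(f"no shared directory at {shared}")
    marker = shared / "atlas_done"
    marker.write_bytes(b"")  # same result as "copy /y NUL": an empty file
    return marker


if __name__ == "__main__":
    # Demo against a temporary directory tree that mimics the slot layout.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "slots" / "3" / "shared").mkdir(parents=True)
        print(request_graceful_stop(tmp, 3))
```

Raising an error when the shared directory is missing is a design choice of this sketch: it avoids silently creating the file in the wrong place when the slot number is mistyped.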

maeax · Joined: 2 May 07 · Posts: 2101 · Credit: 159,819,191 · RAC: 123,837
Message 45207 - Posted: 15 Aug 2021, 14:24:33 UTC - in response to Message 45206 - Last modified: 15 Aug 2021, 14:36:01 UTC

> What do you mean by "from the System": by the server, by your OS, or a clean stop instead of an abort by you?
> A clean stop can be done by hand or by script.

The task started without any work from ATLAS.
After 16 hours, I stopped this task manually.
I have no idea how this task can run without any ATLAS collisions.
It's the first task showing this problem for me.
The server could stop it when no work has been seen for hours.
This is the number of successful tasks before:
Number of completed tasks: 1,786

maeax · Joined: 2 May 07 · Posts: 2101 · Credit: 159,819,191 · RAC: 123,837
Message 45210 - Posted: 16 Aug 2021, 17:10:57 UTC - in response to Message 45207

A wingman (Agile Boincers) finished this task today, so it is not a bad WU, but there is still this long running time without work for this Windows task.

Richie_unstable · Joined: 26 Oct 18 · Posts: 91 · Credit: 4,188,598 · RAC: 0
Message 45225 - Posted: 18 Aug 2021, 23:14:04 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=323711966

That one ran for 100 hours on my host, but after some point it wasn't actually doing anything. Now it looks like it never did anything, but I saw it start up normally and begin crunching normally. I noticed some time ago that it was gone, but wanted to see if a miracle recovery would happen.

maeax · Joined: 2 May 07 · Posts: 2101 · Credit: 159,819,191 · RAC: 123,837
Message 45226 - Posted: 19 Aug 2021, 1:32:26 UTC - in response to Message 45225 - Last modified: 19 Aug 2021, 2:08:21 UTC

VirtualBox under Windows (6.1.12 and 6.1.26):
the vm_image.vdi entries are showing a yellow triangle (CMS AND ATLAS!).
It seems to be a networking problem when a task's disconnect is not completed correctly.
After this, new tasks have problems running well.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=323867357
This is the content of stderr.txt at the moment. This task also seems to find no end.

2021-08-18 01:01:29 (16452): Guest Log: Checking CVMFS...
2021-08-18 01:01:36 (16452): Guest Log: CVMFS is ok
2021-08-18 01:01:36 (16452): Guest Log: Mounting shared directory
2021-08-18 02:41:25 (16452): Status Report: Elapsed Time: '6000.000000'
2021-08-18 02:41:25 (16452): Status Report: CPU Time: '6000.937500'
2021-08-18 04:21:40 (16452): Status Report: Elapsed Time: '12000.000000'
2021-08-18 04:21:40 (16452): Status Report: CPU Time: '12002.625000'
2021-08-18 06:01:57 (16452): Status Report: Elapsed Time: '18000.000000'
2021-08-18 06:01:57 (16452): Status Report: CPU Time: '17995.812500'
2021-08-18 07:42:07 (16452): Status Report: Elapsed Time: '24000.000000'
2021-08-18 07:42:07 (16452): Status Report: CPU Time: '23998.875000'
2021-08-18 09:22:17 (16452): Status Report: Elapsed Time: '30000.000000'
2021-08-18 09:22:17 (16452): Status Report: CPU Time: '30002.203125'
2021-08-18 11:02:26 (16452): Status Report: Elapsed Time: '36000.000000'
2021-08-18 11:02:26 (16452): Status Report: CPU Time: '36005.656250'
2021-08-18 12:42:35 (16452): Status Report: Elapsed Time: '42000.000000'
2021-08-18 12:42:35 (16452): Status Report: CPU Time: '42009.125000'
2021-08-18 14:22:44 (16452): Status Report: Elapsed Time: '48000.000000'
2021-08-18 14:22:44 (16452): Status Report: CPU Time: '48012.171875'
2021-08-18 16:02:56 (16452): Status Report: Elapsed Time: '54000.000000'
2021-08-18 16:02:56 (16452): Status Report: CPU Time: '54014.734375'
2021-08-18 17:43:12 (16452): Status Report: Elapsed Time: '60000.000000'
2021-08-18 17:43:12 (16452): Status Report: CPU Time: '60015.000000'
2021-08-18 19:23:29 (16452): Status Report: Elapsed Time: '66000.000000'
2021-08-18 19:23:29 (16452): Status Report: CPU Time: '66015.500000'
2021-08-18 21:03:45 (16452): Status Report: Elapsed Time: '72000.000000'
2021-08-18 21:03:45 (16452): Status Report: CPU Time: '72014.781250'
2021-08-18 22:44:02 (16452): Status Report: Elapsed Time: '78000.000000'
2021-08-18 22:44:02 (16452): Status Report: CPU Time: '78013.890625'
2021-08-19 00:24:24 (16452): Status Report: Elapsed Time: '84000.000000'
2021-08-19 00:24:24 (16452): Status Report: CPU Time: '84003.593750'
2021-08-19 02:04:44 (16452): Status Report: Elapsed Time: '90000.000000'
2021-08-19 02:04:44 (16452): Status Report: CPU Time: '89999.718750'

computezrmle · Volunteer moderator · Volunteer developer · Volunteer tester · Help desk expert · Joined: 15 Jun 08 · Posts: 2413 · Credit: 226,535,347 · RAC: 131,647
Message 45227 - Posted: 19 Aug 2021, 5:41:34 UTC - in response to Message 45226

So far this task looks fine.
Elapsed Time and CPU Time are very close together:

2021-08-19 02:04:44 (16452): Status Report: Elapsed Time: '90000.000000'
2021-08-19 02:04:44 (16452): Status Report: CPU Time: '89999.718750'

Take a look at ATLAS Monitoring on console 2.
It will show how many events are processed and how many are still open.
If the task is running normally it writes to its internal logfiles.
They are then used by ATLAS Monitoring to calculate some numbers.

Regarding the vm_image.vdi entries showing the yellow triangle:
This is a problem of VirtualBox's device manager.
It just tells you that the device manager did not remove the pointer to a vdi file that has already been deleted.
Definitely not a network issue.
More likely would be a communication problem or a timeout between VirtualBox, vboxwrapper and BOINC.
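The comparison of Elapsed Time and CPU Time can be automated with a small Python sketch. It assumes only the "Status Report" line format seen in this thread and that the two kinds of report alternate as they do in the quoted logs. The names `cpu_ratios` and `looks_idle` and the 0.05 threshold are hypothetical choices for illustration, not anything defined by BOINC or vboxwrapper. On multi-core tasks the ratio can be well above 1, so only a very low value is treated as suspicious.

```python
import re

ELAPSED = re.compile(r"Status Report: Elapsed Time: '([\d.]+)'")
CPU = re.compile(r"Status Report: CPU Time: '([\d.]+)'")


def cpu_ratios(log_text):
    """Return CPU time divided by elapsed time for each pair of reports."""
    elapsed = [float(x) for x in ELAPSED.findall(log_text)]
    cpu = [float(x) for x in CPU.findall(log_text)]
    # Assumes the reports come in Elapsed/CPU pairs, as in the quoted logs.
    return [c / e for e, c in zip(elapsed, cpu) if e > 0]


def looks_idle(log_text, threshold=0.05):
    """True if the latest report shows almost no CPU use (threshold is arbitrary)."""
    ratios = cpu_ratios(log_text)
    return bool(ratios) and ratios[-1] < threshold


# Last report of the task from Message 45205 (hardly any CPU time).
idle_sample = """\
2021-08-15 11:46:04 (16612): Status Report: Elapsed Time: '60000.807667'
2021-08-15 11:46:04 (16612): Status Report: CPU Time: '150.109375'
"""

# Last report of the task from Message 45226 (CPU time tracks elapsed time).
busy_sample = """\
2021-08-19 02:04:44 (16452): Status Report: Elapsed Time: '90000.000000'
2021-08-19 02:04:44 (16452): Status Report: CPU Time: '89999.718750'
"""

print(looks_idle(idle_sample))  # True
print(looks_idle(busy_sample))  # False
```

A low ratio only says that the VM is not computing; as the post points out, ATLAS Monitoring on console 2 is still the place to see how many events have actually been processed.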

maeax · Joined: 2 May 07 · Posts: 2101 · Credit: 159,819,191 · RAC: 123,837
Message 45228 - Posted: 19 Aug 2021, 6:20:35 UTC - in response to Message 45227

> Take a look at ATLAS Monitoring on console 2.

This is not possible, because RDP is showing a kernel panic.
At the moment I am waiting for a second ATLAS task to finish (console F2 says 158 of 200 collisions are done). After that, a restart of BOINC and this faulty ATLAS task will be done.

These are the last lines of vbox.log at the moment:

24:51:57.510201 VRDP: Connection closed: 1
24:51:57.510328 VBVA: VRDP acceleration has been disabled.
31:11:25.437415 VRDP: New connection:
31:11:25.437592 VRDP: Connection opened (IPv6): 2
31:11:25.437765 VRDP: Negotiating security method with the client.
31:11:25.438705 VRDP: failed to access the server certificate file '': VERR_FILE_NOT_FOUND
31:11:25.438805 VRDP: Connection closed: 2
31:11:27.329274 VRDP: New connection:
31:11:27.329450 VRDP: Connection opened (IPv6): 3
31:11:27.329666 VRDP: Negotiating security method with the client.
31:11:27.341875 VRDP: Methods 0x0000001b
31:11:27.341941 VRDP: Channel: [rdpdr] [1004]. Accepted.
31:11:27.341957 VRDP: Channel: [rdpsnd] [1005]. Accepted.
31:11:27.341973 VRDP: Channel: [cliprdr] [1006]. Accepted.
31:11:27.341990 VRDP: Channel: [drdynvc] [1007]. Accepted.
31:11:27.342005 VRDP: Unsupported SEC_TAG: 0xC006/8. Skipping.
31:11:27.342020 VRDP: Unsupported SEC_TAG: 0xC00A/8. Skipping.
31:11:27.408733 VRDP: Client seems to be MSFT.
31:11:27.408777 VRDP: Logon: PCRYZEN9 (::1) build 19041. User: [] Domain: [] Screen: 0
31:11:27.408948 AUTH: User: []. Domain: []. Authentication type: [Null]
31:11:27.408963 AUTH: Access granted.
31:11:27.409828 VRDP: Enabling upstream audio.
31:11:27.409958 VBVA: VRDP acceleration has been requested.
31:11:27.414514 VMMDev: SetVideoModeHint: Got a video mode hint (1920x1080x32)@(0x0),(1;0) at 0
31:11:27.414652 VRDP: SunFlsh disabled.
31:11:27.455692 VRDP: SCARD enabled for 4

maeax · Joined: 2 May 07 · Posts: 2101 · Credit: 159,819,191 · RAC: 123,837
Message 45230 - Posted: 20 Aug 2021, 2:33:57 UTC - in response to Message 45228

> a restart of Boinc and this faulty Atlas will be done.

After this, the ATLAS task finished correctly.
So this one was not faulty either.
I will stop running ATLAS on Windows for a while.

Toby Broom · Volunteer moderator · Joined: 27 Sep 08 · Posts: 807 · Credit: 652,613,883 · RAC: 278,273
Message 45283 - Posted: 3 Sep 2021, 17:17:02 UTC - Last modified: 3 Sep 2021, 17:21:34 UTC

I have some WUs that are long runners, and Alt-F2 does not do anything, even after a reboot.

Are these legit VMs where no one remembered to add the code, or junk?

Toby Broom · Volunteer moderator · Joined: 27 Sep 08 · Posts: 807 · Credit: 652,613,883 · RAC: 278,273
Message 45284 - Posted: 3 Sep 2021, 17:22:19 UTC - in response to Message 45283

Actually, a few minutes after the reboot they were doing something. I'm not sure what happened in the last 24 hours, though.

Harri Liljeroos · Joined: 28 Sep 04 · Posts: 675 · Credit: 43,662,969 · RAC: 15,952
Message 45285 - Posted: 3 Sep 2021, 17:51:26 UTC - Last modified: 3 Sep 2021, 17:52:13 UTC

The latest tasks I have been running take 25 to 50 hours when running on a single core. The consoles seem to work normally. I haven't tried rebooting, though.

tullio · Joined: 19 Feb 08 · Posts: 708 · Credit: 4,336,250 · RAC: 0
Message 45301 - Posted: 7 Sep 2021, 16:32:43 UTC

They are taking about 6 hours on two cores (or processors?) of my Intel i5 9400F. It is a 3-core CPU with six processors. The OS is Windows 10.
Tullio

Harri Liljeroos · Joined: 28 Sep 04 · Posts: 675 · Credit: 43,662,969 · RAC: 15,952
Message 45302 - Posted: 7 Sep 2021, 17:58:14 UTC

My latest ones (downloaded about 3 days ago) are taking about 16 to 18 hours on a single core, so they are now shorter. Unfortunately the credit has dropped even more: per second of runtime it is now about half of what it was last week.

maeax · Joined: 2 May 07 · Posts: 2101 · Credit: 159,819,191 · RAC: 123,837
Message 45409 - Posted: 29 Sep 2021, 17:26:38 UTC

This WU stops running after 10 minutes on all computers:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=172206360

maeax · Joined: 2 May 07 · Posts: 2101 · Credit: 159,819,191 · RAC: 123,837
Message 45539 - Posted: 26 Oct 2021, 6:39:23 UTC - in response to Message 45409

[2021-10-26 08:27:39] 2021-10-26 06:27:39,440 [wrapper] apfmon messages muted
[2021-10-26 08:27:39] * Error codes and diagnostics *
[2021-10-26 08:27:39] "exeErrorCode": 39,
[2021-10-26 08:27:39] "exeErrorDiag": "CVMFS DBRelease setup file /cvmfs/atlas.cern.ch/repo/sw/database/DBRelease/current/setup.py was not readable",
[2021-10-26 08:27:39] "pilotErrorCode": 1305,
[2021-10-26 08:27:39] "pilotErrorDiag": "Failed to execute payload",
[2021-10-26 08:27:39] * Listing of results directory *
[2021-10-26 08:27:39] insgesamt 261744

Atlas native-VM CentOS7.
https://lhcathome.cern.ch/lhcathome/results.php?userid=75468&offset=0&show_names=0&state=5&appid=
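When comparing several failed tasks it can help to pull the error fields out of such a log excerpt. The following Python sketch assumes only the `"name": value,` line shape shown in the post; the function name `error_fields` is hypothetical, and the meaning of the individual codes is not interpreted here.

```python
import re

# Matches '"name": 123' or '"name": "text"' as they appear in the quoted log.
FIELD = re.compile(r'"(\w+)":\s*(?:(\d+)|"([^"]*)")')


def error_fields(log_text):
    """Collect the quoted name/value pairs from a pilot log excerpt."""
    fields = {}
    for name, number, text in FIELD.findall(log_text):
        fields[name] = int(number) if number else text
    return fields


sample = """\
[2021-10-26 08:27:39] "exeErrorCode": 39,
[2021-10-26 08:27:39] "exeErrorDiag": "CVMFS DBRelease setup file /cvmfs/atlas.cern.ch/repo/sw/database/DBRelease/current/setup.py was not readable",
[2021-10-26 08:27:39] "pilotErrorCode": 1305,
[2021-10-26 08:27:39] "pilotErrorDiag": "Failed to execute payload",
"""

for name, value in error_fields(sample).items():
    print(f"{name}: {value}")
```

In this excerpt the diagnostic text again points at a CVMFS file that could not be read, which is the same kind of symptom discussed earlier in the thread.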

LHC@home