Message boards : ATLAS application : Bad WUs?

keputnam

Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 45174 - Posted: 3 Aug 2021, 19:44:25 UTC
Last modified: 3 Aug 2021, 19:45:40 UTC

The last two ATLAS WUs that I have downloaded appear to be faulty.

One ran for 3 days before I noticed that, while I could access the VM, ALT-F2 had no effect at all. I aborted that one.

The latest one has now been running for about an hour and exhibits the same behavior.

Also, while BOINC reports the WU as running, Resource Monitor shows VBox using no CPU at all.

I have rebooted and upgraded to the latest VBox version, with no change.


TIA for any ideas or suggestions
ID: 45174
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,942,439
RAC: 137,308
Message 45175 - Posted: 3 Aug 2021, 20:31:27 UTC - in response to Message 45174.  

Unfortunately this task has no error log that can be checked:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=323061097

Be so kind as to post stderr.txt from the currently idle task here before you cancel that task.
ID: 45175
keputnam

Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 45176 - Posted: 3 Aug 2021, 22:56:16 UTC

Already aborted the second. It doesn't surprise me that there is no log file; the WUs never got any CPU time.

But a third, previously downloaded and queued, has started and is responding normally.
ID: 45176
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,942,439
RAC: 137,308
Message 45178 - Posted: 4 Aug 2021, 7:05:38 UTC - in response to Message 45176.  

This time the task wrote an error log:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=323117065

The following lines show what caused it to fail.
The task could not contact the relevant CVMFS repositories:
2021-08-03 11:29:51 (20508): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2021-08-03 11:29:51 (20508): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2021-08-03 11:29:52 (20508): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!

It's very unlikely that all of them fail at the very same moment since:
- each repository has a couple of fail-over servers spread around the world
- other users didn't have similar problems.


It's more likely that there was an issue regarding your internet access or an issue within your LAN.
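
To check a stderr.txt for this pattern without reading it by hand, a minimal Python sketch could look like the following. It assumes the log uses exactly the "Probing /cvmfs/... Failed!" wording quoted above; the function name failed_cvmfs_probes is made up for this illustration and is not part of BOINC or vboxwrapper.

```python
import re

# Pattern modeled on the "Guest Log: Probing ... Failed!" lines quoted above.
# Assumption: a failed probe always ends with the literal text "Failed!".
FAILED_PROBE_RE = re.compile(r"Probing (/cvmfs/\S+?)\.\.\. Failed!")


def failed_cvmfs_probes(stderr_text):
    """Return the CVMFS repository paths whose probe was logged as failed."""
    failed = []
    for line in stderr_text.splitlines():
        match = FAILED_PROBE_RE.search(line)
        if match:
            failed.append(match.group(1))
    return failed
```

If every repository shows up in the returned list at the same timestamp, that points to the host's own network connection rather than to a single CVMFS server.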
ID: 45178
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 45205 - Posted: 15 Aug 2021, 10:10:36 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=323709669
2021-08-14 20:44:43 (16612): Status Report: Elapsed Time: '6000.000000'
2021-08-14 20:44:43 (16612): Status Report: CPU Time: '35.015625'
2021-08-14 22:24:55 (16612): Status Report: Elapsed Time: '12000.000000'
2021-08-14 22:24:55 (16612): Status Report: CPU Time: '47.625000'
2021-08-15 00:05:04 (16612): Status Report: Elapsed Time: '18000.000000'
2021-08-15 00:05:04 (16612): Status Report: CPU Time: '60.656250'
2021-08-15 01:45:12 (16612): Status Report: Elapsed Time: '24000.000000'
2021-08-15 01:45:12 (16612): Status Report: CPU Time: '73.750000'
2021-08-15 03:25:20 (16612): Status Report: Elapsed Time: '30000.000000'
2021-08-15 03:25:20 (16612): Status Report: CPU Time: '86.875000'
2021-08-15 05:05:29 (16612): Status Report: Elapsed Time: '36000.807667'
2021-08-15 05:05:29 (16612): Status Report: CPU Time: '99.953125'
2021-08-15 06:45:37 (16612): Status Report: Elapsed Time: '42000.807667'
2021-08-15 06:45:37 (16612): Status Report: CPU Time: '113.000000'
2021-08-15 08:25:46 (16612): Status Report: Elapsed Time: '48000.807667'
2021-08-15 08:25:46 (16612): Status Report: CPU Time: '125.218750'
2021-08-15 10:05:55 (16612): Status Report: Elapsed Time: '54000.807667'
2021-08-15 10:05:55 (16612): Status Report: CPU Time: '137.500000'
2021-08-15 11:46:04 (16612): Status Report: Elapsed Time: '60000.807667'
2021-08-15 11:46:04 (16612): Status Report: CPU Time: '150.109375'
2021-08-15 12:01:01 (16612): Powering off VM.
Is it possible to stop such a task from the System??
ID: 45205
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 45206 - Posted: 15 Aug 2021, 13:46:14 UTC - in response to Message 45205.  

Is it possible to stop such a task from the System??

What do you mean by 'from the System'?
By the server, by your OS, or a clean stop instead of an abort by you?
A clean stop can be done by hand or by script.

To use the following script, adjust the boincpath folder (D:\Boinc1) to your needs.

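@echo off
rem Ask for the slot number of the stuck ATLAS task (e.g. 3 for slots\3).
set "slotdir="
set /p "slotdir=In which slot-directory is the endless ATLAS-task running you want to stop gracefully? "
rem Adjust D:\Boinc1 to your own BOINC data directory.
set "boincpath=D:\Boinc1\slots\%slotdir%\shared"
rem Create an empty atlas_done file in the task's shared folder to request a clean stop.
copy /y NUL "%boincpath%\atlas_done" >NUL
exit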
ID: 45206
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 45207 - Posted: 15 Aug 2021, 14:24:33 UTC - in response to Message 45206.  
Last modified: 15 Aug 2021, 14:36:01 UTC

What do you mean by 'from the System'?
By the server, by your OS, or a clean stop instead of an abort by you?
A clean stop can be done by hand or by script.

The task started without getting any work from ATLAS. After 16 hours, I stopped it manually.
I have no idea how this task could run without processing any ATLAS collisions.
It's the first task showing this problem for me.
The server could stop such a task when no work has been seen for hours.
This is my number of successful tasks before that:
Number of completed tasks: 1,786
ID: 45207
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 45210 - Posted: 16 Aug 2021, 17:10:57 UTC - in response to Message 45207.  

A wingman (Agile Boincers) finished this task today, so it's not a bad WU, but there is still this long runtime without any work on this Windows task.
ID: 45210
Richie_unstable

Joined: 26 Oct 18
Posts: 90
Credit: 4,188,598
RAC: 0
Message 45225 - Posted: 18 Aug 2021, 23:14:04 UTC

https://lhcathome.cern.ch/lhcathome/result.php?resultid=323711966

That one ran for 100 hours on my host, but after some point it wasn't actually doing anything. Now it looks like it never did anything, but I saw it start up and begin crunching normally. I noticed some time ago that it was gone, but wanted to see if a miracle recovery would happen.
ID: 45225
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 45226 - Posted: 19 Aug 2021, 1:32:26 UTC - in response to Message 45225.  
Last modified: 19 Aug 2021, 2:08:21 UTC

VirtualBox under Windows (6.1.12 and 6.1.26):
the vm_image.vdi entries are showing a yellow triangle (CMS AND ATLAS!).
It seems to be a networking problem when a task is not disconnected correctly on stopping.
After this, there are problems getting new tasks to run well.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=323867357
This is the stderr.txt at the moment.
This task also seems to find no end.
2021-08-18 01:01:29 (16452): Guest Log: Checking CVMFS...

2021-08-18 01:01:36 (16452): Guest Log: CVMFS is ok

2021-08-18 01:01:36 (16452): Guest Log: Mounting shared directory

2021-08-18 02:41:25 (16452): Status Report: Elapsed Time: '6000.000000'
2021-08-18 02:41:25 (16452): Status Report: CPU Time: '6000.937500'
2021-08-18 04:21:40 (16452): Status Report: Elapsed Time: '12000.000000'
2021-08-18 04:21:40 (16452): Status Report: CPU Time: '12002.625000'
2021-08-18 06:01:57 (16452): Status Report: Elapsed Time: '18000.000000'
2021-08-18 06:01:57 (16452): Status Report: CPU Time: '17995.812500'
2021-08-18 07:42:07 (16452): Status Report: Elapsed Time: '24000.000000'
2021-08-18 07:42:07 (16452): Status Report: CPU Time: '23998.875000'
2021-08-18 09:22:17 (16452): Status Report: Elapsed Time: '30000.000000'
2021-08-18 09:22:17 (16452): Status Report: CPU Time: '30002.203125'
2021-08-18 11:02:26 (16452): Status Report: Elapsed Time: '36000.000000'
2021-08-18 11:02:26 (16452): Status Report: CPU Time: '36005.656250'
2021-08-18 12:42:35 (16452): Status Report: Elapsed Time: '42000.000000'
2021-08-18 12:42:35 (16452): Status Report: CPU Time: '42009.125000'
2021-08-18 14:22:44 (16452): Status Report: Elapsed Time: '48000.000000'
2021-08-18 14:22:44 (16452): Status Report: CPU Time: '48012.171875'
2021-08-18 16:02:56 (16452): Status Report: Elapsed Time: '54000.000000'
2021-08-18 16:02:56 (16452): Status Report: CPU Time: '54014.734375'
2021-08-18 17:43:12 (16452): Status Report: Elapsed Time: '60000.000000'
2021-08-18 17:43:12 (16452): Status Report: CPU Time: '60015.000000'
2021-08-18 19:23:29 (16452): Status Report: Elapsed Time: '66000.000000'
2021-08-18 19:23:29 (16452): Status Report: CPU Time: '66015.500000'
2021-08-18 21:03:45 (16452): Status Report: Elapsed Time: '72000.000000'
2021-08-18 21:03:45 (16452): Status Report: CPU Time: '72014.781250'
2021-08-18 22:44:02 (16452): Status Report: Elapsed Time: '78000.000000'
2021-08-18 22:44:02 (16452): Status Report: CPU Time: '78013.890625'
2021-08-19 00:24:24 (16452): Status Report: Elapsed Time: '84000.000000'
2021-08-19 00:24:24 (16452): Status Report: CPU Time: '84003.593750'
2021-08-19 02:04:44 (16452): Status Report: Elapsed Time: '90000.000000'
2021-08-19 02:04:44 (16452): Status Report: CPU Time: '89999.718750'
ID: 45226
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,942,439
RAC: 137,308
Message 45227 - Posted: 19 Aug 2021, 5:41:34 UTC - in response to Message 45226.  

So far this task looks fine.
Elapsed Time and CPU Time are very close together:
2021-08-19 02:04:44 (16452): Status Report: Elapsed Time: '90000.000000'
2021-08-19 02:04:44 (16452): Status Report: CPU Time: '89999.718750'
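
The same comparison can be scripted. The Python sketch below reads the "Status Report" lines in the format pasted in this thread and returns CPU time divided by elapsed time for the latest report. The function names and the 0.1 threshold are illustrative assumptions, not BOINC or vboxwrapper settings.

```python
import re

# Pattern modeled on the "Status Report" lines quoted in this thread.
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")


def last_cpu_ratio(stderr_text):
    """Return CPU time / elapsed time from the last complete report, or None."""
    elapsed = None
    cpu = None
    ratio = None
    for line in stderr_text.splitlines():
        match = STATUS_RE.search(line)
        if not match:
            continue
        value = float(match.group(2))
        if match.group(1) == "Elapsed Time":
            # A new report starts; wait for its matching CPU Time line.
            elapsed = value
            cpu = None
        else:
            cpu = value
        if elapsed and cpu is not None:
            ratio = cpu / elapsed
    return ratio


def looks_idle(stderr_text, threshold=0.1):
    """Flag a task whose latest CPU/elapsed ratio is below the (arbitrary) threshold."""
    ratio = last_cpu_ratio(stderr_text)
    return ratio is not None and ratio < threshold
```

For the two lines above the ratio is close to 1, so the task is computing. For the idle task reported earlier in this thread (CPU Time 150.109375 at Elapsed Time 60000.807667) the ratio is well below 0.01.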



Take a look at ATLAS Monitoring on console 2.
It will show how many events are processed and how many are still open.
If the task is running normally it writes to its internal logfiles.
They are then used by ATLAS Monitoring to calculate some numbers.


Regarding the vm_image.vdi entries showing the yellow triangle.
This is a problem of VirtualBox's device manager.
It just tells you that the device manager did not remove the pointer to a vdi file that has already been deleted.
Definitely not a network issue.
More likely would be a communication problem or a timeout between VirtualBox, vboxwrapper and BOINC.
ID: 45227
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 45228 - Posted: 19 Aug 2021, 6:20:35 UTC - in response to Message 45227.  


Take a look at ATLAS Monitoring on console 2.

This is not possible, because RDP is showing a kernel panic.
I'm currently waiting for a second ATLAS task to finish (console F2 says 158 of 200 collisions are done);
then I will restart BOINC and this faulty ATLAS task.
These are the last lines of the vbox.log at the moment:
24:51:57.510201 VRDP: Connection closed: 1
24:51:57.510328 VBVA: VRDP acceleration has been disabled.
31:11:25.437415 VRDP: New connection:
31:11:25.437592 VRDP: Connection opened (IPv6): 2
31:11:25.437765 VRDP: Negotiating security method with the client.
31:11:25.438705 VRDP: failed to access the server certificate file '': VERR_FILE_NOT_FOUND
31:11:25.438805 VRDP: Connection closed: 2
31:11:27.329274 VRDP: New connection:
31:11:27.329450 VRDP: Connection opened (IPv6): 3
31:11:27.329666 VRDP: Negotiating security method with the client.
31:11:27.341875 VRDP: Methods 0x0000001b
31:11:27.341941 VRDP: Channel: [rdpdr] [1004]. Accepted.
31:11:27.341957 VRDP: Channel: [rdpsnd] [1005]. Accepted.
31:11:27.341973 VRDP: Channel: [cliprdr] [1006]. Accepted.
31:11:27.341990 VRDP: Channel: [drdynvc] [1007]. Accepted.
31:11:27.342005 VRDP: Unsupported SEC_TAG: 0xC006/8. Skipping.
31:11:27.342020 VRDP: Unsupported SEC_TAG: 0xC00A/8. Skipping.
31:11:27.408733 VRDP: Client seems to be MSFT.
31:11:27.408777 VRDP: Logon: PCRYZEN9 (::1) build 19041. User: [] Domain: [] Screen: 0
31:11:27.408948 AUTH: User: []. Domain: []. Authentication type: [Null]
31:11:27.408963 AUTH: Access granted.
31:11:27.409828 VRDP: Enabling upstream audio.
31:11:27.409958 VBVA: VRDP acceleration has been requested.
31:11:27.414514 VMMDev: SetVideoModeHint: Got a video mode hint (1920x1080x32)@(0x0),(1;0) at 0
31:11:27.414652 VRDP: SunFlsh disabled.
31:11:27.455692 VRDP: SCARD enabled for 4
ID: 45228
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 45230 - Posted: 20 Aug 2021, 2:33:57 UTC - in response to Message 45228.  

then I will restart BOINC and this faulty ATLAS task.

After this, the ATLAS task finished correctly.
So this one was not faulty either.
I will stop running ATLAS on Windows for a while.
ID: 45230
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,753,329
RAC: 233,078
Message 45283 - Posted: 3 Sep 2021, 17:17:02 UTC
Last modified: 3 Sep 2021, 17:21:34 UTC

I have some WUs that are long runners, and Alt-F2 does not do anything, even after a reboot.

Are these legit VMs where no one remembered to add the code, or junk?
ID: 45283
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 798
Credit: 644,753,329
RAC: 233,078
Message 45284 - Posted: 3 Sep 2021, 17:22:19 UTC - in response to Message 45283.  

Actually, a few minutes after the reboot they were doing something; not sure what happened in the last 24 hours, though.
ID: 45284
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 45285 - Posted: 3 Sep 2021, 17:51:26 UTC
Last modified: 3 Sep 2021, 17:52:13 UTC

The latest tasks I have been running take 25 to 50 hours when running on a single core. The consoles seem to work normally. I haven't tried rebooting, though.
ID: 45285
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 45301 - Posted: 7 Sep 2021, 16:32:43 UTC

They are taking about 6 hours on two cores (or processors?) of my Intel i5 9400F. It is a 6-core CPU with six logical processors. OS is Windows 10.
Tullio
ID: 45301
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,152,472
RAC: 15,698
Message 45302 - Posted: 7 Sep 2021, 17:58:14 UTC

My latest ones (downloaded about 3 days ago) are taking about 16 to 18 hours on a single core, so they are now shorter. Unfortunately the credit has dropped even more: per second of runtime it is now about half of what it was last week.
ID: 45302
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 45409 - Posted: 29 Sep 2021, 17:26:38 UTC

This WU stops running after 10 minutes on all computers:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=172206360
ID: 45409
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 45539 - Posted: 26 Oct 2021, 6:39:23 UTC - in response to Message 45409.  

[2021-10-26 08:27:39] 2021-10-26 06:27:39,440 [wrapper] apfmon messages muted
[2021-10-26 08:27:39] *** Error codes and diagnostics ***
[2021-10-26 08:27:39] "exeErrorCode": 39,
[2021-10-26 08:27:39] "exeErrorDiag": "CVMFS DBRelease setup file /cvmfs/atlas.cern.ch/repo/sw/database/DBRelease/current/setup.py was not readable",
[2021-10-26 08:27:39] "pilotErrorCode": 1305,
[2021-10-26 08:27:39] "pilotErrorDiag": "Failed to execute payload",
[2021-10-26 08:27:39] *** Listing of results directory ***
[2021-10-26 08:27:39] insgesamt 261744
Atlas native-VM CentOS7.
https://lhcathome.cern.ch/lhcathome/results.php?userid=75468&offset=0&show_names=0&state=5&appid=
ID: 45539

