Message boards : ATLAS application : ATLAS vbox v2.01

David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 376
Credit: 13,902,740
RAC: 8,758
Message 46924 - Posted: 27 Jun 2022, 7:46:16 UTC

Hi all,

We have just released a new VirtualBox version of the ATLAS app, v2.01.

The most significant change is the use of a new version of vboxwrapper, which enables multiattach mode. In short, this means there is no need to copy the large vdi image file at the start of each task, so tasks will start more quickly.
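
For readers unfamiliar with the term: in VirtualBox, a disk image attached in multiattach mode is shared, and each VM writes to its own differencing image instead of a full copy. Below is a minimal sketch of the kind of VBoxManage call involved. The VM name and controller name are placeholders, and the exact commands vboxwrapper issues are not shown in this thread.

```python
def multiattach_command(vm_name, image_path, controller="Hard Disk Controller"):
    """Build a VBoxManage call that attaches a disk in multiattach mode.

    vm_name and controller are placeholders (assumptions), not the values
    vboxwrapper actually uses.
    """
    return [
        "VBoxManage", "storageattach", vm_name,
        "--storagectl", controller,
        "--port", "0",
        "--device", "0",
        "--type", "hdd",
        "--medium", image_path,
        # Share the base image; writes go to a per-VM differencing image.
        "--mtype", "multiattach",
    ]


cmd = multiattach_command("example_vm", "ATLAS_vbox_2.01_image.vdi")
print(" ".join(cmd))
```

The function only builds the argument list, so it can be inspected without VirtualBox installed.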

For more technical details see the GitHub issue.

As always, let us know if you see any issues.

David

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2050
Credit: 154,741,166
RAC: 140,079
Message 46925 - Posted: 27 Jun 2022, 8:02:44 UTC - in response to Message 46924.  
Last modified: 27 Jun 2022, 8:13:29 UTC

Mo 27 Jun 2022 10:00:00 CEST | LHC@home | No tasks are available for ATLAS Simulation

Did you restart all project server instances to make them aware of the new app version?

<edit>
Got a task.
Could have been a wrong pref or local setting.
</edit>

maeax

Joined: 2 May 07
Posts: 1622
Credit: 76,640,889
RAC: 245,927
Message 46926 - Posted: 27 Jun 2022, 8:22:14 UTC - in response to Message 46924.  
Last modified: 27 Jun 2022, 8:35:24 UTC

Tasks are downloaded on Win11 Pro, including the 1.07 GByte .vdi (10:30 min. at 70 MBit/s)
vboxwrapper_26204_windows_x86_64.exe :-)
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10631979

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2050
Credit: 154,741,166
RAC: 140,079
Message 46927 - Posted: 27 Jun 2022, 8:31:28 UTC

1st task (on Linux) started fine and is now processing events - will take a while.
Logfiles don't show unexpected issues.

maeax

Joined: 2 May 07
Posts: 1622
Credit: 76,640,889
RAC: 245,927
Message 46928 - Posted: 27 Jun 2022, 8:42:48 UTC - in response to Message 46926.  
Last modified: 27 Jun 2022, 9:03:13 UTC

RDP shows no running collision. 20 sec CPU time, 15 min run time.
Cancelling it now!

Postponed: VM environment needs to be cleaned up.


I have set the prefs to use Theory AND ATLAS with unlimited tasks instead of 8.
Flooded with Theory tasks, OMG!

maeax

Joined: 2 May 07
Posts: 1622
Credit: 76,640,889
RAC: 245,927
Message 46929 - Posted: 27 Jun 2022, 9:22:44 UTC - in response to Message 46928.  

The next ATLAS task shows no RDP in BOINC.
I have stopped testing ATLAS completely and am running only Theory with the old wrapper.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2050
Credit: 154,741,166
RAC: 140,079
Message 46930 - Posted: 27 Jun 2022, 9:35:05 UTC - in response to Message 46929.  

These tasks succeeded:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=358663450
https://lhcathome.cern.ch/lhcathome/result.php?resultid=358663452


Your issues are caused neither by the new vboxwrapper nor by the new method using differencing images.
The logfiles clearly show there are network issues when the VM makes CVMFS requests.
2022-06-27 10:29:22 (2260): Guest Log: Checking CVMFS...
2022-06-27 10:29:24 (2260): Guest Log: Failed to check CVMFS, check output from cvmfs_config probe:
2022-06-27 10:29:24 (2260): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2022-06-27 10:29:24 (2260): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2022-06-27 10:29:24 (2260): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!
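
To pull the failed repositories out of a longer task log, something like the following could be used. It is a minimal sketch: the regular expression is written against the lines quoted above, and treating "OK" as the success marker is an assumption about what a healthy probe prints.

```python
import re

# Matches lines such as "... Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!"
PROBE = re.compile(r"Guest Log: Probing (/cvmfs/\S+?)\.\.\. (\S+)")


def failed_repositories(log_text):
    """Return the CVMFS repositories whose probe did not report OK."""
    failed = []
    for line in log_text.splitlines():
        match = PROBE.search(line)
        if match and match.group(2) != "OK":  # "OK" is an assumed success marker
            failed.append(match.group(1))
    return failed


sample = """\
2022-06-27 10:29:22 (2260): Guest Log: Checking CVMFS...
2022-06-27 10:29:24 (2260): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2022-06-27 10:29:24 (2260): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2022-06-27 10:29:24 (2260): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!
"""
print(failed_repositories(sample))
```

If every repository shows up in the result, as it does here, that points at the network path rather than at a single repository.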


My guess would be that your router can't deal with the huge number of concurrently open connections and drops new connection requests.

maeax

Joined: 2 May 07
Posts: 1622
Credit: 76,640,889
RAC: 245,927
Message 46931 - Posted: 27 Jun 2022, 9:49:29 UTC - in response to Message 46930.  

PLEASE!
FATAL: Could not read from the boot medium! System halted.

Let's keep calm...

Erich56

Joined: 18 Dec 15
Posts: 1532
Credit: 49,352,865
RAC: 66,502
Message 46932 - Posted: 27 Jun 2022, 9:55:00 UTC
Last modified: 27 Jun 2022, 9:56:53 UTC

Here (Windows 10) the BOINC log says: "No tasks available for ATLAS simulation"
while the server status shows 2,920 unsent tasks ... :-(

EDIT: receiving new tasks just now :-) including image.vdi v2.01

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2050
Credit: 154,741,166
RAC: 140,079
Message 46933 - Posted: 27 Jun 2022, 10:01:23 UTC - in response to Message 46931.  

This needs more details.

Is it on one computer or on many/all?
Describe what happened immediately before the error occurred:
number of tasks, status (e.g. starting/running...), task types (LHC or other projects...)

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2050
Credit: 154,741,166
RAC: 140,079
Message 46937 - Posted: 27 Jun 2022, 15:02:53 UTC - in response to Message 46927.  

1st task (on Linux) started fine and is now processing events - will take a while.
Logfiles don't show unexpected issues.

The task succeeded.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=358663313
2022-06-27 10:17:49 (68925): Detected: vboxwrapper 26204
.
.
.
2022-06-27 10:17:52 (68925): Adding virtual disk drive to VM. (ATLAS_vbox_2.01_image.vdi)
.
.
.
2022-06-27 16:54:00 (68925): Guest Log: HITS file was successfully produced
.
.
.
2022-06-27 16:54:01 (68925): Guest Log:  *** Success! Shutting down the machine. ***

rbpeake

Joined: 17 Sep 04
Posts: 84
Credit: 26,144,350
RAC: 9,924
Message 46940 - Posted: 27 Jun 2022, 15:40:32 UTC

There seem to be a lot of errors generally. Mine have all errored out, and typically several people before me have also errored out. Because the errors occur within the first 20 minutes or so, a lot of the work is not being completed successfully.
Regards,
Bob P.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2050
Credit: 154,741,166
RAC: 140,079
Message 46941 - Posted: 27 Jun 2022, 16:10:33 UTC - in response to Message 46940.  

There is a rogue host in the list that crashes everything (even the older ATLAS tasks).
I informed CERN.


@rbpeake
Nonetheless, your computer also crashes CMS tasks, which have not been changed.
You may want to let your work buffer run dry and check your VirtualBox installation.

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1132
Credit: 6,939,527
RAC: 642
Message 46942 - Posted: 27 Jun 2022, 16:45:51 UTC

A lot of tasks fail because they can't connect to CVMFS.
I have 4 running OK now, but 8 errors because of no connection.
The problem here is that those tasks will run forever, because there is no check to shut down the VM gracefully.
I noticed them because they did not use any CPU, so I forced a computation error or aborted them.

Example:
2022-06-27 18:30:54 (13232): Guest Log: Checking CVMFS...
2022-06-27 18:30:56 (13232): Guest Log: Failed to check CVMFS, check output from cvmfs_config probe:
2022-06-27 18:30:56 (13232): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2022-06-27 18:30:56 (13232): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2022-06-27 18:30:56 (13232): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!

maeax

Joined: 2 May 07
Posts: 1622
Credit: 76,640,889
RAC: 245,927
Message 46945 - Posted: 27 Jun 2022, 23:06:34 UTC - in response to Message 46942.  

Theory has a lot of short tasks. All are connecting to CVMFS and using Squid without problems.

maeax

Joined: 2 May 07
Posts: 1622
Credit: 76,640,889
RAC: 245,927
Message 46946 - Posted: 28 Jun 2022, 0:08:22 UTC - in response to Message 46945.  
Last modified: 28 Jun 2022, 0:08:56 UTC

Seeing a lot of SIGUSR1 problems from different servers.
Theory AND ATLAS. Win11 Pro.

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1132
Credit: 6,939,527
RAC: 642
Message 46947 - Posted: 28 Jun 2022, 5:46:45 UTC

I suppose those connection errors are caused on the server side.
LHC's max_connections setting may have been exceeded.
That's one side of the problem.
The other problem is that those tasks with failed connections keep running forever.
The administrator should build in some connection retries and, if they still fail, shut the VM down.
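
A minimal sketch of what such a bounded retry could look like. The `probe` and `shutdown` callables, the attempt count and the delay are all hypothetical; this is not how the ATLAS VM currently behaves.

```python
import time


def probe_with_retries(probe, shutdown, attempts=3, delay_seconds=60, sleep=time.sleep):
    """Call probe() up to `attempts` times; call shutdown() if every attempt fails.

    probe and shutdown are hypothetical callables supplied by the caller.
    Returns True if a probe succeeded, False if shutdown() was called.
    """
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay_seconds)  # wait before the next try
    shutdown()
    return False


# Example with stand-in callables: the probe fails twice, then succeeds.
results = iter([False, False, True])
events = []
ok = probe_with_retries(
    lambda: next(results),
    lambda: events.append("shutdown"),
    sleep=lambda seconds: None,  # skip real waiting in the example
)
print(ok, events)
```

The point of the upper bound is that a task with no connection ends with an error after a known time instead of idling indefinitely.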

maeax

Joined: 2 May 07
Posts: 1622
Credit: 76,640,889
RAC: 245,927
Message 46948 - Posted: 28 Jun 2022, 6:02:00 UTC - in response to Message 46947.  
Last modified: 28 Jun 2022, 6:18:00 UTC

I saved a Theory task across an overnight PC shutdown for testing:
It has been waiting for 15 min. with this last line:
grid.cern.ch: Waiting for delivery of SIGUSR1...
A new Theory task on this PC started one hour ago with a startup time of 20 min.
ATLAS has the same traffic...
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10408749
2022-06-28 08:11:34 (6312): VM is no longer is a running state. It is in 'lse, errorID=DevATA_DISKFULL message="Host system reported disk full. VM execution is suspended. You can resume after freeing some space"
'.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2050
Credit: 154,741,166
RAC: 140,079
Message 46949 - Posted: 28 Jun 2022, 6:20:14 UTC - in response to Message 46947.  

These are the logfile lines from your recently failed tasks:
2022-06-28 07:00:34 (4532): Guest Log: Checking CVMFS...
2022-06-28 07:00:35 (4532): Guest Log: Failed to check CVMFS, check output from cvmfs_config probe:
2022-06-28 07:00:35 (4532): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2022-06-28 07:00:35 (4532): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2022-06-28 07:00:36 (4532): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!

2022-06-28 07:05:30 (6484): Guest Log: Checking CVMFS...
2022-06-28 07:05:31 (6484): Guest Log: Failed to check CVMFS, check output from cvmfs_config probe:
2022-06-28 07:05:31 (6484): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2022-06-28 07:05:31 (6484): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2022-06-28 07:05:31 (6484): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!


CVMFS (server side) is very robust.
Even if 1 server is unavailable the client tries all other servers from the list before it gives up.
This happens independently for each repository.

The pattern above points to a major network issue.
This can be on your side as well as on the Cloudflare/CERN side, but since many other computers are running fine, it's more likely that the issue is on your side.

If you run a (Linux) CVMFS client inside your LAN, you can manually run "cvmfs_config probe" from that client.
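
If you prefer to script that check, here is a minimal sketch. It assumes cvmfs_config is installed and configured on the client, and that a non-zero exit code means at least one probe failed, which you should verify on your own system.

```python
import shutil
import subprocess


def run_cvmfs_probe():
    """Run `cvmfs_config probe` and return (exit_code, output).

    Returns (None, message) if cvmfs_config is not installed on this machine.
    """
    if shutil.which("cvmfs_config") is None:
        return None, "cvmfs_config not found in PATH"
    result = subprocess.run(
        ["cvmfs_config", "probe"],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr


code, output = run_cvmfs_probe()
print(code)
print(output)
```

Running it a few times while the failing tasks are starting would show whether the LAN client sees the same failures as the VMs.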

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 15 Jun 08
Posts: 2050
Credit: 154,741,166
RAC: 140,079
Message 46950 - Posted: 28 Jun 2022, 6:27:47 UTC - in response to Message 46948.  

Guess you highlighted the wrong line.
The real issue is this:
Host system reported disk full.
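
A quick way to check the free space from a script, as a minimal sketch. The path and the 10 GiB threshold are arbitrary examples; point it at the drive that holds your BOINC data directory.

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path=".", required_gib=10.0):
    """True if at least `required_gib` GiB are free (threshold is an example)."""
    return free_gib(path) >= required_gib


print(f"{free_gib('.'):.1f} GiB free")
print(has_room(".", required_gib=10.0))
```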