Message boards : ATLAS application : Bad WUs?

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 45540 - Posted: 26 Oct 2021, 7:17:17 UTC - in response to Message 45539.  

ID: 45540
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,359
RAC: 102,469
Message 45541 - Posted: 26 Oct 2021, 7:36:00 UTC

Tasks error out after about 10 minutes, like this:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=331216653

What's the problem?
ID: 45541
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,083,451
RAC: 105,833
Message 45542 - Posted: 26 Oct 2021, 7:42:22 UTC - in response to Message 45540.  

https://lhcathome.cern.ch/lhcathome/results.php?userid=75468&offset=0&show_names=0&state=5&appid=
No access to that link for users not logged in as 'maeax'

Maybe you mean these results: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10618519&offset=0&show_names=0&state=5&appid=

Sorry, it's still early in the morning. Yes, Erich56 has the same issue!
ID: 45542
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 45543 - Posted: 26 Oct 2021, 7:51:43 UTC - in response to Message 45542.  

ID: 45543
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,083,451
RAC: 105,833
Message 45544 - Posted: 26 Oct 2021, 7:58:48 UTC - in response to Message 45543.  
Last modified: 26 Oct 2021, 8:09:40 UTC

[2021-10-26 08:27:39] "exeErrorDiag": "CVMFS DBRelease setup file /cvmfs/atlas.cern.ch/repo/sw/database/DBRelease/current/setup.py was not readable",

Normally 5k, but now 26058 unsent ATLAS tasks!
ID: 45544
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,872,986
RAC: 137,260
Message 45545 - Posted: 26 Oct 2021, 8:09:02 UTC
Last modified: 26 Oct 2021, 8:12:41 UTC

Meanwhile mine are also affected.
Sent a mail to David Cameron.

It looks like a link to the directory
/cvmfs/atlas.cern.ch/repo/sw/database/DBRelease/current/
is missing on the CVMFS repository.
That link should point to the most recent DBRelease directory.

<edit>
Better:
A link pointing from
/cvmfs/atlas.cern.ch/repo/sw/database/DBRelease/current/
to
the most recent DBRelease directory.
</edit>
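
A minimal sketch of how one might check this on a machine with CVMFS mounted. The path comes from the error message in this thread; the function name `check_dbrelease` and its status strings are only illustrative, not part of any ATLAS or CVMFS tooling.

```python
import os

# Path taken from the exeErrorDiag message quoted in this thread.
CURRENT = "/cvmfs/atlas.cern.ch/repo/sw/database/DBRelease/current"


def check_dbrelease(current=CURRENT):
    """Return a short status string for the DBRelease 'current' link (hypothetical helper)."""
    setup_py = os.path.join(current, "setup.py")
    # lexists() is True even for a dangling symlink, exists() is not.
    if not os.path.lexists(current):
        return "missing: " + current
    if os.path.islink(current) and not os.path.exists(current):
        return "dangling link: " + current
    if not os.access(setup_py, os.R_OK):
        return "not readable: " + setup_py
    return "ok: " + os.path.realpath(current)


if __name__ == "__main__":
    print(check_dbrelease())
```

If the link is missing, this reports "missing: ..." rather than raising, which matches the symptom described above.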
ID: 45545
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,083,451
RAC: 105,833
Message 45546 - Posted: 26 Oct 2021, 8:10:38 UTC - in response to Message 45544.  

Normally 5k, but now 26058 unsent ATLAS tasks!
ID: 45546
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,872,986
RAC: 137,260
Message 45547 - Posted: 26 Oct 2021, 8:26:43 UTC - in response to Message 45546.  

And a correspondingly high number of ATLAS tasks in progress, which may be the reason why the ATLAS download speed dropped to a poor 25% of the usual speed this morning.
ID: 45547
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,083,451
RAC: 105,833
Message 45548 - Posted: 26 Oct 2021, 8:41:00 UTC - in response to Message 45547.  

We are only poor Tier 3 users. Maybe the traffic is going to T0-T2 now; 33k ATLAS tasks are in progress.
ID: 45548
mmonnin
Joined: 22 Mar 17
Posts: 55
Credit: 10,223,976
RAC: 2,477
Message 45549 - Posted: 26 Oct 2021, 9:41:27 UTC

1.1GB and 250MB downloads just for them to be invalid in 3 min.

The download speed is slow because people are constantly downloading huge files. I had a PC with plenty of work last night that is now waiting on downloads, as tasks were completing before more work could be downloaded.
ID: 45549
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,083,451
RAC: 105,833
Message 45550 - Posted: 26 Oct 2021, 9:47:55 UTC - in response to Message 45549.  

Stopping ATLAS downloads is the best option at the moment, until there is an answer from CERN IT.
ID: 45550
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,083,451
RAC: 105,833
Message 45551 - Posted: 26 Oct 2021, 10:20:25 UTC - in response to Message 45550.  

2021-10-26 12:09:08 (22368): Guest Log: No HITS file was produced
2021-10-26 12:09:08 (22368): Guest Log: Successfully finished the ATLAS job!
https://lhcathome.cern.ch/lhcathome/result.php?resultid=331199612
Completed ATLAS tasks don't generate a HITS file!
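
A small sketch for spotting this pattern in a saved stderr text. The two marker strings are copied from the log lines above; the function name `finished_without_hits` is made up for illustration.

```python
def finished_without_hits(stderr_text):
    """True if the log reports success but also says no HITS file was produced."""
    no_hits = "No HITS file was produced" in stderr_text
    finished = "Successfully finished the ATLAS job!" in stderr_text
    return no_hits and finished


# The two log lines quoted in this post.
sample = """\
2021-10-26 12:09:08 (22368): Guest Log: No HITS file was produced
2021-10-26 12:09:08 (22368): Guest Log: Successfully finished the ATLAS job!
"""

print(finished_without_hits(sample))
```

A task flagged this way ran to the end but produced no usable result, so the "success" line alone is not a reliable indicator.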
ID: 45551
David Cameron
Project administrator
Project developer
Project scientist
Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 45552 - Posted: 26 Oct 2021, 10:25:47 UTC

There was a cleanup this morning of some “legacy” files on CVMFS, and it turns out those were not legacy at all but were used by most ATLAS tasks. This has just been rolled back, but it may take a little while to propagate to CVMFS clients. Sorry for this unforeseen mess.
ID: 45552
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,872,986
RAC: 137,260
Message 45553 - Posted: 26 Oct 2021, 10:43:15 UTC - in response to Message 45552.  

At least the missing link is back on CVMFS.
Just got an ATLAS task that started fine.
ID: 45553
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,359
RAC: 102,469
Message 45554 - Posted: 26 Oct 2021, 12:45:09 UTC - in response to Message 45546.  

maeax wrote this morning:
Normally 5k, but now 26058 unsent ATLAS tasks!
Now, in the afternoon, the project status page shows 45,753 unsent tasks.
ID: 45554
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,083,451
RAC: 105,833
Message 45555 - Posted: 26 Oct 2021, 13:11:44 UTC - in response to Message 45554.  
Last modified: 26 Oct 2021, 13:14:03 UTC

We as Tier 3 are too small to do so much work.
The scheduler can handle hundreds of thousands of SixTrack tasks, so
AgileBoincers, MPI, and other institutes will be doing this in the near future.
https://lhcathome.cern.ch/lhcathome/top_users.php
ID: 45555
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 45556 - Posted: 26 Oct 2021, 13:42:49 UTC - in response to Message 45554.  

maeax wrote this morning:
Normally 5k, but now 26058 unsent ATLAS tasks!
Now, in the afternoon, the project status page shows 45,753 unsent tasks.

I suppose most of those unsent tasks are resends because of the initial validate errors.
ID: 45556
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,359
RAC: 102,469
Message 45567 - Posted: 28 Oct 2021, 11:36:18 UTC

In the recent past, I received WUs that may have been misconfigured, like this one:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=331283437

As seen, the CPU was used for only 1.5 minutes, and for the remaining time the WU ran "idle".

Unfortunately, I did not notice it immediately, but only after the BOINC manager showed an unusually long runtime. A check with the Windows task manager showed that there was CPU usage for only 3 instead of 4 WUs (I run 4 WUs concurrently, 3 cores each).
Further, the VM console could not be opened; however, the VM_image.vdi was still in the slot directory.
Hence, I aborted the WU manually via the BOINC manager.

It is too bad that in such a case the WU does not stop automatically (the same problem, BTW, exists with faulty Theory WUs: they would continue running forever if one does not notice in time that something is wrong).
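
As a rough illustration of the kind of automatic check meant here, one could compare a task's CPU time with its elapsed time. This is only a sketch: the function `looks_idle` and both thresholds are assumptions chosen for the example, not anything BOINC or vboxwrapper actually implements.

```python
def looks_idle(cpu_seconds, elapsed_seconds, min_elapsed=1800, min_ratio=0.05):
    """Heuristic: flag a task whose CPU time is tiny compared to its runtime.

    min_elapsed and min_ratio are hypothetical thresholds for illustration.
    """
    if elapsed_seconds < min_elapsed:
        # Too early to judge; startup phases are legitimately light on CPU.
        return False
    return cpu_seconds / elapsed_seconds < min_ratio


# 1.5 minutes of CPU time after 4 hours of runtime would be flagged.
print(looks_idle(cpu_seconds=90, elapsed_seconds=4 * 3600))
```

The grace period matters because a healthy VM task also uses little CPU while it boots and downloads its input.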
ID: 45567
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,083,451
RAC: 105,833
Message 45568 - Posted: 28 Oct 2021, 12:05:01 UTC - in response to Message 45567.  
Last modified: 28 Oct 2021, 12:15:32 UTC

This was the reason for me to stop running ATLAS on Windows in the past.
Sometimes the start gets stuck, and no one knows why the idle phase isn't stopped.
Native-VM doesn't have this problem.
Yes, ATLAS needs some watching, but so does Theory (Sherpa, for example).
Edit: vboxwrapper for ATLAS: 2021-10-28 05:25:09 (8928): Detected: vboxwrapper 26197
vboxwrapper for CMS: 2021-10-24 20:58:14 (7992): Detected: vboxwrapper 26202
The CMS vboxwrapper was changed by Laurence in the last few weeks.
Don't know if this is helpful.
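
To compare wrapper versions across several task logs, the number can be pulled out of the "Detected:" line. A minimal sketch, assuming the line format shown above; the helper name `vboxwrapper_version` is made up.

```python
import re

# Matches the "Detected: vboxwrapper NNNNN" part of a stderr line.
DETECTED = re.compile(r"Detected: vboxwrapper (\d+)")


def vboxwrapper_version(line):
    """Return the vboxwrapper version as an int, or None if the line has none."""
    match = DETECTED.search(line)
    return int(match.group(1)) if match else None


atlas = "2021-10-28 05:25:09 (8928): Detected: vboxwrapper 26197"
cms = "2021-10-24 20:58:14 (7992): Detected: vboxwrapper 26202"
print(vboxwrapper_version(atlas), vboxwrapper_version(cms))
```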
ID: 45568
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,335,359
RAC: 102,469
Message 45747 - Posted: 25 Nov 2021, 15:55:44 UTC
Last modified: 25 Nov 2021, 15:58:03 UTC

Within the past few hours, I had several tasks in a row where CPU usage was less than 1 minute, but the task kept running forever.
See here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=334449376

If I interpret the stderr correctly, the problem was

2021-11-25 14:19:01 (27532): Guest Log: 00:00:10.004806 timesync vgsvcTimeSyncWorker: Radical guest time change: -3 587 877 475 000ns (GuestNow=1 637 846 340 764 851 000 ns GuestLast=1 637 849 928 642 326 000 ns fSetTimeLastLoop=true )

Has anyone else had the same experience?
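
For reference, the size of that time change can be read off the log line directly: the value is in nanoseconds with spaces as digit separators, and it works out to the guest clock being set back by almost exactly one hour. A small sketch of the conversion; the function name `guest_time_jump_seconds` is hypothetical and the regex only assumes the line format quoted above.

```python
import re

# Captures the signed, space-grouped nanosecond value before "ns".
PATTERN = re.compile(r"Radical guest time change:\s*(-?[\d\s]+?)\s*ns")


def guest_time_jump_seconds(line):
    """Return the reported guest time change in seconds, or None if absent."""
    match = PATTERN.search(line)
    if match is None:
        return None
    digits = "".join(match.group(1).split())  # drop the grouping spaces
    return int(digits) / 1e9


line = ("timesync vgsvcTimeSyncWorker: Radical guest time change: "
        "-3 587 877 475 000ns (GuestNow=1 637 846 340 764 851 000 ns "
        "GuestLast=1 637 849 928 642 326 000 ns fSetTimeLastLoop=true )")

jump = guest_time_jump_seconds(line)
print(round(jump / 60, 1), "minutes")
```

The same figure follows from GuestNow minus GuestLast, so the log line is internally consistent.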
ID: 45747