Message boards : ATLAS application : New version of ATLAS pilot code
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 39519 - Posted: 8 Aug 2019, 8:52:09 UTC

Hi all,

Today we started using a new version of the ATLAS pilot code, which controls the execution of the actual simulation task. You should not notice any difference in the tasks themselves; the only visible change may be different messages in the logs.

This new code has already been tested on the dev project, but as usual, let us know of any problems that you see.
ID: 39519
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 39541 - Posted: 9 Aug 2019, 12:58:29 UTC

There was a problem affecting native tasks: if a task restarted, it would fail immediately because of leftovers from the previous run. Version 2.60, just released, cleans up properly when a job is restarted and avoids this problem. Vbox tasks should not be affected.
ID: 39541
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,192,791
RAC: 103,819
Message 39552 - Posted: 9 Aug 2019, 23:50:27 UTC
Last modified: 9 Aug 2019, 23:51:58 UTC

In the VM console over RDP, ALT+F2 shows no collisions, only the first line:
Event processing information will appear here
ALT+F3 is OK.
This is with BOINC under Windows.
ID: 39552
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 39553 - Posted: 10 Aug 2019, 2:38:45 UTC - in response to Message 39541.  

Version 2.60 was just released ...
Was it tested on SL6? hostid=10563873 started getting validation errors with the 2.60 tasks...
ID: 39553
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 39576 - Posted: 12 Aug 2019, 9:22:29 UTC - in response to Message 39553.  

Version 2.60 was just released ...
Was it tested on SL6? hostid=10563873 started getting validation errors with the 2.60 tasks...


I suspect this is related to the new pilot version rather than v2.60.

OS:Scientific Linux release 6.4 (Carbon)


This is a rather old version of SLC6 (latest is 6.10) so it could be that it is not supported by the new pilot. Could you try to check in the log....job.log.1 file for errors?
ID: 39576
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 39577 - Posted: 12 Aug 2019, 10:13:10 UTC - in response to Message 39576.  

Hi, thanks for looking into this...

Version 2.60 was just released ...
Was it tested on SL6? hostid=10563873 started getting validation errors with the 2.60 tasks...
I suspect this is related to the new pilot version rather than v2.60.

Possibly, since the last few 2.59 tasks failed as well, but it's definitely something that kicked in only late on Friday.

OS:Scientific Linux release 6.4 (Carbon)

This is a rather old version of SLC6 (latest is 6.10) so it could be that it is not supported by the new pilot. Could you try to check in the log....job.log.1 file for errors?


Can you remind me where that lives please?

(It may be a couple of days before I can efficiently get some more Atlas jobs on that host and watch them, though)
ID: 39577
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 39578 - Posted: 12 Aug 2019, 11:56:30 UTC - in response to Message 39577.  


OS:Scientific Linux release 6.4 (Carbon)

This is a rather old version of SLC6 (latest is 6.10) so it could be that it is not supported by the new pilot. Could you try to check in the log....job.log.1 file for errors?


Can you remind me where that lives please?

(It may be a couple of days before I can efficiently get some more Atlas jobs on that host and watch them, though)


On the host you linked above it would be in the directory /data/henry/BOINC-HN/slots/n/ where n is the slot number.

I tested a task on an SLC6 (6.10) machine here at CERN and it ran ok: https://lhcathome.cern.ch/lhcathome/result.php?resultid=238971823

At least it started crunching, but I had to abort it after a few minutes since I'm not supposed to run heavy tasks there :) Your tasks all exited after a few seconds, so I suspect something in the setup phase. I saw similar things on an SLC6.9 host from our top cruncher AGLT2, so I will also investigate with them.
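
For anyone wanting to check several slots at once, here is a minimal sketch that counts suspicious lines in the job logs under a BOINC slots directory. The function name `scan_job_logs`, the file glob `log.*job.log.1`, and the keywords ERROR, FATAL and CRITICAL are all assumptions for illustration; the actual log name and wording may differ, so adjust them to what you find on your host.

```shell
#!/bin/sh
# Sketch only: count lines that look like errors in job logs under a slots directory.
# Assumptions (not confirmed by the project): the log name matches the glob
# 'log.*job.log.1' and problems are flagged with ERROR, FATAL or CRITICAL.

# scan_job_logs SLOTS_DIR
# Prints "path: count" for every matching log file, sorted by path.
scan_job_logs() {
    slots_dir=$1
    find "$slots_dir" -type f -name 'log.*job.log.1' | sort |
    while IFS= read -r f; do
        # grep -c prints 0 and returns non-zero when nothing matches; ignore that status.
        n=$(grep -c -E 'ERROR|FATAL|CRITICAL' "$f" || true)
        echo "$f: $n"
    done
}
```

On the host discussed here it could be called as `scan_job_logs /data/henry/BOINC-HN/slots`. A non-zero count only says where to start reading, not what went wrong.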
ID: 39578
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 39605 - Posted: 14 Aug 2019, 10:20:49 UTC - in response to Message 39578.  

(It may be a couple of days before I can efficiently get some more Atlas jobs on that host and watch them, though)
With the usual irony, the Atlas task I pulled down on Monday managed to get itself to the front of the queue and started running...

It's only a single-core one but has already clocked up over 20 hours... so I'm now concerned about it for the opposite reason!
running: iteration=7570 pid=24722 exit_code=None
- is there a way to tell what the target iteration count is?

(Next I need to reboot the machine for a kernel update and will then see if/how 8-core jobs get on.)
ID: 39605
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 223,035,915
RAC: 136,691
Message 39606 - Posted: 14 Aug 2019, 11:29:26 UTC - in response to Message 39605.  

According to the log it's a native task.
To get the status of all native tasks a client is currently running you may open a console window and cd to the slots folder.
Then run the following command (including all spaces and quotes):
watch -n10 "find . \( -name \"log.EVNTtoHITS\" -o -name \"AthenaMP.log\" \) |sort |xargs -I {} -n1 sh -c \"egrep 'INFO.*Event nr. ' {} |tail -n1\""

The output should look like this:
Every 10,0s: find . \( -name "log.EVNTtoHITS" -o -name "AthenaMP.log" \) |sort |xargs -I {} -n1 sh -c "egrep 'INFO.*Event n...  hostname: Wed Aug 14 13:16:05 2019

2019-08-14 13:15:46,358 ISFG4SimSvc          INFO        Event nr. 72 took 87.5 s. New average 133.5 +- 8.286

Single-core tasks finish a few minutes after 200 events have been processed.
n-core tasks show n lines and finish once the sum of events reaches 200.

As ATLAS native doesn't support snapshots, every cancelled task restarts from scratch.
Keep that in mind before you reboot the machine.
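
The 200-event rule above can be turned into a small progress check. This is a rough sketch, not project tooling: the function names are made up, it assumes the log lines look like the sample output shown earlier, and it treats the number in the last "Event nr." line of each worker log as that worker's count of processed events.

```shell
#!/bin/sh
# Sketch: estimate progress of one ATLAS native task from its worker logs.
# Assumptions: lines look like
#   ... ISFG4SimSvc INFO Event nr. 72 took 87.5 s. New average 133.5 +- 8.286
# and the task is complete once its workers have processed 200 events in total.

TARGET_EVENTS=200

# Print the event number from the last "Event nr." line of one log file (0 if none).
last_event_nr() {
    awk '/INFO.*Event nr\. / {
             for (i = 1; i < NF; i++)
                 if ($i == "Event" && $(i + 1) == "nr.") n = $(i + 2)
         }
         END { print n + 0 }' "$1"
}

# Sum the last event numbers over all log files given as arguments.
events_done() {
    total=0
    for f in "$@"; do
        n=$(last_event_nr "$f")
        total=$((total + n))
    done
    echo "$total"
}

# Print "done/target (percent%)" for the given worker logs.
progress() {
    d=$(events_done "$@")
    echo "$d/$TARGET_EVENTS ($((100 * d / TARGET_EVENTS))%)"
}
```

Pass it the log files that belong to a single task, for example the `log.EVNTtoHITS` or `AthenaMP.log` files that the `find` command locates in one slot. Mixing logs from different tasks would add up unrelated counts.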
ID: 39606
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 39607 - Posted: 14 Aug 2019, 12:23:58 UTC - in response to Message 39606.  

According to the log it's a native task.
To get the status of all native tasks a client is currently running you may open a console window and cd to the slots folder.
Then run the following command (including all spaces and quotes):
watch -n10 "find . \( -name \"log.EVNTtoHITS\" -o -name \"AthenaMP.log\" \) |sort |xargs -I {} -n1 sh -c \"egrep 'INFO.*Event nr. ' {} |tail -n1\""

Found it, thanks!
3:01:23 ISFG4SimSvc          INFO       Event nr. 154 took 694.4 s. New average 538.8 +- 17.1

As ATLAS native doesn't support snapshots every cancelled task would restart from the scratch.
You may keep that in mind before you reboot the machine.

I know - that's why I was hoping it would finish today when I've got time to play with that machine. Looks like it's going to be another 7 hours, so another couple of days before I have another chance to play about with it.
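
For reference, the 7-hour figure follows from the log line quoted above: with 154 of 200 events done and an average of 538.8 s per event, (200 - 154) * 538.8 s is about 24,785 s. A minimal sketch of that arithmetic, assuming a single-core task, a 200-event target, and that future events take about as long as the reported average (the function name `eta_hours` is made up for illustration):

```shell
#!/bin/sh
# Sketch: hours remaining for a single-core ATLAS native task.
# Assumptions: the task ends shortly after 200 events, and remaining events
# take about as long as the "New average" value reported in the log.

# eta_hours EVENTS_DONE AVG_SECONDS_PER_EVENT [TARGET_EVENTS]
eta_hours() {
    # LC_ALL=C keeps the decimal point a dot regardless of the system locale.
    LC_ALL=C awk -v n="$1" -v avg="$2" -v target="${3:-200}" \
        'BEGIN { printf "%.1f\n", (target - n) * avg / 3600 }'
}

eta_hours 154 538.8   # prints 6.9
```

For an n-core task the workers run in parallel, so the wall-clock time left would be roughly this figure divided by n; that is an approximation, not something the logs state directly.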
ID: 39607
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 387
Credit: 15,314,184
RAC: 0
Message 39617 - Posted: 15 Aug 2019, 12:33:14 UTC - in response to Message 39552.  

VM-Console with RDP shows with ALT+F2 no collisions, only the first line.


This was due to a change in the internal directory structure of the running tasks in the new pilot, so the script that finds the time per event wasn't working. It's fixed now and should be working for all work units (WUs) in a few hours.
ID: 39617
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,192,791
RAC: 103,819
Message 39618 - Posted: 15 Aug 2019, 16:51:11 UTC - in response to Message 39617.  

The Collisions are back in RDP.
Thank you, David.
ID: 39618
Henry Nebrensky

Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 39654 - Posted: 19 Aug 2019, 16:04:04 UTC - in response to Message 39607.  

... I was hoping it would finish today when I've got time to play with that machine. Looks like it's going to be another 7 hours, so another couple of days before I have another chance to play about with it.
It took a while to get a steady stream of Atlas-native tasks running again on that SL6 machine, but they do look to be finishing correctly now.

Unfortunately it's not a good use of my time to upgrade that machine, so I'll put it back on Sixtrack when Atlas moves off SL6. I already have a CentOS 7 box also running Atlas-native anyway.
ID: 39654
