New version of ATLAS pilot code
Author | Message |
---|---|
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Hi all, today we started using a new version of the ATLAS pilot code - this is the code which controls the execution of the actual simulation task. You should not notice any difference in the tasks themselves; the only thing you may notice is different messages in the logs. This new code has already been tested on the dev project, but as usual, let us know of any problems that you see. |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
There was a problem affecting native tasks: if a task restarted, it would fail immediately due to leftovers from the previous run. Version 2.60, which was just released, cleans up properly when a job is restarted and avoids this problem. Vbox tasks should not be affected. |
Joined: 2 May 07 Posts: 2071 Credit: 156,192,791 RAC: 103,819 |
In the VM console over RDP, ALT+F2 shows no collisions, only the first line: "Event processing information will appear here". ALT+F3 is OK. This is in BOINC under Windows. |
Joined: 13 Jul 05 Posts: 165 Credit: 14,925,288 RAC: 34 |
"Version 2.60 was just released ..." Was it tested on SL6? hostid=10563873 started getting validation errors with the 2.60 tasks... |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
"Version 2.60 was just released ... Was it tested on SL6? hostid=10563873 started getting validation errors with the 2.60 tasks..." I suspect this is related to the new pilot version rather than v2.60. "OS: Scientific Linux release 6.4 (Carbon)" This is a rather old version of SLC6 (the latest is 6.10), so it could be that it is not supported by the new pilot. Could you check the log....job.log.1 file for errors? |
Joined: 13 Jul 05 Posts: 165 Credit: 14,925,288 RAC: 34 |
Hi, thanks for looking into this... "I suspect this is related to the new pilot version rather than v2.60." Possibly, since the last few 2.59 tasks failed as well, but it's definitely something that kicked in only late on Friday. "OS: Scientific Linux release 6.4 (Carbon)" Can you remind me where that lives, please? (It may be a couple of days before I can efficiently get some more Atlas jobs on that host and watch them, though.) |
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
On the host you linked above it would be in the directory /data/henry/BOINC-HN/slots/n/ where n is the slot number. I tested a task on an SLC6 (6.10) machine here at CERN and it ran OK: https://lhcathome.cern.ch/lhcathome/result.php?resultid=238971823 At least it started crunching, but I had to abort it after a few minutes since I'm not supposed to run heavy tasks there :) Your tasks all exited after a few seconds, so I suspect a problem in the setup phase. I saw similar things on an SLC6.9 host from our top cruncher AGLT2, so I will also investigate with them. |
Joined: 2 May 07 Posts: 2071 Credit: 156,192,791 RAC: 103,819 |
This is an SL 6.10 VM running Atlas-native: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10496403 |
Joined: 13 Jul 05 Posts: 165 Credit: 14,925,288 RAC: 34 |
"(It may be a couple of days before I can efficiently get some more Atlas jobs on that host and watch them, though)" With the usual irony, the Atlas task I pulled down on Monday managed to get itself to the front of the queue and started running... It's only a single-core one but has already clocked up over 20 hours... so I'm now concerned about it for the opposite reason! The log shows "running: iteration=7570 pid=24722 exit_code=None" - is there a way to tell what the target iteration count is? (Next I need to reboot the machine for a kernel update and will then see if/how 8-core jobs get on.) |
Joined: 15 Jun 08 Posts: 2386 Credit: 223,035,915 RAC: 136,691 |
According to the log it's a native task. To get the status of all native tasks a client is currently running, open a console window and cd to the slots folder. Then run the following command (including all spaces and quotes): watch -n10 "find . \( -name \"log.EVNTtoHITS\" -o -name \"AthenaMP.log\" \) |sort |xargs -I {} -n1 sh -c \"egrep 'INFO.*Event nr. ' {} |tail -n1\"" The output should look like this: Every 10,0s: find . \( -name "log.EVNTtoHITS" -o -name "AthenaMP.log" \) |sort |xargs -I {} -n1 sh -c "egrep 'INFO.*Event n... hostname: Wed Aug 14 13:16:05 2019 2019-08-14 13:15:46,358 ISFG4SimSvc INFO Event nr. 72 took 87.5 s. New average 133.5 +- 8.286 Tasks running as single-core finish a few minutes after 200 events have been processed. n-core tasks show n lines and finish after the sum of events has reached 200. As ATLAS native doesn't support snapshots, every cancelled task restarts from scratch. Keep that in mind before you reboot the machine. |
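The completion rule in the post above (a task is done once the event counts across its logs sum to 200) can be turned into a rough progress check. What follows is a minimal sketch and not part of the ATLAS or BOINC tooling: the function names last_event_nr and events_done are made up for illustration, the line format is assumed from the sample output above, and the total of 200 events is taken from the post rather than read from the task.

```shell
#!/bin/sh
# Hypothetical helpers; the log line format is assumed from the sample
# output quoted in the thread, e.g.
#   ... ISFG4SimSvc INFO Event nr. 72 took 87.5 s. New average 133.5 +- 8.286

# Assumption from the thread: a task processes 200 events in total.
TOTAL_EVENTS=200

# Print the event number from the last "Event nr." line of one log file.
# Prints 0 if the file has no such line yet.
last_event_nr() {
    awk '/INFO.*Event nr\. / {
             # The number is the field right after the literal "nr."
             for (i = 1; i < NF; i++) if ($i == "nr.") n = $(i + 1)
         }
         END { print n + 0 }' "$1"
}

# Sum the latest event numbers over all matching logs below a directory.
# For an n-core task this adds up the n per-worker counts.
events_done() {
    find "$1" -type f \( -name 'log.EVNTtoHITS' -o -name 'AthenaMP.log' \) |
    while IFS= read -r f; do
        last_event_nr "$f"
    done |
    awk '{ sum += $1 } END { print sum + 0 }'
}

# Only report when a directory is given on the command line.
if [ "$#" -ge 1 ]; then
    echo "events done: $(events_done "$1") of $TOTAL_EVENTS"
fi
```

Saved under a name of your choice and run with one slot directory as its argument (for example slots/n, where n is the slot number), it prints a single summary line instead of one line per worker.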
Joined: 13 Jul 05 Posts: 165 Credit: 14,925,288 RAC: 34 |
"According to the log it's a native task." Found it, thanks! 3:01:23 ISFG4SimSvc INFO Event nr. 154 took 694.4 s. New average 538.8 +- 17.1 "As ATLAS native doesn't support snapshots every cancelled task would restart from the scratch." I know - that's why I was hoping it would finish today when I've got time to play with that machine. Looks like it's going to be another 7 hours, so another couple of days before I have another chance to play about with it. |
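The "another 7 hours" estimate above follows from the log line: 200 - 154 = 46 events remain, at an average of 538.8 s each. A small sketch of that arithmetic, assuming a single-core task, the 200-event total mentioned earlier in the thread, and ignoring the few minutes of finishing work after the last event; the function name remaining_hours is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate the hours left for a single-core task.
# Arguments: EVENT_NR AVG_SECONDS [TOTAL_EVENTS]
# TOTAL_EVENTS defaults to 200, the figure given in the thread.
remaining_hours() {
    # LC_ALL=C keeps the decimal point a "." regardless of the user's locale.
    LC_ALL=C awk -v n="$1" -v avg="$2" -v total="${3:-200}" \
        'BEGIN { printf "%.1f\n", (total - n) * avg / 3600 }'
}

# Values from the log line quoted above: event 154, average 538.8 s.
remaining_hours 154 538.8   # prints 6.9
```

For an n-core task the same average applies per worker, so the estimate would need to be divided by the number of workers; that case is not covered here.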
Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
"VM-Console with RDP shows with ALT+F2 no collisions, only the first line" This was due to a change in the internal directory structure of running tasks in the new pilot, so the script that finds the times per event wasn't working. It's fixed now and should be working for all WUs in a few hours. |
Joined: 2 May 07 Posts: 2071 Credit: 156,192,791 RAC: 103,819 |
The Collisions are back in RDP. Thank you, David. |
Joined: 13 Jul 05 Posts: 165 Credit: 14,925,288 RAC: 34 |
"... I was hoping it would finish today when I've got time to play with that machine. Looks like it's going to be another 7 hours, so another couple of days before I have another chance to play about with it." It took a while to get a steady stream of Atlas-native tasks running again on that SL6 machine, but they do look to be finishing correctly now. Unfortunately it's not a good use of my time to upgrade that machine, so I'll put it back on Sixtrack when Atlas moves off SL6. I already have a CentOS 7 box also running Atlas-native anyway. |
©2024 CERN