Faulty Box or Faulty WUs ?
Author | Message |
---|---|
Joined: 2 Sep 04 Posts: 453 Credit: 193,464,258 RAC: 4,895 |
Hi all, I have come back to ATLAS and all my clients are running ATLAS again, but one of them produces a lot of "Validate error" results. When I trace a result, it looks as if the input file from ATLAS is empty?! Can you follow this link: https://lhcathome.cern.ch/lhcathome/results.php?userid=555&offset=0&show_names=0&state=5&appid=
Here is the pilot log:
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:16:59,709 | WARNING | queue_monitoring | pilot.util.common | should_abort | data:queue_monitoring:received graceful stop - abort after this iteration
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:02,711 | DEBUG | queue_monitoring | pilot.control.data | queue_monitoring | will not set job_aborted yet
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:02,711 | DEBUG | queue_monitoring | pilot.control.data | queue_monitoring | [data] queue_monitor thread has finished
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:45,609 | WARNING | job_monitor | pilot.control.job | check_job_monitor_waiting_time | no jobs in monitored_payloads queue (waited for 61 s)
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:45,609 | DEBUG | job_monitor | pilot.util.processes | threads_aborted | aborting since the last relevant thread is about to finish
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:45,609 | DEBUG | job_monitor | pilot.control.job | job_monitor | will proceed to set job_aborted
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:45,609 | DEBUG | job_monitor | pilot.control.job | job_monitor | [job] job monitor thread has finished
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:46,458 | INFO | MainThread | pilot.workflow.generic | run | end of generic workflow (traces error code: 0)
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:46,458 | INFO | MainThread | root | wrap_up | traces error code: 0
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:46,459 | INFO | MainThread | root | wrap_up | pilot has finished
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:46,495 [wrapper] ==== pilot stdout END ====
Supporting BOINC, a great concept! |
Joined: 15 Jun 08 Posts: 2401 Credit: 225,548,852 RAC: 120,847 |
The link is not allowed for other users but I guess it's this computer: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10383816 Looks like the faulty WUs produced errors on all computers they have been sent to. The interesting fact is that just one of your computers picked up all the faulty tasks. What about the tasks that are currently shown "in progress"? Did you set them on hold or are they running fine? |
Joined: 2 Sep 04 Posts: 453 Credit: 193,464,258 RAC: 4,895 |
"The link is not allowed for other users but I guess it's this computer:" You are right, that is the correct machine. "Looks like the faulty WUs produced errors on all computers they have been sent to." That is what made me think something could be wrong with this single box. "What about the tasks that are currently shown 'in progress'?" I have set them on hold, because I couldn't babysit the box. Now, after a restart, I have started one WU again to see what happens. I have also switched off the proxy setting (Squid) to see whether it has something to do with the problem. Supporting BOINC, a great concept! |
Joined: 15 Jun 08 Posts: 2401 Credit: 225,548,852 RAC: 120,847 |
It's most likely caused by a misconfigured computer from the aglt2 datacenter that returned 100% errors and swamped the project with more than 500 failed results. The root cause of the error is a missing squashfs package. https://lhcathome.cern.ch/lhcathome/result.php?resultid=289127440 All other computers getting the resends will also return failed tasks. |
Joined: 2 Sep 04 Posts: 453 Credit: 193,464,258 RAC: 4,895 |
Okay, in the end the cause was a small one. In the BOINC proxy settings, I had entered only the "short" form of the name of the machine running my Squid. Since I changed this to the fully qualified name, all seems to be fine now. So the final answer to my question is: Faulty Box ;-) Supporting BOINC, a great concept! |
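
For anyone who runs into the same short-name versus fully qualified name problem, a quick check of the proxy host name could look like the sketch below. It is only an illustration and not part of BOINC or the ATLAS pilot: the function names `is_fully_qualified` and `check_proxy_host`, the host names in the demo, and port 3128 (Squid's usual default) are assumptions. The "fully qualified" test is a simple dot heuristic, so an IP address would also pass it.

```python
import socket


def is_fully_qualified(hostname: str) -> bool:
    """Heuristic: True if the name has at least two non-empty dot-separated labels."""
    labels = hostname.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


def check_proxy_host(hostname: str, port: int = 3128, resolve=socket.getaddrinfo) -> dict:
    """Report whether a proxy host name looks fully qualified and whether it resolves.

    `resolve` defaults to socket.getaddrinfo and can be swapped out for testing.
    Port 3128 is an assumption (Squid's usual default).
    """
    report = {
        "host": hostname,
        "fully_qualified": is_fully_qualified(hostname),
        "addresses": [],
    }
    try:
        infos = resolve(hostname, port)
    except OSError:  # socket.gaierror is a subclass of OSError
        return report
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    report["addresses"] = sorted({info[4][0] for info in infos})
    return report


if __name__ == "__main__":
    # Hypothetical names; use the Squid host from your own BOINC proxy settings.
    for name in ("squidbox", "squidbox.example.org"):
        print(check_proxy_host(name))
```

Run it on the machine where the BOINC client runs. A short name that fails to resolve there, or that the check flags as not fully qualified, is a hint to enter the full name in the proxy settings instead.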
©2024 CERN