Message boards : ATLAS application : Faulty Box or Faulty WUs ?

Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,464,258
RAC: 4,895
Message 43633 - Posted: 17 Nov 2020, 14:37:02 UTC

Hi all,

I have come back to ATLAS and all my clients are running ATLAS again, but one of them produces a lot of "Validate error" results.

If I track the result, it looks as if the input file from ATLAS is empty?!

Can you follow this link: https://lhcathome.cern.ch/lhcathome/results.php?userid=555&offset=0&show_names=0&state=5&appid=

Here is the Pilot-Log:

2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:16:59,709 | WARNING | queue_monitoring | pilot.util.common | should_abort | data:queue_monitoring:received graceful stop - abort after this iteration
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:02,711 | DEBUG | queue_monitoring | pilot.control.data | queue_monitoring | will not set job_aborted yet
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:02,711 | DEBUG | queue_monitoring | pilot.control.data | queue_monitoring | [data] queue_monitor thread has finished
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:45,609 | WARNING | job_monitor | pilot.control.job | check_job_monitor_waiting_time | no jobs in monitored_payloads queue (waited for 61 s)
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:45,609 | DEBUG | job_monitor | pilot.util.processes | threads_aborted | aborting since the last relevant thread is about to finish
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:45,609 | DEBUG | job_monitor | pilot.control.job | job_monitor | will proceed to set job_aborted
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:45,609 | DEBUG | job_monitor | pilot.control.job | job_monitor | [job] job monitor thread has finished
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:46,458 | INFO | MainThread | pilot.workflow.generic | run | end of generic workflow (traces error code: 0)
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:46,458 | INFO | MainThread | root | wrap_up | traces error code: 0
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:46,459 | INFO | MainThread | root | wrap_up | pilot has finished
2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:46,495 [wrapper] ==== pilot stdout END ====
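
For anyone who wants to scan a log like the one above for warnings, here is a minimal Python sketch. It assumes the field layout that can be read off the lines themselves (timestamp, level, thread, module, function, message, separated by " | " after the "Guest Log: " prefix); this is inferred from the excerpt, not taken from the pilot documentation. `PilotRecord` and `parse_pilot_line` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple, Optional


class PilotRecord(NamedTuple):
    """One pilot log entry; field names are assumptions based on the excerpt."""
    timestamp: str
    level: str
    thread: str
    module: str
    function: str
    message: str


def parse_pilot_line(line: str) -> Optional[PilotRecord]:
    """Return a PilotRecord, or None if the line does not match the layout."""
    # Everything before "Guest Log: " is the BOINC wrapper's own prefix.
    _, sep, guest = line.partition("Guest Log: ")
    if not sep:
        return None
    # Split into at most 6 fields so a " | " inside the message is preserved.
    parts = [p.strip() for p in guest.split(" | ", 5)]
    if len(parts) != 6:
        return None  # e.g. the "[wrapper] ==== pilot stdout END ====" line
    return PilotRecord(*parts)


SAMPLE = [
    "2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:45,609 | WARNING | job_monitor | pilot.control.job | check_job_monitor_waiting_time | no jobs in monitored_payloads queue (waited for 61 s)",
    "2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:46,459 | INFO | MainThread | root | wrap_up | pilot has finished",
    "2020-11-17 12:17:47 (6324): Guest Log: 2020-11-17 11:17:46,495 [wrapper] ==== pilot stdout END ====",
]

records = [parse_pilot_line(line) for line in SAMPLE]
warnings = [r for r in records if r is not None and r.level == "WARNING"]

for w in warnings:
    print(w.timestamp, w.function, w.message)
```

In this excerpt the warning that stands out is "no jobs in monitored_payloads queue", which fits the impression that the task never received a payload to work on.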


Supporting BOINC, a great concept !
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,548,852
RAC: 120,847
Message 43634 - Posted: 17 Nov 2020, 15:24:53 UTC - in response to Message 43633.  
Last modified: 17 Nov 2020, 15:25:32 UTC

The link is not accessible to other users, but I guess it's this computer:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10383816

It looks like the faulty WUs produced errors on all computers they have been sent to.
The interesting fact is that it's just one of yours that picked up all the faulty tasks.

What about the tasks that are currently shown "in progress"?
Did you set them on hold or are they running fine?
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,464,258
RAC: 4,895
Message 43635 - Posted: 17 Nov 2020, 16:24:18 UTC - in response to Message 43634.  

> The link is not accessible to other users, but I guess it's this computer:
> https://lhcathome.cern.ch/lhcathome/results.php?hostid=10383816
You are right, that is the correct machine.

> It looks like the faulty WUs produced errors on all computers they have been sent to.
> The interesting fact is that it's just one of yours that picked up all the faulty tasks.
That is what made me think something could be wrong with this single box.

> What about the tasks that are currently shown "in progress"?
> Did you set them on hold or are they running fine?
I have set them on hold, because I couldn't babysit the box.

Now, after a restart, I have started one WU again to see what happens. I have also switched off the proxy setting that points to Squid, to see if this has something to do with the problem.


Supporting BOINC, a great concept !
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,548,852
RAC: 120,847
Message 43637 - Posted: 17 Nov 2020, 16:51:54 UTC

It's most likely caused by a misconfigured computer from the aglt2 datacenter that returned 100% errors and swamped the project with >500 failed results.
The root cause of the error is a missing squashfs package.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=289127440

All other computers getting the resends will also return failed tasks.
Yeti
Volunteer moderator
Joined: 2 Sep 04
Posts: 453
Credit: 193,464,258
RAC: 4,895
Message 43647 - Posted: 18 Nov 2020, 14:30:17 UTC - in response to Message 43637.  

Okay, in the end it was only a small thing.

In the BOINC proxy settings, I had only entered the "short" name of the machine running my Squid. Since I changed this to the fully qualified name, all seems to be fine now.

So, the final answer to my question is: Faulty Box ;-)
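
In case it helps someone with the same symptom, here is a small Python sketch for checking a proxy host name before entering it in the BOINC proxy settings. It rests on an assumption that is not confirmed in this thread: a short name may resolve on the host (thanks to its DNS search domain) but not in an environment that lacks that search domain, such as a virtual machine. The host names `squid` and `squid.example.org` and both function names are hypothetical.

```python
import socket


def is_fully_qualified(host: str) -> bool:
    """True if the name has at least two non-empty dot-separated labels.

    An IP literal such as 192.168.1.10 also passes; that is acceptable here
    because it does not depend on a DNS search domain either.
    """
    labels = host.rstrip(".").split(".")
    return len(labels) >= 2 and all(labels)


def can_resolve(host: str) -> bool:
    """True if the local resolver can turn the name into an address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def check_proxy_host(host: str) -> str:
    """Return a short human-readable verdict for a proxy host name."""
    if not is_fully_qualified(host):
        return f"{host}: short name, may not resolve without a DNS search domain"
    if not can_resolve(host):
        return f"{host}: fully qualified, but does not resolve from this machine"
    return f"{host}: fully qualified and resolvable"
```

Calling `check_proxy_host("squid")` flags the short name without touching the network, while a fully qualified name is additionally looked up with the local resolver. A name that passes on the host can still fail elsewhere, so this is only a first sanity check.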


Supporting BOINC, a great concept !


