Message boards : ATLAS application : ATLAS switched to failover frontier server

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,860,504
RAC: 81,103
Message 39219 - Posted: 28 Jun 2019, 8:27:52 UTC

Since this morning I have noticed a few thousand requests to ccfrontier.in2p3.fr, which usually only happens if something is wrong with lcgft-atlas.gridpp.rl.ac.uk.
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 321
Credit: 10,712,129
RAC: 6,040
Message 39244 - Posted: 2 Jul 2019, 14:29:38 UTC - in response to Message 39219.  

There is a problem with that server caused by overloading in the last few days (not from LHC@Home but other tasks). Experts are investigating. But in the meantime the failover should work ok.
Dark Angel

Joined: 7 Aug 11
Posts: 8
Credit: 2,975,587
RAC: 3,096
Message 39289 - Posted: 5 Jul 2019, 22:07:46 UTC

Yeah, not so much. I've had to abort several work unit transfers as they just keep slowing down till they stop and have to be repeatedly restarted. When it happens BOINC blocks the project and won't attempt any more transfers or request any more work until the download completes.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,860,504
RAC: 81,103
Message 39290 - Posted: 6 Jul 2019, 4:20:46 UTC - in response to Message 39289.  

Yeah, not so much. I've had to abort several work unit transfers as they just keep slowing down till they stop and have to be repeatedly restarted. When it happens BOINC blocks the project and won't attempt any more transfers or request any more work until the download completes.

What does "work unit transfers" mean for you?
Download of EVNT files or upload of HITS files?
Those transfers never go via lcgft-atlas.gridpp.rl.ac.uk or ccfrontier.in2p3.fr.

As long as your firewall allows outgoing traffic to destination ports TCP 3128 (-> lcgft) and TCP 23128 (-> ccfrontier) you should not see any differences in speed.
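
A quick way to confirm that outgoing connections to both ports are allowed is to attempt a plain TCP connect. The following is only a minimal sketch in Python, assuming the hostnames and ports quoted above; the helper name `can_connect` and the 5 second timeout are illustrative choices. A successful connect shows that the port is reachable, not that Frontier requests will actually succeed.

```python
import socket

# Host/port pairs taken from the post above.
FRONTIER_ENDPOINTS = [
    ("lcgft-atlas.gridpp.rl.ac.uk", 3128),
    ("ccfrontier.in2p3.fr", 23128),
]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


if __name__ == "__main__":
    for host, port in FRONTIER_ENDPOINTS:
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} is {state}")
```

If one of the two lines reports "NOT reachable", a firewall or router rule is a likely candidate, although a server-side outage would look the same from the client.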

Your problems might have another reason.
Further help is not possible as long as your computers are hidden.
Dark Angel

Joined: 7 Aug 11
Posts: 8
Credit: 2,975,587
RAC: 3,096
Message 39311 - Posted: 7 Jul 2019, 10:10:27 UTC - in response to Message 39290.  

Yeah, not so much. I've had to abort several work unit transfers as they just keep slowing down till they stop and have to be repeatedly restarted. When it happens BOINC blocks the project and won't attempt any more transfers or request any more work until the download completes.

What does "work unit transfers" mean for you?
Download of EVNT files or upload of HITS files?
Those transfers never go via lcgft-atlas.gridpp.rl.ac.uk or ccfrontier.in2p3.fr.

As long as your firewall allows outgoing traffic to destination ports TCP 3128 (-> lcgft) and TCP 23128 (-> ccfrontier) you should not see any differences in speed.

Your problems might have another reason.
Further help is not possible as long as your computers are hidden.


That thing where work units are transferred from the CERN servers to the client computers. In the BOINC client they show up under the "Transfers" tab. I specifically mentioned downloads. I have no firewall restrictions at this time.

I currently have two ATLAS VM units that have been progressing extremely slowly for the last two days. Since they immediately stop when I suspend networking in the client, and given the repeated download failures of other work units, I am suspicious of the project servers.

I have had no trouble with any other project I have tried.

The tone of your reply indicates you have no intention of helping, so I am just going to abort them and let someone else deal with you in the future.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,860,504
RAC: 81,103
Message 39312 - Posted: 7 Jul 2019, 10:58:36 UTC - in response to Message 39311.  

I currently have two Atlas VM units ... they immediately stop when I suspend networking in the client

This is the reason why you don't see any progress.
LHC VMs need permanent network access.


The tone of your reply ...

... is always friendly as long as yours is friendly.


... and let someone else deal with you in the future.

Agreed.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,860,504
RAC: 81,103
Message 41657 - Posted: 20 Feb 2020, 13:21:47 UTC

My ATLAS tasks currently have problems getting data from atlasfrontier-ai.cern.ch and are using ccfrontier.in2p3.fr as failover.
Unfortunately the failover server also has delivery problems for some requests:
[20/Feb/2020:14:08:03 +0100] "GET http://ccfrontier.in2p3.fr:23128/ccin2p3-AtlasFrontier/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNpVjjsOwzAMQ6.izUuukEGx1NqALAeyAiOT73.L5tOgKKdHEBLZWDg6xCpb0ZFpelCx8AS6ieAiBxE6Dt-XB4X17elrVuOYW656nCe0-7BFFIZLL6sFUGQ4LuMuanfSExtD7co2B3TBRhQAlcDP-mvOHHohH.fTANWIDZb9N-0Dr1I2BA__ HTTP/1.0" 0 464 "-" "-" TCP_MISS_ABORTED:HIER_DIRECT
.
.
.
[20/Feb/2020:14:08:55 +0100] "GET http://ccfrontier.in2p3.fr:23128/ccin2p3-AtlasFrontier/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RcPYP9QvRCHdxdYl3cQxxjPd00VRwDFZQCg71DXJ1VlJwC-L3VVByDPFxDHZxUdJTgqs0CnF0V1JA54d7uAa5KqAI6oF5QAbQbAVbdUMDS0tLE3UAVNcg1g__ HTTP/1.0" 0 416 "-" "-" TCP_MISS_ABORTED:HIER_DIRECT
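
The entries above look like proxy access log lines (the `TCP_MISS_ABORTED:HIER_DIRECT` suffix matches Squid result codes). To count how many requests were aborted, the lines can be parsed with a regular expression. This is a minimal sketch, assuming the exact field layout shown above; the names `parse_line` and `count_aborted` are illustrative, and the sample line is a shortened stand-in, not a real log entry.

```python
import re
from datetime import datetime

# Layout assumed from the lines above:
# [timestamp] "METHOD URL PROTOCOL" status bytes "referer" "agent" RESULT:HIERARCHY
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d+) (?P<size>\d+) '
    r'"[^"]*" "[^"]*" '
    r'(?P<result>\w+):(?P<hierarchy>\w+)'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = int(entry["size"])
    return entry


def count_aborted(lines):
    """Count entries whose result code ends in _ABORTED."""
    parsed = (parse_line(line) for line in lines)
    return sum(1 for e in parsed if e and e["result"].endswith("_ABORTED"))


if __name__ == "__main__":
    # Shortened, hypothetical sample in the same layout (query string left out).
    sample = (
        '[20/Feb/2020:14:08:03 +0100] '
        '"GET http://ccfrontier.in2p3.fr:23128/ccin2p3-AtlasFrontier/Frontier/'
        'type=frontier_request HTTP/1.0" '
        '0 464 "-" "-" TCP_MISS_ABORTED:HIER_DIRECT'
    )
    print(parse_line(sample))
    print(count_aborted([sample]))
```

Feeding the full log file through `count_aborted` gives a rough measure of how often the failover server did not deliver a complete response.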
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 222
Credit: 5,073,627
RAC: 311
Message 41658 - Posted: 20 Feb 2020, 13:35:58 UTC - in response to Message 41657.  

We have issues with storage used for a number of CERN computing services today. The BOINC servers are not affected, but CVMFS and other layers used by the ATLAS (and CMS) application are. Hence there are no new ATLAS tasks generated now.

Hopefully things should be back to normal soon.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,860,504
RAC: 81,103
Message 41660 - Posted: 20 Feb 2020, 13:44:49 UTC - in response to Message 41658.  

Thanks for explaining.
Will keep my fingers crossed.
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 321
Credit: 10,712,129
RAC: 6,040
Message 41663 - Posted: 20 Feb 2020, 14:32:59 UTC - in response to Message 41660.  

Unfortunately the issues Nils mentioned coincide with an ongoing planned ATLAS Frontier test where all Frontier servers are redirecting to the CERN one, so if the CERN one is not accessible then nothing will work.

In addition, the ATLAS services which submit work units (WU) to BOINC are affected, so no new tasks are being created.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,860,504
RAC: 81,103
Message 41876 - Posted: 10 Mar 2020, 21:43:04 UTC

Since this morning I have noticed failing Frontier requests that cause ATLAS tasks to use the failover Frontier server ccfrontier.in2p3.fr.
Unfortunately this server also fails occasionally, which causes the ATLAS tasks to switch back to atlasfrontier-ai.cern.ch.

This makes me guess the real problem is hidden deeper in the CERN infrastructure.
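
The back-and-forth switching described above is what a simple ordered failover list produces when both servers fail intermittently. The following is a minimal sketch of that pattern, not the actual Frontier client logic: `fetch` is a hypothetical callable standing in for a real request, failures are assumed to surface as `OSError`, and the server names are the two mentioned in this post.

```python
SERVERS = ["atlasfrontier-ai.cern.ch", "ccfrontier.in2p3.fr"]


def fetch_with_failover(fetch, servers, start=0):
    """Try each server once, beginning at index `start`.

    Returns (result, index_of_server_that_answered).
    Raises RuntimeError if every server fails.
    """
    errors = []
    for offset in range(len(servers)):
        index = (start + offset) % len(servers)
        try:
            return fetch(servers[index]), index
        except OSError as exc:
            errors.append(f"{servers[index]}: {exc}")
    raise RuntimeError("all servers failed: " + "; ".join(errors))


if __name__ == "__main__":
    # Simulated outage pattern: the primary is down first, then the failover.
    down = {"atlasfrontier-ai.cern.ch"}

    def fake_fetch(server):
        if server in down:
            raise OSError("simulated failure")
        return f"data from {server}"

    current = 0
    result, current = fetch_with_failover(fake_fetch, SERVERS, current)
    print(result)  # answered by the failover server

    down = {"ccfrontier.in2p3.fr"}
    result, current = fetch_with_failover(fake_fetch, SERVERS, current)
    print(result)  # switched back to the primary
```

With this pattern a task only fails outright when both servers are unavailable for the same request, which would fit a problem located behind both of them.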
David Cameron
Project administrator
Project developer
Project scientist

Joined: 13 May 14
Posts: 321
Credit: 10,712,129
RAC: 6,040
Message 41892 - Posted: 12 Mar 2020, 8:05:09 UTC - in response to Message 41876.  

There have been some stress tests of the Frontier infrastructure by ATLAS in the last couple of days. These tested whether the single Frontier server at CERN can handle the whole ATLAS load, since this might be the setup that we use in the future. Did it cause any tasks to fail, or were the connections automatically retried?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,860,504
RAC: 81,103
Message 41893 - Posted: 12 Mar 2020, 8:36:56 UTC - in response to Message 41892.  
Last modified: 12 Mar 2020, 8:46:52 UTC

Within the last 2 days roughly 1% of my ATLAS tasks failed and I suspect automatic switching back and forth between atlasfrontier-ai.cern.ch and ccfrontier.in2p3.fr played a role.
Last request to ccfrontier.in2p3.fr was 2020-03-11 19:19 UTC.

Last invalid task was reported 2020-03-11 10:02 UTC:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=267098614
Since then all ATLAS native tasks reported valid results.

<edit>
The percentage of invalids might be higher on clients that don't use a local proxy.
Since Frontier usually sends "max-age=3000" all requests within 3000 s after a successful refresh will be served from the local cache.
</edit>
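
The effect of "max-age=3000" can be illustrated with a simplified freshness check. This is only a sketch of the general HTTP caching rule, assuming the header value quoted above; the function names are illustrative, and real proxies also take other headers (for example `Age`) into account.

```python
import re


def max_age_seconds(cache_control):
    """Extract max-age (in seconds) from a Cache-Control value, or None."""
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else None


def is_fresh(stored_at, now, cache_control):
    """Return True if a response stored at `stored_at` is still fresh at `now`.

    Both times are seconds on the same clock (for example time.time()).
    """
    max_age = max_age_seconds(cache_control)
    if max_age is None:
        return False
    return (now - stored_at) < max_age


if __name__ == "__main__":
    header = "max-age=3000"
    print(is_fresh(stored_at=0, now=2999, cache_control=header))  # True
    print(is_fresh(stored_at=0, now=3000, cache_control=header))  # False
```

As long as `is_fresh` holds, a local proxy can answer from its cache without contacting either Frontier server, which is why a short upstream outage may go unnoticed on such clients.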
maeax

Joined: 2 May 07
Posts: 964
Credit: 34,069,930
RAC: 8,906
Message 41895 - Posted: 12 Mar 2020, 9:09:14 UTC - in response to Message 41893.  
Last modified: 12 Mar 2020, 9:35:22 UTC


The percentage of invalids might be higher on clients that don't use a local proxy.

I have no local proxy. ATLAS ran without an error task yesterday (native and Windows).
Only two failed in the morning because of the stop between 7 and 9 UTC.

Edit:
http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch DIRECT 1
This is from an ATLAS task under Windows. Would it be possible to use openhtc.io here?



©2020 CERN