ATLAS switched to failover frontier server
Author | Message |
---|---|
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,894,316 RAC: 138,096 |
Since this morning I have noticed a few thousand requests to ccfrontier.in2p3.fr, which usually only occur if something is wrong with lcgft-atlas.gridpp.rl.ac.uk. |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
There is a problem with that server caused by overloading in the last few days (not from LHC@Home but other tasks). Experts are investigating. But in the meantime the failover should work ok. |
Send message Joined: 7 Aug 11 Posts: 62 Credit: 21,010,850 RAC: 8,924 |
Yeah, not so much. I've had to abort several work unit transfers as they just keep slowing down till they stop and have to be repeatedly restarted. When it happens BOINC blocks the project and won't attempt any more transfers or request any more work until the download completes. |
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,894,316 RAC: 138,096 |
"Yeah, not so much. I've had to abort several work unit transfers as they just keep slowing down till they stop and have to be repeatedly restarted. When it happens BOINC blocks the project and won't attempt any more transfers or request any more work until the download completes." What do you mean by "work unit transfers"? Downloads of EVNT files or uploads of HITS files? Those transfers never go via lcgft-atlas.gridpp.rl.ac.uk or ccfrontier.in2p3.fr. As long as your firewall allows outgoing traffic to destination ports TCP 3128 (-> lcgft) and TCP 23128 (-> ccfrontier) you should not see any difference in speed. Your problems might have another cause. Further help is not possible as long as your computers are hidden. |
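To check whether outgoing connections to the two Frontier ports named above are blocked, a plain TCP connect test is usually enough. The following is a minimal sketch, not a project tool: the helper name `can_connect` and the 5 second default timeout are assumptions, and a successful connect only shows that the port is reachable, not that Frontier requests will succeed.

```python
import socket

# Hosts and ports as named in the post above.
FRONTIER_ENDPOINTS = [
    ("lcgft-atlas.gridpp.rl.ac.uk", 3128),
    ("ccfrontier.in2p3.fr", 23128),
]


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

Calling `can_connect(host, port)` for each entry in `FRONTIER_ENDPOINTS` from the affected machine shows which of the two servers, if any, cannot be reached.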
Send message Joined: 7 Aug 11 Posts: 62 Credit: 21,010,850 RAC: 8,924 |
"Yeah, not so much. I've had to abort several work unit transfers as they just keep slowing down till they stop and have to be repeatedly restarted. When it happens BOINC blocks the project and won't attempt any more transfers or request any more work until the download completes." That thing where work units are transferred from the CERN servers to the client computers. In the BOINC client they show up under the "transfers" tab. I specifically mentioned downloads. I have no firewall restrictions at this time. I currently have two Atlas VM units that have been progressing extremely slowly for the last two days. Since they immediately stop when I suspend networking in the client, and given the repeated download failures of other work units, I am suspicious of the project servers. I have had no trouble with any other project I have tried. The tone of your reply indicates you have no intention of helping, so I am just going to abort them and let someone else deal with you in the future. |
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,894,316 RAC: 138,096 |
"I currently have two Atlas VM units ... they immediately stop when I suspend networking in the client" This is the reason why you don't see any progress. LHC VMs need permanent network access. "The tone of your reply ..." ... is always friendly as long as yours is friendly. "... and let someone else deal with you in the future." Agreed. |
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,894,316 RAC: 138,096 |
My ATLAS tasks currently have problems getting data from atlasfrontier-ai.cern.ch and are using ccfrontier.in2p3.fr as failover. Unfortunately the failover server also has delivery problems for some requests: [20/Feb/2020:14:08:03 +0100] "GET http://ccfrontier.in2p3.fr:23128/ccin2p3-AtlasFrontier/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNpVjjsOwzAMQ6.izUuukEGx1NqALAeyAiOT73.L5tOgKKdHEBLZWDg6xCpb0ZFpelCx8AS6ieAiBxE6Dt-XB4X17elrVuOYW656nCe0-7BFFIZLL6sFUGQ4LuMuanfSExtD7co2B3TBRhQAlcDP-mvOHHohH.fTANWIDZb9N-0Dr1I2BA__ HTTP/1.0" 0 464 "-" "-" TCP_MISS_ABORTED:HIER_DIRECT . . . [20/Feb/2020:14:08:55 +0100] "GET http://ccfrontier.in2p3.fr:23128/ccin2p3-AtlasFrontier/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RcPYP9QvRCHdxdYl3cQxxjPd00VRwDFZQCg71DXJ1VlJwC-L3VVByDPFxDHZxUdJTgqs0CnF0V1JA54d7uAa5KqAI6oF5QAbQbAVbdUMDS0tLE3UAVNcg1g__ HTTP/1.0" 0 416 "-" "-" TCP_MISS_ABORTED:HIER_DIRECT |
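To see how often requests like the ones quoted above are aborted, the proxy log can be tallied per server and result code. This is a rough sketch: the regular expression is inferred only from the two log lines in the post, and the names `LOG_LINE` and `count_results` are hypothetical. A proxy configured with a different log format would need a different pattern.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Pattern inferred from the two log lines quoted above; other proxy
# log formats will need a different expression.
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\]\s+"(?P<method>\S+)\s+(?P<url>\S+)\s+[^"]*"\s+'
    r'(?P<status>\d+)\s+(?P<size>\d+)\s+"[^"]*"\s+"[^"]*"\s+'
    r'(?P<result>[A-Z_]+):(?P<hierarchy>[A-Z_]+)'
)


def count_results(lines):
    """Count (host, result code) pairs, e.g. aborted requests per server."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        host = urlsplit(match.group("url")).hostname
        counts[(host, match.group("result"))] += 1
    return counts
```

Passing an open log file to `count_results` returns a `Counter` keyed by host and result code, so a rising count for `("ccfrontier.in2p3.fr", "TCP_MISS_ABORTED")` would point at the failover server rather than the local proxy.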
Send message Joined: 15 Jul 05 Posts: 242 Credit: 5,800,306 RAC: 0 |
We have issues with storage used for a number of CERN computing services today. The BOINC servers are not affected, but CVMFS and other layers used by the ATLAS (and CMS) application are. Hence there are no new ATLAS tasks generated now. Hopefully things should be back to normal soon. |
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,894,316 RAC: 138,096 |
Thanks for explaining. Will keep my fingers crossed. |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
Unfortunately the issues Nils mentioned coincide with an ongoing planned ATLAS Frontier test where all Frontier servers are redirecting to the CERN one, so if the CERN one is not accessible then nothing will work. In addition the ATLAS services which submit WU to BOINC are affected so no new tasks are being created. |
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,894,316 RAC: 138,096 |
Since this morning I have noticed failing Frontier requests that cause ATLAS tasks to use the failover Frontier server ccfrontier.in2p3.fr. Unfortunately this server also fails occasionally, which causes the ATLAS tasks to switch back to atlasfrontier-ai.cern.ch. This makes me guess the real problem is hidden deeper in the CERN infrastructure. |
Send message Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0 |
There have been some stress tests of the Frontier infrastructure by ATLAS in the last couple of days. These tested whether the single Frontier server at CERN can handle the whole ATLAS load, since this might be the setup that we use in the future. Did it cause any tasks to fail, or were the connections automatically retried? |
Send message Joined: 15 Jun 08 Posts: 2386 Credit: 222,894,316 RAC: 138,096 |
Within the last 2 days roughly 1% of my ATLAS tasks failed and I suspect automatic switching back and forth between atlasfrontier-ai.cern.ch and ccfrontier.in2p3.fr played a role. Last request to ccfrontier.in2p3.fr was 2020-03-11 19:19 UTC. Last invalid task was reported 2020-03-11 10:02 UTC: https://lhcathome.cern.ch/lhcathome/result.php?resultid=267098614 Since then all ATLAS native tasks reported valid results. <edit> The percentage of invalids might be higher on clients that don't use a local proxy. Since Frontier usually sends "max-age=3000" all requests within 3000 s after a successful refresh will be served from the local cache. </edit> |
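The effect of "max-age=3000" on a local proxy can be illustrated with a simplified freshness check. This is only a sketch of the idea: `is_fresh` and `MAX_AGE_SECONDS` are hypothetical names, and a real caching proxy computes the age of a response from more inputs than a single timestamp.

```python
import time

MAX_AGE_SECONDS = 3000  # value quoted in the post above (50 minutes)


def is_fresh(stored_at, now=None, max_age=MAX_AGE_SECONDS):
    """Return True if a response cached at stored_at is still within max-age."""
    if now is None:
        now = time.time()
    return (now - stored_at) < max_age
```

While `is_fresh` is true the proxy can answer from its cache without contacting a Frontier server, which is why a short server outage may go unnoticed on clients behind a local proxy.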
Send message Joined: 2 May 07 Posts: 2071 Credit: 156,084,284 RAC: 104,872 |
I have no local proxy, and ATLAS ran without a single failed task yesterday (native and Windows). Only two failed in the morning because of the stop between 7 and 9 UTC. Edit: http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch DIRECT 1 This is from an ATLAS task under Windows. Is it possible to use openhtc.io there? |
©2024 CERN