Thread 'ATLAS switched to failover frontier server'

Author	Message
computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,359,675 RAC: 100,042
Message 39219 - Posted: 28 Jun 2019, 8:27:52 UTC
Since this morning I have noticed a few thousand requests to ccfrontier.in2p3.fr, which usually only occur if something is wrong with lcgft-atlas.gridpp.rl.ac.uk.
ID: 39219

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0
Message 39244 - Posted: 2 Jul 2019, 14:29:38 UTC - in response to Message 39219.
There is a problem with that server caused by overloading in the last few days (not from LHC@Home but other tasks). Experts are investigating. But in the meantime the failover should work OK.
ID: 39244

Dark Angel Joined: 7 Aug 11 Posts: 122 Credit: 34,110,295 RAC: 13,647
Message 39289 - Posted: 5 Jul 2019, 22:07:46 UTC
Yeah, not so much. I've had to abort several work unit transfers as they just keep slowing down until they stop and have to be repeatedly restarted. When it happens, BOINC blocks the project and won't attempt any more transfers or request any more work until the download completes.
ID: 39289

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,359,675 RAC: 100,042
Message 39290 - Posted: 6 Jul 2019, 4:20:46 UTC - in response to Message 39289.
[quote]Yeah, not so much. I've had to abort several work unit transfers as they just keep slowing down until they stop and have to be repeatedly restarted. When it happens, BOINC blocks the project and won't attempt any more transfers or request any more work until the download completes.[/quote]
What do you mean by "work unit transfers"? Download of EVNT files or upload of HITS files? Those transfers never go via lcgft-atlas.gridpp.rl.ac.uk or ccfrontier.in2p3.fr. As long as your firewall allows outgoing traffic to destination ports TCP 3128 (-> lcgft) and TCP 23128 (-> ccfrontier), you should not see any difference in speed. Your problems might have another cause. Further help is not possible as long as your computers are hidden.
ID: 39290
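The firewall condition mentioned above can be checked with a plain outgoing TCP connect. The following is only a minimal sketch: the host names and ports are the ones named in the post, while `can_connect` and `check_endpoints` are hypothetical helper names made up for this illustration. A successful connect shows that the port is not blocked; it says nothing about whether the Frontier service behind it is healthy.

```python
import socket

# Hosts and ports as named in the post above.
FRONTIER_ENDPOINTS = [
    ("lcgft-atlas.gridpp.rl.ac.uk", 3128),
    ("ccfrontier.in2p3.fr", 23128),
]


def can_connect(host, port, timeout=5.0):
    """Return True if an outgoing TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_endpoints(endpoints, timeout=5.0):
    """Map each 'host:port' string to True (reachable) or False."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in endpoints
    }
```

Calling `check_endpoints(FRONTIER_ENDPOINTS)` on the affected machine should return `True` for both entries if outgoing traffic to those ports is allowed.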

Dark Angel Joined: 7 Aug 11 Posts: 122 Credit: 34,110,295 RAC: 13,647
Message 39311 - Posted: 7 Jul 2019, 10:10:27 UTC - in response to Message 39290.
[quote][quote]Yeah, not so much. I've had to abort several work unit transfers as they just keep slowing down until they stop and have to be repeatedly restarted. When it happens, BOINC blocks the project and won't attempt any more transfers or request any more work until the download completes.[/quote]
What do you mean by "work unit transfers"? Download of EVNT files or upload of HITS files? Those transfers never go via lcgft-atlas.gridpp.rl.ac.uk or ccfrontier.in2p3.fr. As long as your firewall allows outgoing traffic to destination ports TCP 3128 (-> lcgft) and TCP 23128 (-> ccfrontier), you should not see any difference in speed. Your problems might have another cause. Further help is not possible as long as your computers are hidden.[/quote]
That thing where work units are transferred from the CERN servers to the client computers. In the BOINC client they show up under the "transfers" tab. I specifically mentioned downloads. I have no firewall restrictions at this time. I currently have two ATLAS VM units that have been progressing extremely slowly for the last two days. Since they immediately stop when I suspend networking in the client, and given the repeated download failures of other work units, I am suspicious of the project servers. I have had no trouble with any other project I have tried. The tone of your reply indicates you have no intention of helping, so I am just going to abort them and let someone else deal with you in the future.
ID: 39311

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,359,675 RAC: 100,042
Message 39312 - Posted: 7 Jul 2019, 10:58:36 UTC - in response to Message 39311.
[quote]I currently have two ATLAS VM units ... they immediately stop when I suspend networking in the client[/quote]
This is the reason why you don't see any progress. LHC VMs need permanent network access.
[quote]The tone of your reply ...[/quote]
... is always friendly as long as yours is friendly.
[quote]... and let someone else deal with you in the future.[/quote]
Agreed.
ID: 39312

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,359,675 RAC: 100,042
Message 41657 - Posted: 20 Feb 2020, 13:21:47 UTC
ATLAS tasks currently have problems getting data from atlasfrontier-ai.cern.ch and are using ccfrontier.in2p3.fr as failover. Unfortunately the failover server also has delivery problems for some requests:
[pre]
[20/Feb/2020:14:08:03 +0100] "GET http://ccfrontier.in2p3.fr:23128/ccin2p3-AtlasFrontier/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNpVjjsOwzAMQ6.izUuukEGx1NqALAeyAiOT73.L5tOgKKdHEBLZWDg6xCpb0ZFpelCx8AS6ieAiBxE6Dt-XB4X17elrVuOYW656nCe0-7BFFIZLL6sFUGQ4LuMuanfSExtD7co2B3TBRhQAlcDP-mvOHHohH.fTANWIDZb9N-0Dr1I2BA__ HTTP/1.0" 0 464 "-" "-" TCP_MISS_ABORTED:HIER_DIRECT
. . .
[20/Feb/2020:14:08:55 +0100] "GET http://ccfrontier.in2p3.fr:23128/ccin2p3-AtlasFrontier/Frontier/type=frontier_request:1:DEFAULT&encoding=BLOBzip5&p1=eNoLdvVxdQ5RcPYP9QvRCHdxdYl3cQxxjPd00VRwDFZQCg71DXJ1VlJwC-L3VVByDPFxDHZxUdJTgqs0CnF0V1JA54d7uAa5KqAI6oF5QAbQbAVbdUMDS0tLE3UAVNcg1g__ HTTP/1.0" 0 416 "-" "-" TCP_MISS_ABORTED:HIER_DIRECT
[/pre]
ID: 41657
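To see how often such aborted requests occur, proxy log lines in the layout shown above can be tallied per upstream host and result code. This is a rough sketch that assumes every line follows exactly that layout (timestamp in brackets, quoted request, two numeric fields, two quoted fields, then `RESULT:HIERARCHY`). The regular expression and the function name `count_results` are illustrative assumptions and not part of any Squid or Frontier tool.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches lines laid out like the two log entries quoted in the post above.
LOG_LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d+) (?P<size>\d+) .*? '
    r'(?P<result>[A-Z_]+):(?P<hier>[A-Z_]+)\s*$'
)


def count_results(lines):
    """Count (upstream host, result code) pairs; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue
        host = urlsplit(match.group("url")).hostname
        counts[(host, match.group("result"))] += 1
    return counts
```

Feeding the access log through `count_results` and looking at the entries with result `TCP_MISS_ABORTED` gives a quick per-server picture of the delivery problems.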

Nils Volunteer moderator Project administrator Project developer Project tester Joined: 15 Jul 05 Posts: 254 Credit: 6,001,083 RAC: 0
Message 41658 - Posted: 20 Feb 2020, 13:35:58 UTC - in response to Message 41657.
We have issues with storage used for a number of CERN computing services today. The BOINC servers are not affected, but CVMFS and other layers used by the ATLAS (and CMS) application are. Hence there are no new ATLAS tasks generated now. Hopefully things should be back to normal soon.
ID: 41658

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,359,675 RAC: 100,042
Message 41660 - Posted: 20 Feb 2020, 13:44:49 UTC - in response to Message 41658.
Thanks for explaining. Will keep my fingers crossed.
ID: 41660

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0
Message 41663 - Posted: 20 Feb 2020, 14:32:59 UTC - in response to Message 41660.
Unfortunately the issues Nils mentioned coincide with an ongoing planned ATLAS Frontier test in which all Frontier servers redirect to the CERN one, so if the CERN one is not accessible then nothing will work. In addition, the ATLAS services which submit WUs to BOINC are affected, so no new tasks are being created.
ID: 41663

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,359,675 RAC: 100,042
Message 41876 - Posted: 10 Mar 2020, 21:43:04 UTC
Since this morning I have noticed failing Frontier requests that cause ATLAS tasks to use the failover Frontier server ccfrontier.in2p3.fr. Unfortunately this server also fails occasionally, which causes the ATLAS tasks to switch back to atlasfrontier-ai.cern.ch. This makes me suspect the real problem lies deeper in the CERN infrastructure.
ID: 41876
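The switching back and forth described above is what one would expect from any client that walks an ordered server list and moves on after a failure. The sketch below illustrates that general pattern only. It is not the Frontier client's actual implementation, and `fetch_with_failover` and its `fetch` callback are hypothetical names invented for this example.

```python
def fetch_with_failover(servers, fetch, start=0, max_attempts=None):
    """Try servers in round-robin order, beginning at index `start`.

    `fetch` is any callable taking a server name and raising OSError on failure.
    Returns (result, index of the server that answered).
    Raises RuntimeError if every attempt fails.
    """
    if not servers:
        raise ValueError("no servers configured")
    attempts = max_attempts if max_attempts is not None else len(servers)
    last_error = None
    for i in range(attempts):
        index = (start + i) % len(servers)
        try:
            return fetch(servers[index]), index
        except OSError as exc:
            # Remember the failure and move on to the next server in the list.
            last_error = exc
    raise RuntimeError(f"all {attempts} attempts failed") from last_error
```

If the caller stores the returned index and passes it as `start` for the next request, a failure on the failover server sends the client straight back to the primary, which matches the behaviour observed in the post.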

David Cameron Project administrator Project developer Project scientist Joined: 13 May 14 Posts: 387 Credit: 15,314,184 RAC: 0
Message 41892 - Posted: 12 Mar 2020, 8:05:09 UTC - in response to Message 41876.
There have been some stress tests of the Frontier infrastructure by ATLAS in the last couple of days. These were tests of whether the single Frontier at CERN can handle the whole ATLAS load, since this might be the setup that we use in the future. Did it cause any tasks to fail, or were the connections automatically retried?
ID: 41892

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,359,675 RAC: 100,042
Message 41893 - Posted: 12 Mar 2020, 8:36:56 UTC - in response to Message 41892. Last modified: 12 Mar 2020, 8:46:52 UTC
Within the last 2 days roughly 1% of my ATLAS tasks failed, and I suspect automatic switching back and forth between atlasfrontier-ai.cern.ch and ccfrontier.in2p3.fr played a role.
The last request to ccfrontier.in2p3.fr was at 2020-03-11 19:19 UTC.
The last invalid task was reported at 2020-03-11 10:02 UTC: https://lhcathome.cern.ch/lhcathome/result.php?resultid=267098614
Since then all ATLAS native tasks have reported valid results.
<edit> The percentage of invalids might be higher on clients that don't use a local proxy. Since Frontier usually sends "max-age=3000", all requests within 3000 s after a successful refresh will be served from the local cache. </edit>
ID: 41893
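The caching effect mentioned in the edit can be made concrete with a little arithmetic. The sketch below assumes a simplified cache that treats an object as fresh for exactly `max-age` seconds after the last successful refresh and ignores revalidation and other Cache-Control directives. `parse_max_age` and `refresh_times` are hypothetical helper names for this illustration, not part of Squid or Frontier.

```python
def parse_max_age(cache_control):
    """Extract the max-age value (in seconds) from a Cache-Control header string."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def refresh_times(request_times, max_age):
    """Return the request times (in seconds) that have to go upstream.

    Assumes every upstream refresh succeeds; all other requests are
    answered from the local cache.
    """
    refreshes = []
    last_refresh = None
    for t in sorted(request_times):
        if last_refresh is None or t - last_refresh >= max_age:
            refreshes.append(t)
            last_refresh = t
    return refreshes
```

With `max_age = parse_max_age("max-age=3000")`, identical requests at 0, 100, 2999, 3000, 3500 and 6000 seconds reach the upstream server only at 0, 3000 and 6000. A client with a local proxy is therefore exposed to an upstream outage far less often than one that sends every request directly.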

maeax Joined: 2 May 07 Posts: 2302 Credit: 179,708,200 RAC: 30,147
Message 41895 - Posted: 12 Mar 2020, 9:09:14 UTC - in response to Message 41893. Last modified: 12 Mar 2020, 9:35:22 UTC
[quote]The percentage of invalids might be higher on clients that don't use a local proxy.[/quote]
I have no local proxy, and ATLAS ran without a single failed task yesterday (native and Windows). Only two failed in the morning because of the stop between 7 and 9 UTC.
Edit: http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch DIRECT 1
This is from an ATLAS task under Windows. Is it possible to use openhtc.io there?
ID: 41895