Message boards : CMS Application : Network issue?
Message board moderation


ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 46244 - Posted: 14 Feb 2022, 10:07:04 UTC

We have been seeing a lot of job failures (around 50%) in the last few hours, where jobs are unable to contact the Frontier database servers. I presume this is a problem at CERN, but a quick search doesn't show anything relevant on the service-availability page, and there's nothing in my mailbox. I'll investigate more thoroughly once I get in to work.
ID: 46244 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,908,076
RAC: 137,988
Message 46245 - Posted: 14 Feb 2022, 10:52:17 UTC - in response to Message 46244.  

I'm currently running >40 CMS tasks concurrently and had >200000 requests to cms-frontier.openhtc.io since 23:00 UTC.
Roughly 5000 requests were made to refresh the local Squid cache contents, and so far none of them has failed or produced unusual logfile records.
ID: 46245 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 46246 - Posted: 14 Feb 2022, 12:22:57 UTC

Failure rate is falling. It seems to have been cured, but I've yet to find any evidence of what the problem was.
ID: 46246 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 46248 - Posted: 14 Feb 2022, 19:27:45 UTC
Last modified: 14 Feb 2022, 19:44:42 UTC

Ah, it might not be as general as I thought. A few machines dominate the count of 8002 "fatal exception" errors, which indicate a failure to contact any of the Frontier servers, so it doesn't seem to be a systemic failure. Here are the counts of 8002 failures in the last 24 hours, tallied per machine. I don't think I can give the actual machine IDs, due to privacy laws in Europe.

Machine 1 1
Machine 2 1
Machine 3 2
Machine 4 3
Machine 5 3
Machine 6 5
Machine 7 8
Machine 8 12
Machine 9 33
Machine 10 38
Machine 11 40
Machine 12 55
Machine 13 102
Machine 14 103
Machine 15 128
Machine 16 138
Machine 17 148
Machine 18 160
Machine 19 275
Machine 20 466


...and sorted by user-ID:

User 1 1
User 2 1
User 3 2
User 4 3
User 5 3
User 6 8
User 7 12
User 8 33
User 9 40
User 10 43
User 11 55
User 12 102
User 13 128
User 14 138
User 15 148
User 16 160
User 17 275
User 18 569

ID: 46248 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 798
Credit: 644,701,943
RAC: 234,806
Message 46249 - Posted: 14 Feb 2022, 21:01:32 UTC

How can I see my own jobs? On Theory you could reverse-engineer it on mcplots, but on Grafana I'm not sure.

The BOINC jobs are fine of course.
ID: 46249 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 46250 - Posted: 14 Feb 2022, 21:26:17 UTC - in response to Message 46249.  

How can I see my own jobs? On Theory you could reverse-engineer it on mcplots, but on Grafana I'm not sure.

The BOINC jobs are fine of course.

Toby, I think you have CERN credentials, am I right?
The web page I got these data from is
https://monit-grafana.cern.ch/d/GybUsU6Gz/wmarchive-monit-copy?orgId=11&from=now-24h&to=now&var-exitCodes=All&var-wn_name=All&var-campaign=All&var-jobtype=All&var-host=vocms0267.cern.ch&var-site=T3_CH_CMSAtHome&var-site=T3_CH_Volunteer&var-jobstate=All
The "Exit codes - WN Name" panel lists exit codes (errors only!) against worker-node name, which is made up of {user-ID}-{machine-ID}-{random}, where {random} varies for each VM instantiation (i.e. each BOINC task). The Grafana interface seems infinitely configurable, so you might be able to come up with a config that lists your machines. But please don't save any of your explorations to that particular URL!
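As a rough illustration (the sample worker-node names and the splitting logic here are hypothetical, assuming only the {user-ID}-{machine-ID}-{random} layout just described), a per-machine tally like the ones I posted earlier could be produced with:

```python
from collections import Counter

def tally_8002_by_machine(records):
    """Count 8002 failures per (user, machine), assuming worker-node
    names follow the {user-ID}-{machine-ID}-{random} layout."""
    counts = Counter()
    for wn_name, exit_code in records:
        user_id, machine_id, _random = wn_name.split("-", 2)
        if exit_code == 8002:  # Frontier-contact failures only
            counts[(user_id, machine_id)] += 1
    return counts

# Hypothetical sample records: (worker-node name, exit code)
sample = [
    ("u17-m20-abc123", 8002),
    ("u17-m20-def456", 8002),
    ("u03-m04-xyz789", 60307),
]
counts = tally_8002_by_machine(sample)
```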
Email me if you want to go further; by this time I doubt that I'm anonymous...
ID: 46250 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 46262 - Posted: 17 Feb 2022, 9:51:19 UTC

This problem disappeared around lunchtime yesterday. I still have no idea why it occurred.
ID: 46262 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 798
Credit: 644,701,943
RAC: 234,806
Message 46280 - Posted: 18 Feb 2022, 19:23:20 UTC - in response to Message 46250.  
Last modified: 18 Feb 2022, 19:48:07 UTC

Great, thanks, I'll look around.

360 jobs for my computers, 12 failures (3%).

47% were 60307 and 18% were 8016; 134, 50115 and 99999 were 12% each.

Which of these could I maybe bring down?
ID: 46280 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 46327 - Posted: 23 Feb 2022, 15:35:03 UTC - in response to Message 46280.  
Last modified: 23 Feb 2022, 15:40:22 UTC

Great, thanks, I'll look around.

360 jobs for my computers, 12 failures (3%).

47% were 60307 and 18% were 8016; 134, 50115 and 99999 were 12% each.

Which of these could I maybe bring down?

Looking at https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobExitCodes:
60307 is Log Archive Failure, so that's a failure to transfer the log file /srv/job/WMTaskSpace/logArch1/logArchive.tar.gz to the data-bridge at https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/. I don't see any explicit reason in the logs I can access, so I'd have to put it down as a transient network problem. Not much you can do about that except make sure your connections aren't bandwidth-limited.
8016 is a fatal exception which appears to arise from something going wrong in track propagation in cmsRun:
An exception of category 'EventCorruption' occurred while
   [0] Processing  Event run: 1 lumi: 345105 event: 172552245 stream: 0
   [1] Running path 'FEVTDEBUGoutput_step'
   [2] Prefetching for module PoolOutputModule/'FEVTDEBUGoutput'
   [3] Calling method for module OscarMTProducer/'g4SimHits'
Exception Message:
SimG4CoreApplication exception in generation of event run: 1 lumi: 345105 event: 172552245 in stream 0 

-------- EEEE ------- G4Exception-START -------- EEEE -------
*** G4Exception : GeomNav0003
      issued by : G4Navigator::ComputeStep()
Stuck Track: potential geometry or navigation problem.
        Track stuck, not moving for 25 steps
        in volume -BeamTube11- at point (1.11409,-51.9838,3491.05)
        direction: (0.604979,-0.694067,0.39022).
-------- EEEE -------- G4Exception-END --------- EEEE -------
So there's probably nothing you can do about that!
If I recall correctly, the other three usually occur together (a job can return more than one exit code). 134 is Unix "fatal error signal 6", i.e. SIGABRT; that cascades down to cmsRun code 50115, "cmsRun did not produce a valid job report at runtime (often means cmsRun segfaulted)", and then further down to 99999, which appears to be a catch-all for "something went wrong but I'm not sure what". Again, that looks like a calculation problem.
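For reference, the 134 → SIGABRT relation is just the usual shell convention that a process killed by signal N exits with status 128 + N. A minimal sketch (signal numbers as on Linux):

```python
import signal

def describe_exit(status: int) -> str:
    """Decode a shell-style exit status: values above 128 mean
    the process died on signal (status - 128)."""
    if status > 128:
        sig = signal.Signals(status - 128)
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited normally with code {status}"

print(describe_exit(134))  # SIGABRT is signal 6, so 128 + 6 = 134
```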
ID: 46327 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 798
Credit: 644,701,943
RAC: 234,806
Message 46333 - Posted: 24 Feb 2022, 7:48:37 UTC - in response to Message 46327.  
Last modified: 24 Feb 2022, 7:56:29 UTC

Thanks for looking.

I'm not sure what I can do about the network. I have the fastest connection available to me, up to 10 Gbit/s, but the link from the Squid server in my house is only 1 Gbit/s, as are most of the computers, so I could upgrade the NICs so that everything runs at 10 Gbit/s all the way through.

It seems to be a sort of regular pattern.
ID: 46333 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,087,709
RAC: 104,082
Message 46334 - Posted: 24 Feb 2022, 8:22:11 UTC - in response to Message 46333.  

My experience with Squid is to build it on the fastest PC, in a CentOS 8 VM (following the Red Hat installation guide).
All PCs are running at 1 Gbit/s.
32 CMS tasks on Win11 Pro and Win10 Pro run at the same time without errors.
ID: 46334 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,908,076
RAC: 137,988
Message 46335 - Posted: 24 Feb 2022, 9:09:37 UTC - in response to Message 46333.  

Toby's Grafana statistics currently show 12 failures with error code 60307.
This page gives a more detailed explanation:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#Exit_code_60307
Especially:
"if you are using EOS at cern: the problem is due to the restart of the SRM service making all connections fail. The only action in this case is to keep trying."

I would suspect that one (of roughly 10) EOS target servers is unavailable during the upload attempt.
In that case the task retries after a few minutes, and since the target system is selected through DNS, it is very likely the retry gets a different (hopefully responding) target server.
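A minimal sketch of that retry behaviour (the function and callback names are hypothetical, not the actual job wrapper): re-resolving the DNS alias on each attempt lets a retry land on a different member of the server pool.

```python
import socket
import time

def upload_with_retry(hostname, attempt_upload, retries=3, delay=1.0):
    """Re-resolve the DNS alias before every attempt, so each retry
    may pick a different (hopefully responding) target server."""
    for attempt in range(retries):
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
        target = addresses[attempt % len(addresses)]
        if attempt_upload(target):
            return target
        time.sleep(delay)  # the real task waits a few minutes
    raise RuntimeError(f"all {retries} attempts to {hostname} failed")
```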


I don't know whether the first failed attempt counts as error in the statistics.
Ivan may know the experts who can clarify this.
ID: 46335 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 46350 - Posted: 24 Feb 2022, 16:10:46 UTC - in response to Message 46335.  

Toby's Grafana statistics currently show 12 failures with error code 60307.
This page gives a more detailed explanation:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#Exit_code_60307
Especially:
"if you are using EOS at cern: the problem is due to the restart of the SRM service making all connections fail. The only action in this case is to keep trying."

I would suspect that one (of roughly 10) EOS target servers is unavailable during the upload attempt.
In that case the task retries after a few minutes, and since the target system is selected through DNS, it is very likely the retry gets a different (hopefully responding) target server.


I don't know whether the first failed attempt counts as error in the statistics.
Ivan may know the experts who can clarify this.

I do see dual reports of failure in 60307 jobs, but they are not spaced in time:
Message: Command exited non-zero, ExitCode:5
Output: stdout: Thu Feb 24 00:21:31 EST 2022
Copying 331895 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-
...
Message: Command exited non-zero, ExitCode:5
Output: stdout: Thu Feb 24 00:21:31 EST 2022
Copying 331895 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-

ID: 46350 · Report as offensive     Reply Quote
Toby Broom
Volunteer moderator

Send message
Joined: 27 Sep 08
Posts: 798
Credit: 644,701,943
RAC: 234,806
Message 46352 - Posted: 24 Feb 2022, 17:13:51 UTC - in response to Message 46334.  

I'm not sure what the fastest is, but it's running on this computer, a Xeon Platinum 8249C.

It's running WUs at the same time though, so there is some performance loss there.
ID: 46352 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,087,709
RAC: 104,082
Message 46353 - Posted: 24 Feb 2022, 19:18:06 UTC

eoscms shows this error for me:
root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/TESTBED/ireid_TC_SLC7_IDR_CMS_Home_220223_155724_7756/SinglePiE50HCAL_pythia8_2018_GenSimFull/vocms0267.cern.ch-150443-0-log.tar.gz
ErrorCode : 60311
ErrorType : GeneralStageOutFailure

Two hours now with no new job inside the task.
ID: 46353 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 46461 - Posted: 18 Mar 2022, 8:17:15 UTC

At this point I could just cut-`n'-paste the original message in the thread, verbatim...
ID: 46461 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 46463 - Posted: 18 Mar 2022, 10:45:00 UTC - in response to Message 46461.  

At this point I could just cut-`n'-paste the original message in the thread, verbatim...

It seems to be recovering now. I haven't found anything on the message boards or in the service interruption logs, so the source is still a mystery.
ID: 46463 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 47222 - Posted: 4 Sep 2022, 15:48:37 UTC

There seems to be a general problem with CMS jobs at the moment (if you click on the backarrow on one of the job graphs -- not your browser's backarrow -- you can select to see data from all CMS sites, not just ours). There were a few hours of reporting no jobs running; now that is more normal but completed jobs are still showing zero. This suggests a database problem, but I've not yet seen any reports of outages.
ID: 47222 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 47225 - Posted: 5 Sep 2022, 14:00:41 UTC - in response to Message 47222.  

There seems to be a general problem with CMS jobs at the moment (if you click on the backarrow on one of the job graphs -- not your browser's backarrow -- you can select to see data from all CMS sites, not just ours). There were a few hours of reporting no jobs running; now that is more normal but completed jobs are still showing zero. This suggests a database problem, but I've not yet seen any reports of outages.

Other production sources were showing completed production jobs again, but not us. A DBS component of our WMAgent was restarted and now we are showing completed jobs again, too.
ID: 47225 · Report as offensive     Reply Quote
Ryan Munro

Send message
Joined: 17 Aug 17
Posts: 68
Credit: 5,331,841
RAC: 19,026
Message 47235 - Posted: 6 Sep 2022, 14:49:28 UTC - in response to Message 47225.  

I have 4 jobs in my queue that are just sitting at "waiting to run", even when I pause all other jobs.
ID: 47235 · Report as offensive     Reply Quote



©2024 CERN