Message boards : CMS Application : Network issue?

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 46244 - Posted: 14 Feb 2022, 10:07:04 UTC

We have had a lot of job failures (50%) over the last few hours, with jobs unable to contact the Frontier database servers. I presume this is a problem at CERN, but a quick search shows nothing relevant on the service availability page and there is nothing in my mailbox. I'll investigate more thoroughly once I get in to work.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2048
Credit: 154,038,753
RAC: 151,386
Message 46245 - Posted: 14 Feb 2022, 10:52:17 UTC - in response to Message 46244.  

I'm currently running >40 CMS tasks concurrently and have had >200000 requests to cms-frontier.openhtc.io since 23:00 UTC.
Roughly 5000 requests were made to refresh the local Squid cache contents, and so far none of them has failed or produced unusual logfile records.
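
As a rough check on what these figures imply, here is a minimal sketch that estimates the share of requests answered without an upstream refresh. It assumes that every request not counted as a refresh was served from the local Squid cache, which the post does not state explicitly, and it uses the approximate figures quoted above, so the result is only indicative.

```python
# Approximate figures quoted in the post; both are rough (">200000", "roughly 5000").
total_requests = 200_000
refresh_requests = 5_000


def local_share(total, upstream):
    """Fraction of requests that did not need an upstream refresh."""
    return (total - upstream) / total


share = local_share(total_requests, refresh_requests)
print(f"Served locally (estimate): {share:.1%}")
```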
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 46246 - Posted: 14 Feb 2022, 12:22:57 UTC

Failure rate is falling. It seems to have been cured, but I've yet to find any evidence of what the problem was.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 46248 - Posted: 14 Feb 2022, 19:27:45 UTC
Last modified: 14 Feb 2022, 19:44:42 UTC

Ah, it might not be as general as I thought. A few machines dominate the count of 8002 "fatal exception" errors, which indicate a failure to contact any of the Frontier servers, so it does not look like a systemic failure. Here are the counts of 8002 failures in the last 24 hours, tallied for individual machines. I don't think I can give the actual machine-IDs, due to privacy laws in Europe.

Machine       Failures
Machine 1            1
Machine 2            1
Machine 3            2
Machine 4            3
Machine 5            3
Machine 6            5
Machine 7            8
Machine 8           12
Machine 9           33
Machine 10          38
Machine 11          40
Machine 12          55
Machine 13         102
Machine 14         103
Machine 15         128
Machine 16         138
Machine 17         148
Machine 18         160
Machine 19         275
Machine 20         466


...and tallied by user-ID:

User          Failures
User 1               1
User 2               1
User 3               2
User 4               3
User 5               3
User 6               8
User 7              12
User 8              33
User 9              40
User 10             43
User 11             55
User 12            102
User 13            128
User 14            138
User 15            148
User 16            160
User 17            275
User 18            569
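
To put a number on how concentrated these failures are, the following sketch computes the share of all 8002 failures contributed by the largest few entries in each tally. The counts are copied from the lists above; the helper name `top_share` and the choice of the top five are illustrative only.

```python
# 8002 failure counts copied from the two tallies in the post (last 24 hours).
machine_counts = [1, 1, 2, 3, 3, 5, 8, 12, 33, 38, 40, 55,
                  102, 103, 128, 138, 148, 160, 275, 466]
user_counts = [1, 1, 2, 3, 3, 8, 12, 33, 40, 43, 55,
               102, 128, 138, 148, 160, 275, 569]


def top_share(counts, n):
    """Fraction of all failures contributed by the n largest entries."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:n]) / sum(counts)


# Both tallies should describe the same set of failures.
assert sum(machine_counts) == sum(user_counts)

print(f"Total 8002 failures: {sum(machine_counts)}")
print(f"Top 5 machines: {top_share(machine_counts, 5):.0%} of failures")
print(f"Top 5 users:    {top_share(user_counts, 5):.0%} of failures")
```

Both tallies sum to the same total (1721), which is a useful consistency check before reading anything into the distribution.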

Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 730
Credit: 504,428,790
RAC: 285,198
Message 46249 - Posted: 14 Feb 2022, 21:01:32 UTC

How can I see my own jobs? For Theory you could reverse-engineer it on mcplots, but on Grafana I'm not sure.

The BOINC jobs are fine of course.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 46250 - Posted: 14 Feb 2022, 21:26:17 UTC - in response to Message 46249.  

Toby, I think you have CERN credentials, am I right?
The web page I got these data from is
https://monit-grafana.cern.ch/d/GybUsU6Gz/wmarchive-monit-copy?orgId=11&from=now-24h&to=now&var-exitCodes=All&var-wn_name=All&var-campaign=All&var-jobtype=All&var-host=vocms0267.cern.ch&var-site=T3_CH_CMSAtHome&var-site=T3_CH_Volunteer&var-jobstate=All
The "Exit codes - WN Name" lists exit (error only!) codes against worker-node-name which is made up of {user-ID}-{machine-ID}-{random} where {random} varies for each VM instantiation (i.e. BOINC task). The grafana interface seems infinitely configurable, so you might be able to come up with a config that lists your machines. But please don't save any of your explorations to that particular URL!
Email me if you want to go further; by this time I doubt that I'm anonymous...
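
The worker-node naming scheme described above, {user-ID}-{machine-ID}-{random}, is enough to tally failures per machine and per user once the names have been exported. The sketch below shows one way to do that. It assumes the user-ID and machine-ID contain no hyphens themselves, and the sample records are invented for illustration; they are not real worker-node names.

```python
from collections import Counter


def parse_wn_name(wn_name):
    """Split '{user-ID}-{machine-ID}-{random}' into its three parts.

    Assumes the user-ID and machine-ID contain no hyphens themselves.
    """
    user_id, machine_id, random_part = wn_name.split("-", 2)
    return user_id, machine_id, random_part


def tally_failures(records, exit_code=8002):
    """Count records with the given exit code per machine and per user."""
    by_machine, by_user = Counter(), Counter()
    for wn_name, code in records:
        if code != exit_code:
            continue
        user_id, machine_id, _ = parse_wn_name(wn_name)
        by_machine[machine_id] += 1
        by_user[user_id] += 1
    return by_machine, by_user


# Hypothetical (worker-node name, exit code) pairs, invented for illustration.
sample = [
    ("101-5001-aaa", 8002),
    ("101-5001-bbb", 8002),
    ("101-5002-ccc", 8002),
    ("202-6001-ddd", 8002),
    ("202-6001-eee", 60307),
]
by_machine, by_user = tally_failures(sample)
print(by_machine.most_common())
print(by_user.most_common())
```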
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 46262 - Posted: 17 Feb 2022, 9:51:19 UTC

This problem disappeared around lunchtime yesterday. I still have no idea why it occurred.
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 730
Credit: 504,428,790
RAC: 285,198
Message 46280 - Posted: 18 Feb 2022, 19:23:20 UTC - in response to Message 46250.  
Last modified: 18 Feb 2022, 19:48:07 UTC

Great, thanks, I will look around.

360 jobs for my computers, 12 failures (3%).

By exit code: 60307 47%, 8016 18%; 134, 50115 and 99999 12% each.

What are the causes? Maybe I could bring them down.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 46327 - Posted: 23 Feb 2022, 15:35:03 UTC - in response to Message 46280.  
Last modified: 23 Feb 2022, 15:40:22 UTC

Looking at https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobExitCodes:
60307 is Log Archive Failure, so that's a failure to transfer the log file /srv/job/WMTaskSpace/logArch1/logArchive.tar.gz to the data-bridge at https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/. I don't see any explicit reason in the logs I can access, so I'd have to put it down as a transient network problem. Not much you can do about that except make sure your connections aren't bandwidth-limited.
8016 is a fatal exception which appears to arise from something going wrong in track propagation in cmsRun:
An exception of category 'EventCorruption' occurred while
   [0] Processing  Event run: 1 lumi: 345105 event: 172552245 stream: 0
   [1] Running path 'FEVTDEBUGoutput_step'
   [2] Prefetching for module PoolOutputModule/'FEVTDEBUGoutput'
   [3] Calling method for module OscarMTProducer/'g4SimHits'
Exception Message:
SimG4CoreApplication exception in generation of event run: 1 lumi: 345105 event: 172552245 in stream 0 

-------- EEEE ------- G4Exception-START -------- EEEE -------
*** G4Exception : GeomNav0003
      issued by : G4Navigator::ComputeStep()
Stuck Track: potential geometry or navigation problem.
        Track stuck, not moving for 25 steps
        in volume -BeamTube11- at point (1.11409,-51.9838,3491.05)
        direction: (0.604979,-0.694067,0.39022).
-------- EEEE -------- G4Exception-END --------- EEEE -------
So there's probably nothing you can do about that!
If I recall correctly, the other three usually occur together (a job can return more than one exit code). 134 is Unix "Fatal error signal 6" which is SIGABRT, and that cascades down to cmsRun code 50115 - "cmsRun did not produce a valid job report at runtime (often means cmsRun segfaulted)" and then there's a further cascade down to 99999 which appears to be a catchall for "something went wrong but I'm not sure what". Again, that looks like a calculation problem.
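
The codes discussed in this thread can be collected into a small lookup, together with the common Unix convention that an exit status of 128 + N means the process was terminated by signal N (so 134 corresponds to signal 6, SIGABRT). This is only a sketch: the descriptions are paraphrased from the posts here, the function names are illustrative, and the JobExitCodes twiki page remains the authoritative list.

```python
# Descriptions paraphrased from this thread; the JobExitCodes twiki page is
# the authoritative list.
EXIT_CODES = {
    8002: "fatal exception: no Frontier server could be contacted",
    8016: "fatal exception (here a Geant4 stuck-track error in cmsRun)",
    50115: "cmsRun did not produce a valid job report at runtime",
    60307: "Log Archive Failure",
    99999: "catch-all for an unidentified failure",
}


def signal_from_exit_status(status):
    """Return N if status follows the 128 + N signal convention, else None."""
    if 128 < status < 256:
        return status - 128
    return None


def describe(code):
    """Look up a code, falling back to the 128 + N signal convention."""
    if code in EXIT_CODES:
        return EXIT_CODES[code]
    sig = signal_from_exit_status(code)
    if sig is not None:
        return f"terminated by signal {sig}"
    return "unknown exit code"


for code in (60307, 8016, 134, 50115, 99999):
    print(code, "->", describe(code))
```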
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 730
Credit: 504,428,790
RAC: 285,198
Message 46333 - Posted: 24 Feb 2022, 7:48:37 UTC - in response to Message 46327.  
Last modified: 24 Feb 2022, 7:56:29 UTC

Thanks for looking.

I'm not sure what I can do about the network. I have the fastest connection available to me, at up to 10 Gbit/s. The connection from the Squid server in my house is only 1 Gbit/s, as are those of most of the computers, so I could upgrade the NICs so that everything runs at 10 Gbit/s all the way through.

It seems to follow a fairly regular pattern.

maeax

Joined: 2 May 07
Posts: 1619
Credit: 74,951,054
RAC: 205,209
Message 46334 - Posted: 24 Feb 2022, 8:22:11 UTC - in response to Message 46333.  

My experience with Squid is to set it up on the fastest PC, in a CentOS 8 VM (following the Red Hat installation guide).
All my PCs run at 1 Gbit/s.
32 CMS tasks are running at the same time on Windows 11 Pro and Windows 10 Pro without errors.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 2048
Credit: 154,038,753
RAC: 151,386
Message 46335 - Posted: 24 Feb 2022, 9:09:37 UTC - in response to Message 46333.  

Toby's Grafana statistics currently show 12 failures with error code 60307.
This page gives a more detailed explanation:
https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#Exit_code_60307
Especially:
"if you are using EOS at cern: the problem is due to the restart of the SRM service making all connections fail. The only action in this case is to keep trying."

I suspect that one (of roughly 10) EOS target servers is unavailable during the upload attempt.
In this case the task retries after a few minutes, and since the target system is selected through DNS, the retry very likely gets a different (hopefully responding) target server.
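
As a rough illustration of why a retry usually helps in that scenario, the sketch below computes the chance that every attempt lands on an unavailable server. It assumes each attempt picks one of the servers uniformly and independently, which is a simplification: real DNS selection and client-side caching may behave differently. The figure of 1 bad server out of 10 is the estimate from the post.

```python
from fractions import Fraction


def prob_all_attempts_fail(servers_total, servers_down, attempts):
    """Chance that every attempt lands on an unavailable server.

    Assumes uniform, independent server selection on each attempt.
    """
    p_down = Fraction(servers_down, servers_total)
    return p_down ** attempts


for attempts in (1, 2, 3):
    p = prob_all_attempts_fail(servers_total=10, servers_down=1, attempts=attempts)
    print(f"{attempts} attempt(s): {float(p):.3%} chance of hitting the bad server every time")
```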


I don't know whether the first failed attempt counts as an error in the statistics.
Ivan may know the experts who can clarify this.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 46350 - Posted: 24 Feb 2022, 16:10:46 UTC - in response to Message 46335.  

I do see dual reports of failure in 60307 jobs, but they are not spaced in time:
Message: Command exited non-zero, ExitCode:5
Output: stdout: Thu Feb 24 00:21:31 EST 2022
Copying 331895 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-
...
Message: Command exited non-zero, ExitCode:5
Output: stdout: Thu Feb 24 00:21:31 EST 2022
Copying 331895 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-
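
For a sense of scale, the log archive in these messages is 331895 bytes. The sketch below gives the idealised transfer time at the two link speeds mentioned in this thread, assuming the line rate is the only limit. It ignores latency, TLS setup and protocol overhead, so real transfers will take longer.

```python
LOG_ARCHIVE_BYTES = 331_895  # size reported in the log excerpt


def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealised transfer time: payload bits divided by line rate."""
    return size_bytes * 8 / link_bits_per_second


for label, rate in (("1 Gbit/s", 1e9), ("10 Gbit/s", 1e10)):
    ms = transfer_seconds(LOG_ARCHIVE_BYTES, rate) * 1000
    print(f"{label}: {ms:.2f} ms")
```

Under these assumptions the payload itself takes only milliseconds, which would suggest that raw link speed is unlikely to be the limiting factor for this particular upload.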

Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 730
Credit: 504,428,790
RAC: 285,198
Message 46352 - Posted: 24 Feb 2022, 17:13:51 UTC - in response to Message 46334.  

I'm not sure what counts as fastest, but it's running on this computer: Xeon Platinum 8249C.

It is running WUs at the same time, though, so there is some performance loss there.
maeax

Joined: 2 May 07
Posts: 1619
Credit: 74,951,054
RAC: 205,209
Message 46353 - Posted: 24 Feb 2022, 19:18:06 UTC

eoscms shows this error for me:
root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/TESTBED/ireid_TC_SLC7_IDR_CMS_Home_220223_155724_7756/SinglePiE50HCAL_pythia8_2018_GenSimFull/vocms0267.cern.ch-150443-0-log.tar.gz
ErrorCode : 60311
ErrorType : GeneralStageOutFailure

No new job has started inside the task for two hours now.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 46461 - Posted: 18 Mar 2022, 8:17:15 UTC

At this point I could just cut-'n'-paste the original message in the thread, verbatim...
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 46463 - Posted: 18 Mar 2022, 10:45:00 UTC - in response to Message 46461.  

It seems to be recovering now. I haven't found anything on the message boards or in the service interruption logs, so the source is still a mystery.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 47222 - Posted: 4 Sep 2022, 15:48:37 UTC

There seems to be a general problem with CMS jobs at the moment (if you click on the backarrow on one of the job graphs -- not your browser's backarrow -- you can select to see data from all CMS sites, not just ours). There were a few hours of reporting no jobs running; now that is more normal but completed jobs are still showing zero. This suggests a database problem, but I've not yet seen any reports of outages.
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 893
Credit: 5,852,906
RAC: 1
Message 47225 - Posted: 5 Sep 2022, 14:00:41 UTC - in response to Message 47222.  

Other production sources were showing completed production jobs again, but not us. A DBS component of our WMAgent was restarted and now we are showing completed jobs again, too.
Ryan Munro

Joined: 17 Aug 17
Posts: 25
Credit: 1,224,844
RAC: 884
Message 47235 - Posted: 6 Sep 2022, 14:49:28 UTC - in response to Message 47225.  

I have 4 jobs in my queue that are just sitting at "waiting to run", even when I pause all other jobs. Is that related?



©2022 CERN