Message boards : CMS Application : Network issue?
Joined: 29 Aug 05 Posts: 961 Credit: 6,231,115 RAC: 0
We are having a lot of job failures (50%) in the last few hours where jobs are unable to contact the Frontier database servers. I presume this is a problem at CERN, but a quick search doesn't show anything relevant on the service availability page, and there's nothing in my mailbox. I'll investigate more thoroughly once I get in to work.
Joined: 15 Jun 08 Posts: 2285 Credit: 207,715,163 RAC: 143,110
I'm currently running >40 CMS tasks concurrently and have had >200,000 requests to cms-frontier.openhtc.io since 23:00 UTC. Roughly 5,000 requests were made to refresh the local Squid cache contents, and so far none of them failed or produced unusual logfile records.
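For anyone who wants to pull similar numbers from their own cache, here is a minimal sketch that tallies Frontier traffic from Squid's access log. It assumes Squid's default "native" log format (result code in field 4, URL in field 7) and the default log path; both the path and the miss/refresh heuristic are assumptions you may need to adjust for your installation.

```python
#!/usr/bin/env python3
"""Rough tally of Frontier traffic seen by a local Squid cache."""
from collections import Counter

LOG_PATH = "/var/log/squid/access.log"   # assumption: default log location
FRONTIER_HOST = "cms-frontier.openhtc.io"

totals = Counter()
with open(LOG_PATH) as log:
    for line in log:
        fields = line.split()
        if len(fields) < 7 or FRONTIER_HOST not in fields[6]:
            continue
        totals["requests"] += 1
        result = fields[3]   # e.g. TCP_MEM_HIT/200, TCP_REFRESH_UNMODIFIED/304
        if "MISS" in result or "REFRESH" in result:
            totals["upstream"] += 1   # Squid had to go to the Frontier servers

print(f"Frontier requests: {totals['requests']}, "
      f"upstream fetches/refreshes: {totals['upstream']}")
```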
Joined: 29 Aug 05 Posts: 961 Credit: 6,231,115 RAC: 0
Ah, it might not be as general as I thought. Several machines are dominating the number of 8002 "fatal exception" errors that indicate a failure to contact any of the Frontier servers, so it seems like it's not a systemic failure. Here are the counts of 8002 failures in the last 24 hours, tallied for individual machines. I don't think I can give the actual machine-IDs, due to privacy laws in Europe.

Machine | 8002 failures
---|---
Machine 1 | 1
Machine 2 | 1
Machine 3 | 2
Machine 4 | 3
Machine 5 | 3
Machine 6 | 5
Machine 7 | 8
Machine 8 | 12
Machine 9 | 33
Machine 10 | 38
Machine 11 | 40
Machine 12 | 55
Machine 13 | 102
Machine 14 | 103
Machine 15 | 128
Machine 16 | 138
Machine 17 | 148
Machine 18 | 160
Machine 19 | 275
Machine 20 | 466

...and the same, tallied by user-ID:

User | 8002 failures
---|---
User 1 | 1
User 2 | 1
User 3 | 2
User 4 | 3
User 5 | 3
User 6 | 8
User 7 | 12
User 8 | 33
User 9 | 40
User 10 | 43
User 11 | 55
User 12 | 102
User 13 | 128
User 14 | 138
User 15 | 148
User 16 | 160
User 17 | 275
User 18 | 569
Joined: 27 Sep 08 Posts: 780 Credit: 619,683,868 RAC: 207,765
How can I see my own jobs? On Theory you could reverse-engineer it on mcplots, but on Grafana I'm not sure. The BOINC jobs are fine, of course.
Joined: 29 Aug 05 Posts: 961 Credit: 6,231,115 RAC: 0
> How can I see my own jobs? On Theory you could reverse-engineer it on mcplots, but on Grafana I'm not sure.

Toby, I think you have CERN credentials, am I right? The web page I got these data from is https://monit-grafana.cern.ch/d/GybUsU6Gz/wmarchive-monit-copy?orgId=11&from=now-24h&to=now&var-exitCodes=All&var-wn_name=All&var-campaign=All&var-jobtype=All&var-host=vocms0267.cern.ch&var-site=T3_CH_CMSAtHome&var-site=T3_CH_Volunteer&var-jobstate=All

The "Exit codes - WN Name" panel lists exit codes (errors only!) against worker-node name, which is made up of {user-ID}-{machine-ID}-{random}, where {random} varies for each VM instantiation (i.e. each BOINC task). The Grafana interface seems infinitely configurable, so you might be able to come up with a config that lists your machines. But please don't save any of your explorations to that particular URL! Email me if you want to go further; by this time I doubt that I'm anonymous...
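If you export that panel's data, the per-machine tally is easy to reproduce by splitting the worker-node name as described above. The sketch below is only illustrative: the CSV file name and the column names ("wn_name", "exit_code") are assumptions about a hypothetical export, not the actual panel schema.

```python
#!/usr/bin/env python3
"""Tally exit-code 8002 failures per machine from a (hypothetical) Grafana CSV export."""
import csv
from collections import Counter

per_machine = Counter()
with open("wmarchive_export.csv", newline="") as f:      # hypothetical file name
    for row in csv.DictReader(f):
        if row.get("exit_code") != "8002":
            continue
        # wn_name follows the {user-ID}-{machine-ID}-{random} convention
        parts = row["wn_name"].split("-")
        if len(parts) >= 2:
            user_id, machine_id = parts[0], parts[1]
            per_machine[(user_id, machine_id)] += 1

for (user_id, machine_id), n in per_machine.most_common():
    print(f"user {user_id}  machine {machine_id}: {n} failures")
```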
Joined: 27 Sep 08 Posts: 780 Credit: 619,683,868 RAC: 207,765
Great, thanks, will look around. Around 360 jobs for my computers, 12 fails (3%): 47% were 60307, 18% were 8016, and 134, 50115 and 99999 were 12% each. Which of these could I maybe bring down?
Joined: 29 Aug 05 Posts: 961 Credit: 6,231,115 RAC: 0
> Great, thanks, will look around.

Looking at https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobExitCodes:

60307 is Log Archive Failure, so that's a failure to transfer the log file /srv/job/WMTaskSpace/logArch1/logArchive.tar.gz to the data-bridge at https://data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/. I don't see any explicit reason in the logs I can access, so I'd have to put it down as a transient network problem. Not much you can do about that except make sure your connections aren't bandwidth-limited.

8016 is a fatal exception which appears to arise from something going wrong in track propagation in cmsRun:

An exception of category 'EventCorruption' occurred while
[0] Processing Event run: 1 lumi: 345105 event: 172552245 stream: 0
[1] Running path 'FEVTDEBUGoutput_step'
[2] Prefetching for module PoolOutputModule/'FEVTDEBUGoutput'
[3] Calling method for module OscarMTProducer/'g4SimHits'
Exception Message:
SimG4CoreApplication exception in generation of event run: 1 lumi: 345105 event: 172552245 in stream 0
-------- EEEE ------- G4Exception-START -------- EEEE -------
*** G4Exception : GeomNav0003
issued by : G4Navigator::ComputeStep()
Stuck Track: potential geometry or navigation problem.
Track stuck, not moving for 25 steps in volume -BeamTube11- at point (1.11409,-51.9838,3491.05)
direction: (0.604979,-0.694067,0.39022).
-------- EEEE -------- G4Exception-END --------- EEEE -------

So there's probably nothing you can do about that!

If I recall correctly, the other three usually occur together (a job can return more than one exit code). 134 is Unix "Fatal error signal 6", which is SIGABRT; that cascades down to cmsRun code 50115 -- "cmsRun did not produce a valid job report at runtime (often means cmsRun segfaulted)" -- and then there's a further cascade down to 99999, which appears to be a catch-all for "something went wrong but I'm not sure what". Again, that looks like a calculation problem.
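For quick reference when reading the Grafana panel, the codes discussed in this post can be collected into a small lookup table. This is just a convenience sketch; the one-line descriptions are paraphrased from the posts above and the JobExitCodes twiki, not an official mapping.

```python
# Exit codes discussed in this thread; descriptions paraphrased from the
# posts above and https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobExitCodes
EXIT_CODES = {
    8002:  "fatal exception (here: could not contact any Frontier server)",
    8016:  "fatal exception inside cmsRun (e.g. Geant4 stuck-track EventCorruption)",
    60307: "log archive stage-out to the data-bridge failed (often transient network)",
    134:   "process killed by SIGABRT (Unix fatal signal 6)",
    50115: "cmsRun did not produce a valid job report (often a segfault)",
    99999: "catch-all: something went wrong, cause unknown",
}

def describe(code: int) -> str:
    """Return a human-readable description for a job exit code."""
    return EXIT_CODES.get(code, "unknown exit code")

if __name__ == "__main__":
    for code in (60307, 8016, 134, 50115, 99999):
        print(f"{code}: {describe(code)}")
```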
Joined: 27 Sep 08 Posts: 780 Credit: 619,683,868 RAC: 207,765
Thanks for looking. I'm not sure what I can do about the network; I have the fastest connection available to me, at up to 10 Gb/s, but the connection from the Squid server in my house is only 1 Gb/s, as are most of the computers, so I could upgrade the NICs so that everything runs at 10 Gb/s all the way through. It seems to be a fairly regular pattern.
Joined: 2 May 07 Posts: 1883 Credit: 145,309,938 RAC: 103,175
My experience with Squid is to build it on the fastest PC, in a CentOS 8 VM (following the Red Hat installation guide). All PCs are running at 1 Gbit/s. 32 CMS tasks on Win11 Pro and Win10 Pro run at the same time without errors.
Joined: 15 Jun 08 Posts: 2285 Credit: 207,715,163 RAC: 143,110
Toby's Grafana statistics currently show 12 failures with error code 60307. This page gives a more detailed explanation: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideCrabFaq#Exit_code_60307

Especially: "if you are using EOS at cern: the problem is due to the restart of the SRM service making all connections fail. The only action in this case is to keep trying."

I would suspect that one (of roughly 10) EOS target servers is unavailable during the upload attempt. In this case the task retries after a few minutes, and since the target system is selected through DNS it is very likely the retry gets a different (hopefully responding) target server. I don't know whether the first failed attempt counts as an error in the statistics. Ivan may know the experts who can clarify this.
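The retry behaviour described here is the usual round-robin DNS pattern: the name resolves to several addresses, so a later attempt will often land on a different server. Below is a minimal illustration of that idea, not the actual CMS stage-out code; the host name is taken from the discussion, and the upload function and timings are placeholders.

```python
#!/usr/bin/env python3
"""Illustration: retrying against a round-robin DNS name picks different targets."""
import random
import socket
import time

TARGET = "data-bridge.cern.ch"   # stage-out endpoint mentioned above
RETRIES = 3
RETRY_DELAY_S = 5                # the real job waits a few minutes; shortened here

def upload(addr: str) -> bool:
    """Placeholder for the real HTTPS copy; pretend ~10% of targets are down."""
    return random.random() > 0.1

for attempt in range(1, RETRIES + 1):
    # Each resolution may return several A records; pick one per attempt.
    addrs = [ai[4][0] for ai in socket.getaddrinfo(TARGET, 443, proto=socket.IPPROTO_TCP)]
    addr = random.choice(addrs)
    print(f"attempt {attempt}: trying {addr}")
    if upload(addr):
        print("stage-out succeeded")
        break
    time.sleep(RETRY_DELAY_S)
else:
    print("stage-out failed after all retries")
```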
Joined: 29 Aug 05 Posts: 961 Credit: 6,231,115 RAC: 0
> Toby's Grafana statistics currently show 12 failures with error code 60307.

I do see dual reports of failure in 60307 jobs, but they are not spaced in time:

Message: Command exited non-zero, ExitCode:5
Output: stdout: Thu Feb 24 00:21:31 EST 2022 Copying 331895 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms- ...

Message: Command exited non-zero, ExitCode:5
Output: stdout: Thu Feb 24 00:21:31 EST 2022 Copying 331895 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-
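To back up the "not spaced in time" observation, the timestamps of the two reports can be compared directly. A minimal sketch, assuming the log lines keep the format shown above (the "EST" zone is dropped before parsing, since strptime handles zone abbreviations unreliably):

```python
#!/usr/bin/env python3
"""Check the spacing between the two 60307 failure reports."""
import re
from datetime import datetime

log_lines = [
    "Thu Feb 24 00:21:31 EST 2022 Copying 331895 bytes ...",
    "Thu Feb 24 00:21:31 EST 2022 Copying 331895 bytes ...",
]

stamps = []
for line in log_lines:
    m = re.match(r"(\w{3} \w{3} \d+ \d\d:\d\d:\d\d) \w+ (\d{4})", line)
    if m:
        stamps.append(datetime.strptime(f"{m.group(1)} {m.group(2)}",
                                        "%a %b %d %H:%M:%S %Y"))

if len(stamps) == 2:
    delta = abs((stamps[1] - stamps[0]).total_seconds())
    print(f"reports are {delta:.0f} s apart")   # 0 s -> same attempt, not a retry
```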
Joined: 27 Sep 08 Posts: 780 Credit: 619,683,868 RAC: 207,765
I'm not sure what counts as fastest, but it's running on this computer, a Xeon Platinum 8249C. It's running WUs at the same time, though, so there is some performance loss there.
Joined: 2 May 07 Posts: 1883 Credit: 145,309,938 RAC: 103,175
eoscms shows this error for me:

root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/TESTBED/ireid_TC_SLC7_IDR_CMS_Home_220223_155724_7756/SinglePiE50HCAL_pythia8_2018_GenSimFull/vocms0267.cern.ch-150443-0-log.tar.gz
ErrorCode: 60311
ErrorType: GeneralStageOutFailure

Two hours with no new job inside the task now.
Joined: 29 Aug 05 Posts: 961 Credit: 6,231,115 RAC: 0
There seems to be a general problem with CMS jobs at the moment (if you click on the back arrow on one of the job graphs -- not your browser's back arrow -- you can select to see data from all CMS sites, not just ours). There were a few hours reporting no jobs running; now that is more normal, but completed jobs are still showing zero. This suggests a database problem, but I've not yet seen any reports of outages.
Joined: 29 Aug 05 Posts: 961 Credit: 6,231,115 RAC: 0
> There seems to be a general problem with CMS jobs at the moment ... This suggests a database problem, but I've not yet seen any reports of outages.

Other production sources were showing completed production jobs again, but not us. A DBS component of our WMAgent was restarted, and now we are showing completed jobs again, too.
Joined: 17 Aug 17 Posts: 39 Credit: 3,870,615 RAC: 2,032
I have 4 jobs in my queue that are just sitting at "waiting to run", even when I pause all other jobs?