Problems connecting to servers?
Author | Message |
---|---|
Joined: 24 May 23 Posts: 48 Credit: 4,119,070 RAC: 18,377 |
I have two CMS tasks in this situation at the moment: 2024-08-15 19:39:54,399:ERROR:StageOutImpl:Attempt 1 to stage out failed. Automatically retrying in 300 secs Error details: <@========== WMException Start ==========@> Exception Class: StageOutError Message: Command exited non-zero, ExitCode:112 Output: stdout: Thu Aug 15 19:29:51 CEST 2024 WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory WARNING (SEToken) Could not retrieve any token for https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory WARNING (SEToken) Could not retrieve any token for https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz Copying 1465769 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz event: [1723743292759] BOTH GFAL2:CORE:COPY LIST:ENTER event: [1723743292759] BOTH GFAL2:CORE:COPY LIST:ITEM file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => 
https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz event: [1723743292759] BOTH GFAL2:CORE:COPY LIST:EXIT event: [1723743292759] BOTH http_plugin PREPARE:ENTER file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory WARNING (SEToken) Could not retrieve any token for https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz gfal-copy exit status: 112 ERROR: gfal-copy exited with 112 Cleaning up failed file: Thu Aug 15 19:37:23 CEST 2024 https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz FAILED stderr: /srv/startup_environment.sh: line 3: BASHOPTS: readonly variable /srv/startup_environment.sh: line 10: BASH_VERSINFO: readonly variable /srv/startup_environment.sh: line 33: EUID: readonly variable /srv/startup_environment.sh: line 148: PPID: readonly variable /srv/startup_environment.sh: line 156: SHELLOPTS: readonly variable /srv/startup_environment.sh: line 173: UID: readonly variable /srv/startup_environment.sh: line 203: syntax error near 
unexpected token `(' /srv/startup_environment.sh: line 203: `export probe_cvmfs_repos () ' gfal-copy error: 112 (Host is down) - DESTINATION OVERWRITE Result Could not connect to server after 1 attempts /srv/startup_environment.sh: line 3: BASHOPTS: readonly variable /srv/startup_environment.sh: line 10: BASH_VERSINFO: readonly variable /srv/startup_environment.sh: line 33: EUID: readonly variable /srv/startup_environment.sh: line 148: PPID: readonly variable /srv/startup_environment.sh: line 156: SHELLOPTS: readonly variable /srv/startup_environment.sh: line 173: UID: readonly variable /srv/startup_environment.sh: line 203: syntax error near unexpected token `(' /srv/startup_environment.sh: line 203: `export probe_cvmfs_repos () ' gfal-rm error: 112 (Host is down) - Result Could not connect to server after 1 attempts ClassName : None ModuleName : WMCore.Storage.StageOutError MethodName : __init__ ClassInstance : None FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py LineNumber : 32 ErrorNr : 0 Command : #!/bin/bash env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-copy -t 2400 -T 2400 -p -v --abort-on-failure file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz' EXIT_STATUS=$? echo "gfal-copy exit status: $EXIT_STATUS" if [[ $EXIT_STATUS != 0 ]]; then echo "ERROR: gfal-copy exited with $EXIT_STATUS" echo "Cleaning up failed file:" env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. 
$JOBSTARTDIR/startup_environment.sh; date; gfal-rm -t 600 https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz ' fi exit $EXIT_STATUS ExitCode : 112 ErrorCode : 60311 ErrorType : GeneralStageOutFailure Traceback: <@---------- WMException End ----------@> Bye. |
Joined: 24 May 23 Posts: 48 Credit: 4,119,070 RAC: 18,377 |
It looks like they're running again, now. Bye. |
Joined: 29 Aug 05 Posts: 1072 Credit: 8,401,013 RAC: 5,916 |
|
Joined: 24 Oct 04 Posts: 1193 Credit: 58,946,960 RAC: 63,211 |
I see we had a few of these today, but maybe it has been updated (I stopped running them myself): https://lhcathome.cern.ch/lhcathome/result.php?resultid=413805611 https://lhcathome.cern.ch/lhcathome/result.php?resultid=413485212 VM Completion Message: Could not connect to all required network services |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 15,522 |
Overnight, five jobs inside the CMS task finished. The sixth finished at 6:28 UTC; since then the task has been doing nothing. 2024-10-17 08:28:54,195:INFO:Report:addOutputFile method fileRef: , whole tree: {} 2024-10-17 08:28:54,195:INFO:LogArchive:Success job! Not saving its logs to CERN EOS recent area. 2024-10-17 08:28:54,196:INFO:LogArchive:Steps.Executors.LogArchive.post called 2024-10-17 08:28:54,197:INFO:ExecuteMaster:StepName: logArch1, StepType: LogArchive, with result: 0 2024-10-17 08:28:54,197:INFO:Watchdog:MonitorThread: JobEnded 2024-10-17 08:28:54,197:INFO:Watchdog:MonitorState: Shutdown called 2024-10-17 08:28:54,197:INFO:Startup:Completing task at directory: /srv/job/WMTaskSpace 2024-10-17 08:28:54,198:INFO:WMTask:Looking for master report at /srv/job/WMTaskSpace/../../Report.0.pkl 2024-10-17 08:28:54,198:INFO:WMTask: found it! 2024-10-17 08:28:54,198:INFO:WMTask:Looking for a taskStep report at /srv/job/WMTaskSpace/cmsRun1/Report.pkl 2024-10-17 08:28:54,198:INFO:WMTask: found it! 2024-10-17 08:28:54,199:INFO:WMTask:Looking for a taskStep report at /srv/job/WMTaskSpace/stageOut1/Report.pkl 2024-10-17 08:28:54,199:INFO:WMTask: found it! 2024-10-17 08:28:54,199:INFO:WMTask:Looking for a taskStep report at /srv/job/WMTaskSpace/logArch1/Report.pkl 2024-10-17 08:28:54,200:INFO:WMTask: found it! 2024-10-17 08:28:54,200:INFO:Startup:Shutting down monitor Is this task waiting for the 18-hour shutdown? My ISP had a disconnect at 06:20 UTC. That was the reason, sorry. |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 15,522 |
Finished at 8:36 UTC. This evening I'll start a new CMS task to run overnight. |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 15,522 |
Finished at 8:36 UTC. Eight jobs inside the task finished successfully! Run time: 14 hours 0 min 0 sec. CPU time: 2 days 0 hours 32 min 26 sec. |
Joined: 3 Nov 12 Posts: 68 Credit: 150,046,597 RAC: 124,944 |
For now all WUs fail with [ERROR] Could not connect to vocms0840.cern.ch on port 9618 |
Joined: 24 Oct 04 Posts: 1193 Credit: 58,946,960 RAC: 63,211 |
Same here... If I hadn't looked just now I would have lost hundreds of these CMS tasks. But I haven't checked them all yet, so I might have. VM Completion Message: Could not connect to all required network services (same thing at -dev, btw) |
Joined: 7 Aug 11 Posts: 105 Credit: 26,099,112 RAC: 1,414 |
95 down with this. Here's the first, whatever happened to the server happened near the end of this unit https://lhcathome.cern.ch/lhcathome/result.php?resultid=416589013 2024-11-16 20:16:38 (1131410): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt. 2024-11-16 20:16:38 (1131410): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory 2024-11-16 20:16:38 (1131410): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1) 2024-11-16 20:16:38 (1131410): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 137.138.156.85:9618 (IOD #1) EID 8 2024-11-16 20:16:38 (1131410): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT TIMEOUT for EID 8 [137.138.156.85:9618] 2024-11-16 20:16:38 (1131410): Guest Log: Ncat: Connection timed out. 2024-11-16 20:16:38 (1131410): Guest Log: [ERROR] Could not connect to vocms0840.cern.ch on port 9618 2024-11-16 20:16:38 (1131410): Guest Log: [INFO] Testing connection to WMAgent 2024-11-16 20:16:39 (1131410): Guest Log: [INFO] Testing connection to EOSCMS 2024-11-16 20:16:39 (1131410): Guest Log: [INFO] Testing connection to CMS-Factory 2024-11-16 20:16:40 (1131410): Guest Log: [INFO] Testing connection to CMS-Frontier 2024-11-16 20:16:40 (1131410): Guest Log: [INFO] Testing connection to Frontier 2024-11-16 20:16:41 (1131410): Guest Log: [DEBUG] Check your firewall and your network load 2024-11-16 20:16:41 (1131410): Guest Log: [ERROR] Could not connect to all required network services 2024-11-16 20:16:41 (1131410): Guest Log: [DEBUG] Volunteer: Dark Angel (268818) 2024-11-16 20:16:41 (1131410): Guest Log: [INFO] Shutting Down. 2024-11-16 20:17:11 (1131410): VM Completion File Detected. 2024-11-16 20:17:11 (1131410): VM Completion Message: Could not connect to all required network services |
Joined: 18 Dec 15 Posts: 1840 Credit: 126,183,857 RAC: 123,286 |
"For now all WUs fail with" Too bad that I didn't notice it until this morning: thousands of failing tasks on my 20 hosts all night long :-( |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 15,522 |
2024-11-17 10:53:17 (29336): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat ) 2024-11-17 10:53:17 (29336): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt. 2024-11-17 10:53:17 (29336): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory 2024-11-17 10:53:17 (29336): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1) 2024-11-17 10:53:17 (29336): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 137.138.156.85:9618 (IOD #1) EID 8 2024-11-17 10:53:17 (29336): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT TIMEOUT for EID 8 [137.138.156.85:9618] |
Joined: 24 Oct 04 Posts: 1193 Credit: 58,946,960 RAC: 63,211 |
|
Joined: 29 Aug 05 Posts: 1072 Credit: 8,401,013 RAC: 5,916 |
"2024-11-17 10:53:17 (29336): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )" I'm not sure exactly what the problem is. vocms0840 is to be taken out of service and replaced by a newer AlmaLinux9 VM, but we've been waiting for confirmation that the updated scripts are ready for the changeover. Either the machine has been taken out of service without our being notified, or it has developed a communication problem. |
Joined: 9 Feb 08 Posts: 55 Credit: 1,521,616 RAC: 3,319 |
My computer: OpenSuSE Tumbleweed [6.11.8-1-default|libc 2.40] i7-4790k, 32 GB RAM, 2 TB M.2 SSD, nVidia RTX 2060 (driver: 550.99 OpenCL: 3.0). Virtualbox (7.1.4_SUSEr165100) BOINC version 8.0.4 Full spec: 10860321 My PC looks like this - CMS tasks are failing to start. For example: Task 416651829 (work unit 227670244, computer 10860321): sent 17 Nov 2024, 19:45:47 UTC; time reported or deadline 20:07:00 UTC; status Error while computing; run time 167.04; CPU time 24.57; application CMS Simulation v70.30 (vbox64_mt_mcore_cms) x86_64-pc-linux-gnu. Task 416648976 (work unit 227667391, computer 10860321): sent 17 Nov 2024, 18:29:17 UTC; time reported or deadline 19:45:47 UTC; status Error while computing; run time 161.43; CPU time 22.69; application CMS Simulation v70.30 (vbox64_mt_mcore_cms) x86_64-pc-linux-gnu. Task 416646675 (work unit 227665093, computer 10860321): sent 17 Nov 2024, 17:19:14 UTC; time reported or deadline 18:23:04 UTC; status Error while computing; run time 156.96; CPU time 23.88; application CMS Simulation v70.30 (vbox64_mt_mcore_cms) x86_64-pc-linux-gnu. stderr for above tasks: 416651829 416648976 416646675 The following error occurs towards the end of each of the above stderr outputs: ... 2024-11-17 20:03:37 (14664): Guest Log: [INFO] Testing connection to HTCondor 2024-11-17 20:03:53 (14664): Guest Log: [DEBUG] Status run 1 of up to 3: 1 2024-11-17 20:04:14 (14664): Guest Log: [DEBUG] Status run 2 of up to 3: 1 2024-11-17 20:04:39 (14664): Guest Log: [DEBUG] Status run 3 of up to 3: 1 2024-11-17 20:04:39 (14664): Guest Log: [DEBUG] run 1 2024-11-17 20:04:39 (14664): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat ) 2024-11-17 20:04:39 (14664): Guest Log: Ncat: Connection timed out. 2024-11-17 20:04:39 (14664): Guest Log: run 2 2024-11-17 20:04:39 (14664): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat ) 2024-11-17 20:04:39 (14664): Guest Log: Ncat: Connection timed out. 2024-11-17 20:04:39 (14664): Guest Log: run 3 2024-11-17 20:04:39 (14664): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat ) 2024-11-17 20:04:39 (14664): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt. 
2024-11-17 20:04:39 (14664): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory 2024-11-17 20:04:39 (14664): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1) 2024-11-17 20:04:39 (14664): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 137.138.156.85:9618 (IOD #1) EID 8 2024-11-17 20:04:39 (14664): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT TIMEOUT for EID 8 [137.138.156.85:9618] 2024-11-17 20:04:39 (14664): Guest Log: Ncat: Connection timed out. 2024-11-17 20:04:39 (14664): Guest Log: [ERROR] Could not connect to vocms0840.cern.ch on port 9618 2024-11-17 20:04:39 (14664): Guest Log: [INFO] Testing connection to WMAgent 2024-11-17 20:04:39 (14664): Guest Log: [INFO] Testing connection to EOSCMS 2024-11-17 20:04:40 (14664): Guest Log: [INFO] Testing connection to CMS-Factory 2024-11-17 20:04:40 (14664): Guest Log: [INFO] Testing connection to CMS-Frontier 2024-11-17 20:04:40 (14664): Guest Log: [INFO] Testing connection to Frontier 2024-11-17 20:04:40 (14664): Guest Log: [DEBUG] Check your firewall and your network load 2024-11-17 20:04:40 (14664): Guest Log: [ERROR] Could not connect to all required network services ... Any help with this would be welcome. Thanks. |
Joined: 31 Oct 16 Posts: 2 Credit: 24,836,174 RAC: 13,863 |
Hej and Hello. I also got these errors today for all CMS tasks - they just run for a few minutes and then fail: ... 2024-11-17 21:03:46 (476): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt. 2024-11-17 21:03:46 (476): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory 2024-11-17 21:03:46 (476): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1) 2024-11-17 21:03:46 (476): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 137.138.156.85:9618 (IOD #1) EID 8 2024-11-17 21:03:46 (476): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [137.138.156.85:9618] 2024-11-17 21:03:46 (476): Guest Log: Ncat: Connection refused. 2024-11-17 21:03:46 (476): Guest Log: [ERROR] Could not connect to vocms0840.cern.ch on port 9618 THX Sputnik |
Joined: 2 May 07 Posts: 2260 Credit: 175,581,097 RAC: 15,522 |
Ivan has the answer, one message before yours. |
Joined: 29 Aug 05 Posts: 1072 Credit: 8,401,013 RAC: 5,916 |
The definitive answer is that the firewall to the HTCondor machine was closed on Saturday, before the Submission Infrastructure team were able to activate its substitute (there seems to have been at least one mix-up with service tickets being misdirected). Sorry about this, it's one of the disadvantages of such a long "supply chain" where people responsible for one part don't necessarily know who is dependent on it downstream <:frowny face:>. |
Joined: 9 Feb 08 Posts: 55 Credit: 1,521,616 RAC: 3,319 |
Thanks Ivan. I do appreciate you reconfirming this! The definitive answer is... |
Joined: 18 Dec 15 Posts: 1840 Credit: 126,183,857 RAC: 123,286 |
"The definitive answer is that the firewall to the HTCondor machine was closed on Saturday, before the Submission Infrastructure team were able to activate its substitute (there seems to have been at least one mix-up with service tickets being misdirected). Sorry about this, it's one of the disadvantages of such a long "supply chain" where people responsible for one part don't necessarily know who is dependent on it downstream <:frowny face:>." Thanks, Ivan, for the information. It could well be that it will take a while until everything works again. So how about stopping the download queue until everything is straightened out? |
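
For anyone who wants to check their own stderr output for this failure, here is a minimal Python sketch. It is not part of the CMS VM scripts, and the function names (`failed_endpoints`, `classify`, `probe`) are made up for illustration. It pulls the host and port out of the `[ERROR] Could not connect to ... on port ...` lines quoted above and makes one TCP connection attempt, separating the two outcomes seen in this thread: "Connection timed out" and "Connection refused". As a rough rule of thumb (an assumption, not something confirmed here), a timeout often means packets are being dropped on the way, for example by a firewall, while a refusal means the host answered but nothing accepted the connection.

```python
import re
import socket

# Matches the guest-log line quoted in the thread, for example:
# "[ERROR] Could not connect to vocms0840.cern.ch on port 9618"
ERROR_RE = re.compile(r"\[ERROR\] Could not connect to (\S+) on port (\d+)")


def failed_endpoints(log_text):
    """Return unique (host, port) pairs from connection-error lines,
    in order of first appearance."""
    seen = []
    for host, port in ERROR_RE.findall(log_text):
        pair = (host, int(port))
        if pair not in seen:
            seen.append(pair)
    return seen


def classify(exc):
    """Map a socket exception to a short label (labels are illustrative)."""
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "refused"
    if isinstance(exc, socket.gaierror):
        return "dns-failure"
    return "error"


def probe(host, port, timeout=10.0):
    """Attempt one TCP connection; return 'open' or a failure label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        return classify(exc)
```

To use it, read a task's stderr text into a string, call `failed_endpoints(text)`, and pass each pair to `probe(host, port)`. The sketch makes a single attempt per endpoint, whereas the guest log above shows up to three runs before the VM gives up.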
©2025 CERN