Problems connecting to servers?
| Author | Message |
|---|---|
|
Joined: 24 May 23 Posts: 56 Credit: 5,836,980 RAC: 67,634 |
I have two CMS tasks in this situation at the moment:
2024-08-15 19:39:54,399:ERROR:StageOutImpl:Attempt 1 to stage out failed.
Automatically retrying in 300 secs
Error details:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Command exited non-zero, ExitCode:112
Output: stdout: Thu Aug 15 19:29:51 CEST 2024
WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory
WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory
WARNING (SEToken) Could not retrieve any token for https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz
WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory
WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory
WARNING (SEToken) Could not retrieve any token for https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz
Copying 1465769 bytes file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz
event: [1723743292759] BOTH GFAL2:CORE:COPY LIST:ENTER
event: [1723743292759] BOTH GFAL2:CORE:COPY LIST:ITEM file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz
event: [1723743292759] BOTH GFAL2:CORE:COPY LIST:EXIT
event: [1723743292759] BOTH http_plugin PREPARE:ENTER file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz
WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory
WARNING Could not load the user credentials: impossible to open : : error:02001002:system library:fopen:No such file or directory
WARNING (SEToken) Could not retrieve any token for https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz
gfal-copy exit status: 112
ERROR: gfal-copy exited with 112
Cleaning up failed file:
Thu Aug 15 19:37:23 CEST 2024
https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz FAILED
stderr: /srv/startup_environment.sh: line 3: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 10: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 33: EUID: readonly variable
/srv/startup_environment.sh: line 148: PPID: readonly variable
/srv/startup_environment.sh: line 156: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 173: UID: readonly variable
/srv/startup_environment.sh: line 203: syntax error near unexpected token `('
/srv/startup_environment.sh: line 203: `export probe_cvmfs_repos () '
gfal-copy error: 112 (Host is down) - DESTINATION OVERWRITE Result Could not connect to server after 1 attempts
/srv/startup_environment.sh: line 3: BASHOPTS: readonly variable
/srv/startup_environment.sh: line 10: BASH_VERSINFO: readonly variable
/srv/startup_environment.sh: line 33: EUID: readonly variable
/srv/startup_environment.sh: line 148: PPID: readonly variable
/srv/startup_environment.sh: line 156: SHELLOPTS: readonly variable
/srv/startup_environment.sh: line 173: UID: readonly variable
/srv/startup_environment.sh: line 203: syntax error near unexpected token `('
/srv/startup_environment.sh: line 203: `export probe_cvmfs_repos () '
gfal-rm error: 112 (Host is down) - Result Could not connect to server after 1 attempts
ClassName : None
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
ClassInstance : None
FileName : /srv/job/WMCore.zip/WMCore/Storage/StageOutError.py
LineNumber : 32
ErrorNr : 0
Command : #!/bin/bash
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-copy -t 2400 -T 2400 -p -v --abort-on-failure file:///srv/job/WMTaskSpace/logArch1/logArchive.tar.gz https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz'
EXIT_STATUS=$?
echo "gfal-copy exit status: $EXIT_STATUS"
if [[ $EXIT_STATUS != 0 ]]; then
echo "ERROR: gfal-copy exited with $EXIT_STATUS"
echo "Cleaning up failed file:"
env -i X509_USER_PROXY=$X509_USER_PROXY JOBSTARTDIR=$JOBSTARTDIR bash -c '. $JOBSTARTDIR/startup_environment.sh; date; gfal-rm -t 600 https://vc-data-bridge.cern.ch/myfed/cms-output/store/unmerged/logs/prod/2024/8/15/ireid_TC_Backfill_IDR_CMS_Multi_240811_202049_5978/BPH_RunIISummer20UL18GEN_00262_0/0003/0/31d91523-0926-464c-bf63-fccaf9921303-550-0-logArchive.tar.gz '
fi
exit $EXIT_STATUS
ExitCode : 112
ErrorCode : 60311
ErrorType : GeneralStageOutFailure
Traceback:
<@---------- WMException End ----------@>
Bye. |
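The dump above is long, but the lines that carry the failure are the two `gfal-copy error:` / `gfal-rm error:` lines. As a rough aid for triaging such dumps, here is a minimal Python sketch that pulls those lines out. The function name `gfal_errors` and the regular expression are illustrative assumptions based only on the line format seen in this post; they are not part of WMCore or gfal2.

```python
import re

# Assumed line format, taken from the log in this post, e.g.:
#   gfal-copy error: 112 (Host is down) - DESTINATION OVERWRITE Result ...
GFAL_ERROR = re.compile(
    r"^(?P<tool>gfal-[a-z]+) error: (?P<code>\d+) \((?P<reason>[^)]*)\) - (?P<detail>.*)$"
)


def gfal_errors(log_text):
    """Return (tool, code, reason, detail) for every gfal error line found."""
    found = []
    for line in log_text.splitlines():
        match = GFAL_ERROR.match(line.strip())
        if match:
            found.append(
                (match["tool"], int(match["code"]), match["reason"], match["detail"])
            )
    return found


# Three lines copied from the log above: one noise line, two error lines.
sample = """\
/srv/startup_environment.sh: line 3: BASHOPTS: readonly variable
gfal-copy error: 112 (Host is down) - DESTINATION OVERWRITE Result Could not connect to server after 1 attempts
gfal-rm error: 112 (Host is down) - Result Could not connect to server after 1 attempts
"""

for tool, code, reason, detail in gfal_errors(sample):
    print(f"{tool}: {code} ({reason}) {detail}")
```

Both the copy and the clean-up step report the same code, which suggests the destination server was unreachable for the whole attempt rather than rejecting one particular request.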
|
Joined: 24 May 23 Posts: 56 Credit: 5,836,980 RAC: 67,634 |
It looks like they're running again, now. Bye. |
|
Joined: 29 Aug 05 Posts: 1119 Credit: 10,397,050 RAC: 19,398 |
|
Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1242 Credit: 85,094,474 RAC: 136,241 |
I see we had a few of these today, but maybe it has been updated (I stopped running them myself):
https://lhcathome.cern.ch/lhcathome/result.php?resultid=413805611
https://lhcathome.cern.ch/lhcathome/result.php?resultid=413485212
VM Completion Message: Could not connect to all required network services |
|
Joined: 2 May 07 Posts: 2278 Credit: 178,775,457 RAC: 1,891 |
Overnight, 5 jobs inside the CMS task finished. The 6th finished at 6:28 UTC; after that the task did nothing.
2024-10-17 08:28:54,195:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2024-10-17 08:28:54,195:INFO:LogArchive:Success job! Not saving its logs to CERN EOS recent area.
2024-10-17 08:28:54,196:INFO:LogArchive:Steps.Executors.LogArchive.post called
2024-10-17 08:28:54,197:INFO:ExecuteMaster:StepName: logArch1, StepType: LogArchive, with result: 0
2024-10-17 08:28:54,197:INFO:Watchdog:MonitorThread: JobEnded
2024-10-17 08:28:54,197:INFO:Watchdog:MonitorState: Shutdown called
2024-10-17 08:28:54,197:INFO:Startup:Completing task at directory: /srv/job/WMTaskSpace
2024-10-17 08:28:54,198:INFO:WMTask:Looking for master report at /srv/job/WMTaskSpace/../../Report.0.pkl
2024-10-17 08:28:54,198:INFO:WMTask: found it!
2024-10-17 08:28:54,198:INFO:WMTask:Looking for a taskStep report at /srv/job/WMTaskSpace/cmsRun1/Report.pkl
2024-10-17 08:28:54,198:INFO:WMTask: found it!
2024-10-17 08:28:54,199:INFO:WMTask:Looking for a taskStep report at /srv/job/WMTaskSpace/stageOut1/Report.pkl
2024-10-17 08:28:54,199:INFO:WMTask: found it!
2024-10-17 08:28:54,199:INFO:WMTask:Looking for a taskStep report at /srv/job/WMTaskSpace/logArch1/Report.pkl
2024-10-17 08:28:54,200:INFO:WMTask: found it!
2024-10-17 08:28:54,200:INFO:Startup:Shutting down monitor
Is this task waiting for the 18-hour shutdown?
My ISP had a disconnect at 06:20 UTC. That was the reason, sorry. |
|
Joined: 2 May 07 Posts: 2278 Credit: 178,775,457 RAC: 1,891 |
Finished at 8:36 UTC. This evening I will start a new CMS task to run overnight. |
|
Joined: 2 May 07 Posts: 2278 Credit: 178,775,457 RAC: 1,891 |
Finished at 8:36 UTC. Eight jobs inside the task finished successfully! Run time 14 hours 0 min 0 sec. CPU time 2 days 0 hours 32 min 26 sec. |
|
Joined: 3 Nov 12 Posts: 76 Credit: 179,153,756 RAC: 282,886 |
For now all WUs fail with [ERROR] Could not connect to vocms0840.cern.ch on port 9618 |
Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1242 Credit: 85,094,474 RAC: 136,241 |
Same here... if I hadn't looked just now I would have lost hundreds of these CMS tasks. But I haven't checked them all yet, so I might have.
VM Completion Message: Could not connect to all required network services
(same thing at -dev, btw) |
|
Joined: 7 Aug 11 Posts: 119 Credit: 31,246,111 RAC: 67,283 |
95 down with this. Here's the first; whatever happened to the server happened near the end of this unit:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=416589013
2024-11-16 20:16:38 (1131410): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt.
2024-11-16 20:16:38 (1131410): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory
2024-11-16 20:16:38 (1131410): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1)
2024-11-16 20:16:38 (1131410): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 137.138.156.85:9618 (IOD #1) EID 8
2024-11-16 20:16:38 (1131410): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT TIMEOUT for EID 8 [137.138.156.85:9618]
2024-11-16 20:16:38 (1131410): Guest Log: Ncat: Connection timed out.
2024-11-16 20:16:38 (1131410): Guest Log: [ERROR] Could not connect to vocms0840.cern.ch on port 9618
2024-11-16 20:16:38 (1131410): Guest Log: [INFO] Testing connection to WMAgent
2024-11-16 20:16:39 (1131410): Guest Log: [INFO] Testing connection to EOSCMS
2024-11-16 20:16:39 (1131410): Guest Log: [INFO] Testing connection to CMS-Factory
2024-11-16 20:16:40 (1131410): Guest Log: [INFO] Testing connection to CMS-Frontier
2024-11-16 20:16:40 (1131410): Guest Log: [INFO] Testing connection to Frontier
2024-11-16 20:16:41 (1131410): Guest Log: [DEBUG] Check your firewall and your network load
2024-11-16 20:16:41 (1131410): Guest Log: [ERROR] Could not connect to all required network services
2024-11-16 20:16:41 (1131410): Guest Log: [DEBUG] Volunteer: Dark Angel (268818)
2024-11-16 20:16:41 (1131410): Guest Log: [INFO] Shutting Down.
2024-11-16 20:17:11 (1131410): VM Completion File Detected.
2024-11-16 20:17:11 (1131410): VM Completion Message: Could not connect to all required network services |
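With dozens of failed tasks, it can help to check quickly whether they all name the same endpoint. The following is a minimal Python sketch that scans a task's stderr text for `[ERROR]` Guest Log lines mentioning a host and port. The function name `unreachable_endpoints` and both regular expressions are assumptions modelled only on the line format shown in this post; they are not part of BOINC or the CMS VM scripts.

```python
import re

# Assumed BOINC stderr line format, as seen in the post above:
#   2024-11-16 20:16:38 (1131410): Guest Log: [ERROR] Could not connect to ...
GUEST_LINE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"Guest Log: \[(?P<level>[A-Z]+)\] (?P<text>.*)$"
)
UNREACHABLE = re.compile(r"Could not connect to (?P<host>\S+) on port (?P<port>\d+)")


def unreachable_endpoints(stderr_text):
    """Return (timestamp, host, port) for each [ERROR] line naming a host and port."""
    results = []
    for line in stderr_text.splitlines():
        guest = GUEST_LINE.match(line.strip())
        if not guest or guest["level"] != "ERROR":
            continue
        target = UNREACHABLE.search(guest["text"])
        if target:
            results.append((guest["stamp"], target["host"], int(target["port"])))
    return results


# Lines copied from the log in this post.
sample = """\
2024-11-16 20:16:38 (1131410): Guest Log: Ncat: Connection timed out.
2024-11-16 20:16:38 (1131410): Guest Log: [ERROR] Could not connect to vocms0840.cern.ch on port 9618
2024-11-16 20:16:38 (1131410): Guest Log: [INFO] Testing connection to WMAgent
2024-11-16 20:16:41 (1131410): Guest Log: [ERROR] Could not connect to all required network services
"""

for stamp, host, port in unreachable_endpoints(sample):
    print(stamp, host, port)
```

The generic "Could not connect to all required network services" line is skipped on purpose because it names no host; only the specific endpoint is reported.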
|
Joined: 18 Dec 15 Posts: 1923 Credit: 149,489,868 RAC: 143,780 |
"For now all WUs fail with"
Too bad that I didn't notice it until this morning: thousands of failing tasks on my 20 hosts all night long :-( |
|
Joined: 2 May 07 Posts: 2278 Credit: 178,775,457 RAC: 1,891 |
2024-11-17 10:53:17 (29336): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-17 10:53:17 (29336): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt.
2024-11-17 10:53:17 (29336): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory
2024-11-17 10:53:17 (29336): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1)
2024-11-17 10:53:17 (29336): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 137.138.156.85:9618 (IOD #1) EID 8
2024-11-17 10:53:17 (29336): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT TIMEOUT for EID 8 [137.138.156.85:9618]
Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1242 Credit: 85,094,474 RAC: 136,241 |
|
|
Joined: 29 Aug 05 Posts: 1119 Credit: 10,397,050 RAC: 19,398 |
2024-11-17 10:53:17 (29336): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
I'm not sure exactly what the problem is. vocms0840 is to be taken out of service, replaced by a newer AlmaLinux9 VM, but we've been waiting for confirmation that the updated scripts are ready for the changeover. Either the machine has been taken out of service without our being notified, or it's developed a communication problem. |
Guy Joined: 9 Feb 08 Posts: 61 Credit: 2,161,811 RAC: 3,668 |
My computer: OpenSuSE Tumbleweed [6.11.8-1-default|libc 2.40] i7-4790k, 32 GB RAM, 2 TB M.2 SSD, nVidia RTX 2060 (driver: 550.99 OpenCL: 3.0). Virtualbox (7.1.4_SUSEr165100) BOINC version 8.0.4 Full spec: 10860321
My PC looks like this: CMS tasks are failing to start. For example:

| Task | Work unit | Computer | Sent | Time reported or deadline | Status | Run time | CPU time | Application |
|---|---|---|---|---|---|---|---|---|
| 416651829 | 227670244 | 10860321 | 17 Nov 2024, 19:45:47 UTC | 20:07:00 UTC | Error while computing | 167.04 | 24.57 | CMS Simulation v70.30 (vbox64_mt_mcore_cms) x86_64-pc-linux-gnu |
| 416648976 | 227667391 | 10860321 | 17 Nov 2024, 18:29:17 UTC | 19:45:47 UTC | Error while computing | 161.43 | 22.69 | CMS Simulation v70.30 (vbox64_mt_mcore_cms) x86_64-pc-linux-gnu |
| 416646675 | 227665093 | 10860321 | 17 Nov 2024, 17:19:14 UTC | 18:23:04 UTC | Error while computing | 156.96 | 23.88 | CMS Simulation v70.30 (vbox64_mt_mcore_cms) x86_64-pc-linux-gnu |

stderr for above tasks: 416651829 416648976 416646675
The following error occurs towards the end of each of the above stderr outputs:
...
2024-11-17 20:03:37 (14664): Guest Log: [INFO] Testing connection to HTCondor
2024-11-17 20:03:53 (14664): Guest Log: [DEBUG] Status run 1 of up to 3: 1
2024-11-17 20:04:14 (14664): Guest Log: [DEBUG] Status run 2 of up to 3: 1
2024-11-17 20:04:39 (14664): Guest Log: [DEBUG] Status run 3 of up to 3: 1
2024-11-17 20:04:39 (14664): Guest Log: [DEBUG] run 1
2024-11-17 20:04:39 (14664): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-17 20:04:39 (14664): Guest Log: Ncat: Connection timed out.
2024-11-17 20:04:39 (14664): Guest Log: run 2
2024-11-17 20:04:39 (14664): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-17 20:04:39 (14664): Guest Log: Ncat: Connection timed out.
2024-11-17 20:04:39 (14664): Guest Log: run 3
2024-11-17 20:04:39 (14664): Guest Log: Ncat: Version 7.50 ( https://nmap.org/ncat )
2024-11-17 20:04:39 (14664): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt.
2024-11-17 20:04:39 (14664): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory
2024-11-17 20:04:39 (14664): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1)
2024-11-17 20:04:39 (14664): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 137.138.156.85:9618 (IOD #1) EID 8
2024-11-17 20:04:39 (14664): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT TIMEOUT for EID 8 [137.138.156.85:9618]
2024-11-17 20:04:39 (14664): Guest Log: Ncat: Connection timed out.
2024-11-17 20:04:39 (14664): Guest Log: [ERROR] Could not connect to vocms0840.cern.ch on port 9618
2024-11-17 20:04:39 (14664): Guest Log: [INFO] Testing connection to WMAgent
2024-11-17 20:04:39 (14664): Guest Log: [INFO] Testing connection to EOSCMS
2024-11-17 20:04:40 (14664): Guest Log: [INFO] Testing connection to CMS-Factory
2024-11-17 20:04:40 (14664): Guest Log: [INFO] Testing connection to CMS-Frontier
2024-11-17 20:04:40 (14664): Guest Log: [INFO] Testing connection to Frontier
2024-11-17 20:04:40 (14664): Guest Log: [DEBUG] Check your firewall and your network load
2024-11-17 20:04:40 (14664): Guest Log: [ERROR] Could not connect to all required network services
...
Any help with this would be welcome. Thanks.
|
|
Joined: 31 Oct 16 Posts: 2 Credit: 29,949,137 RAC: 67,745 |
Hej and Hello. I also got these errors today for all CMS tasks. They run for a few minutes and then fail:
...
2024-11-17 21:03:46 (476): Guest Log: NCAT DEBUG: Using system default trusted CA certificates and those in /usr/share/ncat/ca-bundle.crt.
2024-11-17 21:03:46 (476): Guest Log: NCAT DEBUG: Unable to load trusted CA certificates from /usr/share/ncat/ca-bundle.crt: error:02001002:system library:fopen:No such file or directory
2024-11-17 21:03:46 (476): Guest Log: libnsock nsi_new2(): nsi_new (IOD #1)
2024-11-17 21:03:46 (476): Guest Log: libnsock nsock_connect_tcp(): TCP connection requested to 137.138.156.85:9618 (IOD #1) EID 8
2024-11-17 21:03:46 (476): Guest Log: libnsock nsock_trace_handler_callback(): Callback: CONNECT ERROR [Connection refused (111)] for EID 8 [137.138.156.85:9618]
2024-11-17 21:03:46 (476): Guest Log: Ncat: Connection refused.
2024-11-17 21:03:46 (476): Guest Log: [ERROR] Could not connect to vocms0840.cern.ch on port 9618
THX Sputnik |
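Note that this log shows "Connection refused (111)", while the earlier posts show "CONNECT TIMEOUT" for the same address and port. In general, a timeout means no reply came back at all (often because packets are being dropped somewhere on the way), whereas a refusal means something answered and actively rejected the connection. A minimal Python sketch for telling the two apart from a volunteer's own machine follows; the function name `probe` and the 5-second default timeout are illustrative assumptions and this is not the check the CMS VM itself runs (that uses Ncat, as the logs show).

```python
import socket


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and classify the outcome as a short string."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something answered and rejected the connection (errno 111 on Linux).
        return "refused"
    except socket.timeout:
        # No answer within the timeout.
        return "timeout"
    except OSError as exc:
        # DNS failure, unreachable network, and similar.
        return f"error: {exc}"
```

For example, `probe("vocms0840.cern.ch", 9618)` would test the host and port named in the error messages in this thread. Either a "timeout" or a "refused" result from several unrelated networks points at the server side rather than at an individual volunteer's firewall.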
|
Joined: 2 May 07 Posts: 2278 Credit: 178,775,457 RAC: 1,891 |
Ivan has the answer, one message before yours. |
|
Joined: 29 Aug 05 Posts: 1119 Credit: 10,397,050 RAC: 19,398 |
The definitive answer is that the firewall to the HTCondor machine was closed on Saturday, before the Submission Infrastructure team were able to activate its substitute (there seems to have been at least one mix-up with service tickets being misdirected). Sorry about this, it's one of the disadvantages of such a long "supply chain" where people responsible for one part don't necessarily know who is dependent on it downstream <:frowny face:>. |
Guy Joined: 9 Feb 08 Posts: 61 Credit: 2,161,811 RAC: 3,668 |
Thanks Ivan. I do appreciate you reconfirming this! The definitive answer is... |
|
Joined: 18 Dec 15 Posts: 1923 Credit: 149,489,868 RAC: 143,780 |
"The definitive answer is that the firewall to the HTCondor machine was closed on Saturday, before the Submission Infrastructure team were able to activate its substitute (there seems to have been at least one mix-up with service tickets being misdirected). Sorry about this, it's one of the disadvantages of such a long "supply chain" where people responsible for one part don't necessarily know who is dependent on it downstream <:frowny face:>."
Thanks, Ivan, for the information. Could well be that it will take a while until everything works again. So: what about stopping the download queue until everything is straightened out? |
©2025 CERN