Message boards : CMS Application : Result stage-out failures
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
Had a CMS VM (https://lhcathome.cern.ch/lhcathome/result.php?resultid=187800772) that didn't do any real work for 20 minutes, and there is a rising red peak on the dashboard graphs. In addition, my firewall rejects connections to eoscms-srv-m1.cern.ch (137.138.19.8), TCP port 1094, which were most likely initiated by that CMS VM. Either this is a configuration error or the port should be documented in the FAQ.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
Addendum. I hope this snippet from a currently running VM helps:

<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Command exited non-zero, ExitCode:151
Output: stdout:
  Local File Size is: 172086
  Remote File Size is:
  ERROR: Size Mismatch between local and SE
stderr:
  Run: [FATAL] Connection error
  [FATAL] Connection error
  [FATAL] Connection error
ErrorCode : 60311
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
ErrorType : GeneralStageOutFailure
ClassInstance : None
FileName : /var/lib/condor/execute/dir_9859/job/WMCore.zip/WMCore/Storage/StageOutError.py
ClassName : None
Command :
  LOCAL_SIZE=`stat -c%s "/var/lib/condor/execute/dir_9859/job/WMTaskSpace/logArch1/logArchive.tar.gz"`
  echo "Local File Size is: $LOCAL_SIZE"
  env XRD_WRITERECOVERY=0 xrdcp --force --nopbar "/var/lib/condor/execute/dir_9859/job/WMTaskSpace/logArch1/logArchive.tar.gz" "root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/fanzago_MonteCarlo_eff_RFF_CMS_Home_180417_115221_6781/Production/vocms0267.cern.ch-1877-0-log.tar.gz"
  REMOTE_SIZE=`xrdfs 'eoscms.cern.ch' stat '/eos/cms/store/logs/prod/recent/fanzago_MonteCarlo_eff_RFF_CMS_Home_180417_115221_6781/Production/vocms0267.cern.ch-1877-0-log.tar.gz' | grep Size | sed -r 's/.*Size:[ ]*([0-9]+).*/\1/'`
  echo "Remote File Size is: $REMOTE_SIZE"
  if [ $REMOTE_SIZE ] && [ $LOCAL_SIZE == $REMOTE_SIZE ]; then exit 0; else echo "ERROR: Size Mismatch between local and SE"; xrdfs 'eoscms.cern.ch' rm /eos/cms/store/logs/prod/recent/fanzago_MonteCarlo_eff_RFF_CMS_Home_180417_115221_6781/Production/vocms0267.cern.ch-1877-0-log.tar.gz ; exit 60311 ; fi
LineNumber : 32
ErrorNr : 0
ExitCode : 151
Traceback:
  File "/var/lib/condor/execute/dir_9859/job/WMCore.zip/WMCore/Storage/StageOutImpl.py", line 216, in __call__
    self.executeCommand(command)
  File "/var/lib/condor/execute/dir_9859/job/WMCore.zip/WMCore/Storage/StageOutImpl.py", line 94, in executeCommand
    raise StageOutError(msg, Command=command, ExitCode=exitCode)
<@---------- WMException End ----------@>
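For anyone reading the exception above: the embedded shell command uploads the log archive with xrdcp, then compares the local file size against the `Size:` field that `xrdfs stat` reports, and exits 60311 on a mismatch. The logic can be sketched in Python (function names here are illustrative, not WMCore API):

```python
import re

STAGEOUT_SIZE_MISMATCH = 60311  # exit code seen in the WMException above

def remote_size_from_stat(stat_output: str):
    """Parse the 'Size: N' field from `xrdfs <host> stat <path>` output,
    mirroring the sed expression in the stage-out command."""
    m = re.search(r"Size:\s*([0-9]+)", stat_output)
    return int(m.group(1)) if m else None

def verify_stageout(local_size: int, stat_output: str) -> int:
    """Return 0 when local and remote sizes match, 60311 otherwise,
    as the shell snippet in the exception does."""
    remote = remote_size_from_stat(stat_output)
    if remote is not None and remote == local_size:
        return 0
    return STAGEOUT_SIZE_MISMATCH
```

In the failing job the local size was 172086 while the remote size came back empty (the xrdfs connection failed), which is exactly the "Size Mismatch" branch that produced exit code 60311.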
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
Starting with this WU, https://lhcathome.cern.ch/lhcathome/result.php?resultid=187798611, all of my WUs show very long idle periods, always at the end of a calculation phase during stage-out. The idle periods add up to a couple of hours per WU and affect all of my hosts. This should be investigated.
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
Ah! "This is a new feature of the latest release. Every single job uploads a tarball - with logs - to EOS /store/logs/prod/recent; however, it doesn't cause a job to fail if it fails to write to EOS."

There is a flag to stop this behaviour, but it's untested as yet. I've tried to set it in the next batch. If the submission looks good, I'll abort the current batch so we can see whether the behaviour changes.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
Thank you, Ivan.

"... however, it doesn't cause a job to fail if it fails to write to EOS."

Not a failure, but a long delay at the end of each subtask. Are the long delays caused by the firewall drops (see next paragraph)?

"Ah! This is a new feature of the latest release. Every single job uploads a tarball - with logs - to EOS /store/logs/prod/recent ..."

So the connections to eoscms-srv-m1.cern.ch (137.138.19.8), TCP port 1094, that I mentioned here are now required to upload that tarball? If so, why wasn't that (or any other new requirement) announced and documented in the firewall FAQ? (I know, not your responsibility, Ivan.)

An additional question: is it now possible to access /cvmfs/cms.cern.ch via a local squid, or is that still unsolved?
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
Probably a combination of that and the transfer time-out, tho' I'm told the time-out is 5 minutes (there may be several retries, though...). Port 1094 does seem to be the default for the xrootd protocol, but I would be surprised if you could write to eoscms even if the port were open, due to a lack of authentication. This new behaviour came out of CMS IT, not LHC@Home, so it's not surprising they didn't think to tell us about it. My first attempt at blocking the log storage failed, so I'm hacking my way through Python scripts while waiting for a more detailed reply from the WMAgent experts on how exactly to specify it. You'll need someone with more knowledge than me to answer the local-squid question.
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
I've had another go at suppressing the log writing to eoscms, and this time the JSON file does have my change. I'll let this batch run its course, as I don't want things to go pear-shaped on a Friday night... I think it will run until sometime Sunday, but it depends on how many cores crunch jobs between now and then.
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
OK, something funny was happening with the last batch; I believe jobs were failing for some reason and being requeued, so the Dashboard plot showed a large proportion of failed jobs. So I aborted it to let the new batch start. It's early days yet, but it seems to be behaving better. Any feedback on whether I've suppressed the log-writing to eoscms?
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
"Any feedback on whether I've suppressed the log-writing to eoscms?"

Sorry, not yet, as my hosts got the last CMS task very early this morning (most likely from the old batch) and did Theory and LHCb during the day. Is it possible to see whether a fresh CMS WU is from the old batch or the new batch? Do you need any parts of the log that would make it easier for you?
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
"Any feedback on whether I've suppressed the log-writing to eoscms?"

Well, the old batch would have had "180417_120223_3057" in the filename string; the new one is "180420_181002_7017". I did notice on my home PC that stageouts to eoscms had an authorization(sic) failure. Unfortunately that task ended just a while ago, but one job has run in the current task (trying to get X11 windows here from my work computers is too slow...). I noticed this:

INFO:root:StageOutMgr::__init__()
INFO:root:StageOutMgr::__init__(): Override: {'eos-lfn-prefix': ''}
INFO:root:=======StageOut Override: These are not the parameters you are looking for
INFO:root:==== Stageout configuration start ====
INFO:root:Local Stage Out Implementation to be used is: gfal2
Local Stage Out PNN to be used is T3_CH_CMSAtHome
Local Stage Out Catalog to be used is trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=https
Trivial File Catalog has been loaded:
lfn-to-pfn: protocol=https path-match-re=/+store/(.*) result=https://data-bridge.cern.ch/myfed/cms-output/$1
lfn-to-pfn: protocol=srmv2 path-match-re=/+store/(.*) result=https://data-bridge.cern.ch/myfed/cms-output/$1
lfn-to-pfn: protocol=http path-match-re=/+store/(.*) result=http://data-bridge.cern.ch/myfed/cms-output/$1
There are 0 fallback stage out definitions.
INFO:root:==== Stageout configuration finish ====
INFO:root:==>Working on file: /store/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
INFO:root:===> Attempting Local Stage Out.
INFO:root:LFN to PFN match made:
LFN: /store/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
PFN: https://data-bridge.cern.ch/myfed/cms-output/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
INFO:root:Creating output directory...
INFO:root:Running the stage out...
INFO:root:Command exited with status: 0
Output message: stdout:
Sat Apr 21 19:14:04 BST 2018
Copying 164587 bytes file:///var/lib/condor/execute/dir_4266/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-output/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
Bandwidth: 14200
gfal-copy exit status: 0

So it did try to write the log file to data-bridge rather than eoscms.

...and in https://data-bridge.cern.ch/myfed/cms-output/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/ I find:

Mode       Links UID GID Size   Modified                      Name
-rwxr-xr-x 0     0   0   160.7K Sat, 21 Apr 2018 18:14:15 GMT [Metalink] 5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz

I think I might have found the magic sauce to overcome the latest hurdle put in our path. :-/
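The "LFN to PFN match made" step in that log applies the trivial-file-catalog rule printed a few lines earlier: a regex match on the logical file name and a substitution into the data-bridge URL. A minimal Python sketch of that mapping, using the exact rule from the log (this is an illustration, not WMCore code):

```python
import re

# lfn-to-pfn rule as printed in the stage-out log:
#   path-match-re=/+store/(.*)  result=https://data-bridge.cern.ch/myfed/cms-output/$1
PATH_MATCH_RE = re.compile(r"/+store/(.*)")
RESULT_TEMPLATE = "https://data-bridge.cern.ch/myfed/cms-output/{}"

def lfn_to_pfn(lfn: str):
    """Apply the trivial-file-catalog rule: return the PFN for an LFN
    under /store, or None when the rule does not match."""
    m = PATH_MATCH_RE.match(lfn)
    return RESULT_TEMPLATE.format(m.group(1)) if m else None
```

Feeding it the logArchive LFN from the log reproduces the data-bridge PFN shown there, which is why the upload went to data-bridge instead of eoscms once the eos-lfn-prefix override was in place.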
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
"Not a fail but a long delay at the end of each subtask."

As of tonight, the long delay has gone away. Thanks to Ivan the knight.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
The new CMS batch again tries to contact CERN via TCP port 1094 during stage-out, this time to eoscms-srv-b2.cern.ch (188.184.36.173). If I understand Ivan's comments starting here correctly, this should not happen from outside the CERN network. The communication is blocked and reported by my firewall, as port and target system are not (yet?) mentioned in the FAQs.
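For volunteers who want to check whether their firewall passes the xrootd port at all, a quick TCP reachability test can be sketched in Python. The host and port are taken from the posts above; this only tests that a TCP connection can be opened, not xrootd authentication or authorization:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (the xrootd endpoint mentioned in this thread):
# tcp_reachable("eoscms-srv-b2.cern.ch", 1094)
```

A True result only means the firewall allows the connection; as Ivan notes, writing to eoscms would still fail without authentication.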
Joined: 24 Oct 04 Posts: 1114 Credit: 49,501,728 RAC: 4,157
Grrrrrrrr..... once again.....

https://lhcathome.cern.ch/lhcathome/result.php?resultid=195573783

Only 3 Valids.

Volunteer Mad Scientist For Life