Message boards : CMS Application : Result stage-out failures
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
Had a CMS VM (https://lhcathome.cern.ch/lhcathome/result.php?resultid=187800772) that didn't do any real work for 20 minutes, and there is a rising red peak on the dashboard graphs. In addition, my firewall rejects connections to eoscms-srv-m1.cern.ch (137.138.19.8), TCP port 1094, which were most likely initiated by that CMS VM. Either this is a configuration error or the port should be documented in the FAQ.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
Addendum. I hope this snippet from a currently running VM helps:

<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Command exited non-zero, ExitCode:151
Output: stdout:
  Local File Size is: 172086
  Remote File Size is:
  ERROR: Size Mismatch between local and SE
stderr:
  Run: [FATAL] Connection error
  [FATAL] Connection error
  [FATAL] Connection error
ErrorCode : 60311
ModuleName : WMCore.Storage.StageOutError
MethodName : __init__
ErrorType : GeneralStageOutFailure
ClassInstance : None
FileName : /var/lib/condor/execute/dir_9859/job/WMCore.zip/WMCore/Storage/StageOutError.py
ClassName : None
Command :
  LOCAL_SIZE=`stat -c%s "/var/lib/condor/execute/dir_9859/job/WMTaskSpace/logArch1/logArchive.tar.gz"`
  echo "Local File Size is: $LOCAL_SIZE"
  env XRD_WRITERECOVERY=0 xrdcp --force --nopbar "/var/lib/condor/execute/dir_9859/job/WMTaskSpace/logArch1/logArchive.tar.gz" "root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/fanzago_MonteCarlo_eff_RFF_CMS_Home_180417_115221_6781/Production/vocms0267.cern.ch-1877-0-log.tar.gz"
  REMOTE_SIZE=`xrdfs 'eoscms.cern.ch' stat '/eos/cms/store/logs/prod/recent/fanzago_MonteCarlo_eff_RFF_CMS_Home_180417_115221_6781/Production/vocms0267.cern.ch-1877-0-log.tar.gz' | grep Size | sed -r 's/.*Size:[ ]*([0-9]+).*/\1/'`
  echo "Remote File Size is: $REMOTE_SIZE"
  if [ $REMOTE_SIZE ] && [ $LOCAL_SIZE == $REMOTE_SIZE ]; then exit 0; else echo "ERROR: Size Mismatch between local and SE"; xrdfs 'eoscms.cern.ch' rm /eos/cms/store/logs/prod/recent/fanzago_MonteCarlo_eff_RFF_CMS_Home_180417_115221_6781/Production/vocms0267.cern.ch-1877-0-log.tar.gz ; exit 60311 ; fi
LineNumber : 32
ErrorNr : 0
ExitCode : 151
Traceback:
  File "/var/lib/condor/execute/dir_9859/job/WMCore.zip/WMCore/Storage/StageOutImpl.py", line 216, in __call__
    self.executeCommand(command)
  File "/var/lib/condor/execute/dir_9859/job/WMCore.zip/WMCore/Storage/StageOutImpl.py", line 94, in executeCommand
    raise StageOutError(msg, Command=command, ExitCode=exitCode)
<@---------- WMException End ----------@>
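For anyone reading the exception above: the embedded shell command uploads the log archive with xrdcp, then compares the local file size against the `Size:` field that `xrdfs stat` reports, and exits 60311 on a mismatch. The logic can be sketched in Python (function names here are illustrative, not WMCore API):

```python
import re

STAGEOUT_SIZE_MISMATCH = 60311  # exit code seen in the WMException above

def remote_size_from_stat(stat_output: str):
    """Parse the 'Size: N' field from `xrdfs <host> stat <path>` output,
    mirroring the sed expression in the stage-out command."""
    m = re.search(r"Size:\s*([0-9]+)", stat_output)
    return int(m.group(1)) if m else None

def verify_stageout(local_size: int, stat_output: str) -> int:
    """Return 0 when local and remote sizes match, 60311 otherwise,
    as the shell snippet in the exception does."""
    remote = remote_size_from_stat(stat_output)
    if remote is not None and remote == local_size:
        return 0
    return STAGEOUT_SIZE_MISMATCH
```

In the failing job the local size was 172086 while the remote size came back empty (the xrdfs connection failed), which is exactly the "Size Mismatch" branch that produced exit code 60311.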
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
Starting with this WU, https://lhcathome.cern.ch/lhcathome/result.php?resultid=187798611, all of my WUs show very long idle periods, always at the end of a calculation phase during stage-out. The idle periods add up to a couple of hours per WU and affect all of my hosts. This should be investigated.
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
Ah! "This is a new feature of the latest release. Every single job uploads a tarball - with logs - to EOS /store/logs/prod/recent; however, it doesn't cause a job to fail if it fails to write to EOS."

There is a flag to stop this behaviour, but it's untested as yet. I've tried to set it in the next batch. If the submission looks good, I'll abort the current batch so we can see whether the behaviour changes.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
Thank you, Ivan.

"... however, it doesn't cause a job to fail if it fails to write to EOS."

Not a failure, but a long delay at the end of each subtask. Are the long delays caused by the firewall drops (see next paragraph)?

"Ah! This is a new feature of the latest release. Every single job uploads a tarball - with logs - to EOS /store/logs/prod/recent ..."

So the connections to eoscms-srv-m1.cern.ch (137.138.19.8), TCP port 1094, that I mentioned here are now required to upload that tarball? If so, why wasn't that (or any other new requirement) announced and documented in the firewall FAQ? (I know, not your responsibility, Ivan.)

An additional question: is it now possible to access /cvmfs/cms.cern.ch via a local squid, or is that still unsolved?
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
Probably a combination of that and the transfer time-out, tho' I'm told the time-out is 5 minutes (there may be several retries, though...). Port 1094 does seem to be the default for the xrootd protocol, but I would be surprised if you could write to eoscms even if the port were open, due to a lack of authentication. This new behaviour came out of CMS IT, not LHC@Home, so it's not surprising they didn't think to tell us about it. My first attempt at blocking the log storage failed, so I'm hacking my way through Python scripts while waiting for a more detailed reply from the WMAgent experts on how exactly to specify it. You'll need someone with more knowledge than me to answer the local-squid question.
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
I've had another go at suppressing the log writing to eoscms, and this time the JSON file does have my change. I'll let this batch run its course, as I don't want things to go pear-shaped on a Friday night... I think it will run until sometime Sunday, but it depends on how many cores crunch jobs between now and then.
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
OK, something funny was happening with the last batch; I believe jobs were failing for some reason and being requeued, so the Dashboard plot showed a large proportion of failed jobs. So I aborted it to let the new batch start. It's early days yet, but it seems to be behaving better. Any feedback on whether I've suppressed the log-writing to eoscms?
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
"Any feedback on whether I've suppressed the log-writing to eoscms?"

Sorry, not yet, as my hosts got the last CMS task very early this morning (most likely from the old batch) and did Theory and LHCb during the day. Is it possible to see whether a fresh CMS WU is from the old batch or the new batch? Do you need any parts of the log that would make it easier for you?
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
"Any feedback on whether I've suppressed the log-writing to eoscms?"

Well, the old batch would have had "180417_120223_3057" in the filename string; the new one is "180420_181002_7017". I did notice on my home PC that stageouts to eoscms had an authorization(sic) failure. Unfortunately that task ended just a while ago, but one job has run in the current task (trying to get X11 windows here from my work computers is too slow...). I noticed this:

INFO:root:StageOutMgr::__init__()
INFO:root:StageOutMgr::__init__(): Override: {'eos-lfn-prefix': ''}
INFO:root:=======StageOut Override: These are not the parameters you are looking for
INFO:root:==== Stageout configuration start ====
INFO:root:Local Stage Out Implementation to be used is: gfal2
Local Stage Out PNN to be used is T3_CH_CMSAtHome
Local Stage Out Catalog to be used is trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=https
Trivial File Catalog has been loaded:
lfn-to-pfn: protocol=https path-match-re=/+store/(.*) result=https://data-bridge.cern.ch/myfed/cms-output/$1
lfn-to-pfn: protocol=srmv2 path-match-re=/+store/(.*) result=https://data-bridge.cern.ch/myfed/cms-output/$1
lfn-to-pfn: protocol=http path-match-re=/+store/(.*) result=http://data-bridge.cern.ch/myfed/cms-output/$1
There are 0 fallback stage out definitions.
INFO:root:==== Stageout configuration finish ====
INFO:root:==>Working on file: /store/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
INFO:root:===> Attempting Local Stage Out.
INFO:root:LFN to PFN match made:
LFN: /store/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
PFN: https://data-bridge.cern.ch/myfed/cms-output/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
INFO:root:Creating output directory...
INFO:root:Running the stage out...
INFO:root:Command exited with status: 0
Output message: stdout:
Sat Apr 21 19:14:04 BST 2018
Copying 164587 bytes file:///var/lib/condor/execute/dir_4266/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-output/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
Bandwidth: 14200
gfal-copy exit status: 0

So it did try to write the log file to data-bridge rather than eoscms.

...and in https://data-bridge.cern.ch/myfed/cms-output/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/ I find:

Mode       Links UID GID Size   Modified                      Name
-rwxr-xr-x 0     0   0   160.7K Sat, 21 Apr 2018 18:14:15 GMT [Metalink] 5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz

I think I might have found the magic sauce to overcome the latest hurdle put in our path. :-/
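The "LFN to PFN match made" step in that log applies the trivial-file-catalog rule printed a few lines earlier: a regex match on the logical file name and a substitution into the data-bridge URL. A minimal Python sketch of that mapping, using the exact rule from the log (this is an illustration, not WMCore code):

```python
import re

# lfn-to-pfn rule as printed in the stage-out log:
#   path-match-re=/+store/(.*)  result=https://data-bridge.cern.ch/myfed/cms-output/$1
PATH_MATCH_RE = re.compile(r"/+store/(.*)")
RESULT_TEMPLATE = "https://data-bridge.cern.ch/myfed/cms-output/{}"

def lfn_to_pfn(lfn: str):
    """Apply the trivial-file-catalog rule: return the PFN for an LFN
    under /store, or None when the rule does not match."""
    m = PATH_MATCH_RE.match(lfn)
    return RESULT_TEMPLATE.format(m.group(1)) if m else None
```

Feeding it the logArchive LFN from the log reproduces the data-bridge PFN shown there, which is why the upload went to data-bridge instead of eoscms once the eos-lfn-prefix override was in place.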
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
"Not a fail but a long delay at the end of each subtask."

As of tonight, the long delay has gone away. Thanks to Ivan the knight.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,924,935 RAC: 137,667
The new CMS batch again tries to contact CERN via TCP port 1094 during stage-out, this time to eoscms-srv-b2.cern.ch (188.184.36.173). If I understand Ivan's comments starting here correctly, this should not happen from outside the CERN network. The communication is blocked and reported by my firewall, as port and target system are not (yet?) mentioned in the FAQs.
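For volunteers who want to check whether their firewall passes the xrootd port at all, a quick TCP reachability test can be sketched in Python. The host and port are taken from the posts above; this only tests that a TCP connection can be opened, not xrootd authentication or authorization:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (the xrootd endpoint mentioned in this thread):
# tcp_reachable("eoscms-srv-b2.cern.ch", 1094)
```

A True result only means the firewall allows the connection; as Ivan notes, writing to eoscms would still fail without authentication.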
Joined: 24 Oct 04 Posts: 1114 Credit: 49,501,728 RAC: 4,157
Grrrrrrrr..... once again.....

https://lhcathome.cern.ch/lhcathome/result.php?resultid=195573783

Only 3 Valids.

Volunteer Mad Scientist For Life