Message boards : CMS Application : Result stage-out failures
Message board moderation

To post messages, you must log in.

Previous · 1 · 2 · 3 · Next

AuthorMessage
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 34966 - Posted: 12 Apr 2018, 2:08:22 UTC - in response to Message 34965.  

OK, I can't see what the current problem is, it's not WMAgent nor DataBridge. But I've been up for 21 hours (the trip back was 9 hrs of that) so I'll look again when I wake up. Set No New Tasks if you are seeing task errors.
ID: 34966 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 34970 - Posted: 12 Apr 2018, 8:08:04 UTC - in response to Message 34966.  

Looks like it was a transient, thank $DEITY.
ID: 34970 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1504
Credit: 82,868,120
RAC: 78,988
Message 35043 - Posted: 19 Apr 2018, 9:25:34 UTC

Had a CMS VM (https://lhcathome.cern.ch/lhcathome/result.php?resultid=187800772) that didn't do any real work for 20 min and there is a rising red peak on the dashboard graphics.

In addition my firewall rejects connections to eoscms-srv-m1.cern.ch (137.138.19.8), TCP port 1094, which were most likely initiated by that CMS VM.
It may either be a configuration error or the port should be documented in the FAQ.
ID: 35043 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1504
Credit: 82,868,120
RAC: 78,988
Message 35044 - Posted: 19 Apr 2018, 9:40:34 UTC - in response to Message 35043.  

Addon.

Hope this snippet from a currently running VM helps:
<@========== WMException Start ==========@>
Exception Class: StageOutError
Message: Command exited non-zero, ExitCode:151
Output: stdout: Local File Size is: 172086
Remote File Size is: 
ERROR: Size Mismatch between local and SE

stderr: Run: [FATAL] Connection error
[FATAL] Connection error
[FATAL] Connection error
 
	ErrorCode : 60311
	ModuleName : WMCore.Storage.StageOutError
	MethodName : __init__
	ErrorType : GeneralStageOutFailure
	ClassInstance : None
	FileName : /var/lib/condor/execute/dir_9859/job/WMCore.zip/WMCore/Storage/StageOutError.py
	ClassName : None
	Command : LOCAL_SIZE=`stat -c%s "/var/lib/condor/execute/dir_9859/job/WMTaskSpace/logArch1/logArchive.tar.gz"`
echo "Local File Size is: $LOCAL_SIZE"
env XRD_WRITERECOVERY=0 xrdcp --force --nopbar  "/var/lib/condor/execute/dir_9859/job/WMTaskSpace/logArch1/logArchive.tar.gz"  "root://eoscms.cern.ch//eos/cms/store/logs/prod/recent/fanzago_MonteCarlo_eff_RFF_CMS_Home_180417_115221_6781/Production/vocms0267.cern.ch-1877-0-log.tar.gz" 
REMOTE_SIZE=`xrdfs 'eoscms.cern.ch' stat '/eos/cms/store/logs/prod/recent/fanzago_MonteCarlo_eff_RFF_CMS_Home_180417_115221_6781/Production/vocms0267.cern.ch-1877-0-log.tar.gz' | grep Size | sed -r 's/.*Size:[ ]*([0-9]+).*/\1/'`
echo "Remote File Size is: $REMOTE_SIZE"
if [ $REMOTE_SIZE ] && [ $LOCAL_SIZE == $REMOTE_SIZE ]; then exit 0; else echo "ERROR: Size Mismatch between local and SE"; xrdfs 'eoscms.cern.ch' rm /eos/cms/store/logs/prod/recent/fanzago_MonteCarlo_eff_RFF_CMS_Home_180417_115221_6781/Production/vocms0267.cern.ch-1877-0-log.tar.gz ; exit 60311 ; fi
	LineNumber : 32
	ErrorNr : 0
	ExitCode : 151

Traceback: 
  File "/var/lib/condor/execute/dir_9859/job/WMCore.zip/WMCore/Storage/StageOutImpl.py", line 216, in __call__
    self.executeCommand(command)

  File "/var/lib/condor/execute/dir_9859/job/WMCore.zip/WMCore/Storage/StageOutImpl.py", line 94, in executeCommand
    raise StageOutError(msg, Command=command, ExitCode=exitCode)

<@---------- WMException End ----------@>
ID: 35044 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1504
Credit: 82,868,120
RAC: 78,988
Message 35049 - Posted: 20 Apr 2018, 5:41:45 UTC

Starting with this WU, https://lhcathome.cern.ch/lhcathome/result.php?resultid=187798611, all of my WUs show very long idle periods.
Always at the end of a calculation phase during stage-out. The idle periods sum up to a couple of hours for each WU and affect all of my hosts.

It should be investigated.
ID: 35049 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 35050 - Posted: 20 Apr 2018, 9:49:37 UTC - in response to Message 35044.  

I don't believe you should be trying to write to eoscms.cern.ch. I'll do a diff between Federica's JSON and mine.
ID: 35050 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 35051 - Posted: 20 Apr 2018, 10:30:48 UTC - in response to Message 35050.  

I don't see any unexpected differences between the two JSON scripts. Especially, there is no explicit mention of eoscms.cern.ch in either script. However, this is mentioned in the Trivial File Catalog -- we use the one for T2_CH_CERN. I need to investigate further.
ID: 35051 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 35052 - Posted: 20 Apr 2018, 12:27:40 UTC - in response to Message 35049.  

Ah! "This is a new feature of the latest release. Every single job uploads a tarball - with logs - to EOS /store/logs/prod/recent, however, it doesn't cause a job to fail if it fails to write to EOS."
ID: 35052 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 35054 - Posted: 20 Apr 2018, 13:31:40 UTC - in response to Message 35052.  

Ah! "This is a new feature of the latest release. Every single job uploads a tarball - with logs - to EOS /store/logs/prod/recent, however, it doesn't cause a job to fail if it fails to write to EOS."

There is a flag to stop this behaviour but it's untested as yet. I've tried to set it in the next batch. If the submission looks good, I'll abort the current batch so we can see if the behaviour changes.
ID: 35054 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1504
Credit: 82,868,120
RAC: 78,988
Message 35055 - Posted: 20 Apr 2018, 13:49:09 UTC - in response to Message 35052.  

Thank you, Ivan.

... however, it doesn't cause a job to fail if it fails to write to EOS."

Not a fail but a long delay at the end of each subtask.
Are the long delays caused by the firewall drops (see next paragraph)?


Ah! "This is a new feature of the latest release. Every single job uploads a tarball - with logs - to EOS /store/logs/prod/recent ...

So, the connections to eoscms-srv-m1.cern.ch (137.138.19.8), TCP port 1094 I mentioned here are now required to upload that tarball?
If so, why wasn't that (or any other new requirement) announced and documented in the firewall FAQ (I know, not your responsibility, Ivan)?


Additional question:
Is it now possible to access /cvmfs/cms.cern.ch via a local squid or is that still unsolved?
ID: 35055 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 35056 - Posted: 20 Apr 2018, 15:51:32 UTC - in response to Message 35055.  

Thank you, Ivan.

... however, it doesn't cause a job to fail if it fails to write to EOS."

Not a fail but a long delay at the end of each subtask.
Are the long delays caused by the firewall drops (see next paragraph)?


Ah! "This is a new feature of the latest release. Every single job uploads a tarball - with logs - to EOS /store/logs/prod/recent ...

So, the connections to eoscms-srv-m1.cern.ch (137.138.19.8), TCP port 1094 I mentioned here are now required to upload that tarball?
If so, why wasn't that (or any other new requirement) announced and documented in the firewall FAQ (I know, not your responsibility, Ivan)?


Additional question:
Is it now possible to access /cvmfs/cms.cern.ch via a local squid or is that still unsolved?

Probably a combination of that and the transfer time-out, tho' I'm told the time-out is 5 minutes (may be several retries, though...).
Port 1094 does seem to be the default for xrootd protocol. but I would be surprised if you could write to eoscms even if the port were open, due to a lack of authentication.
This new behaviour came out of CMS IT, not LHC@Home so it's not surprising they didn't think to tell us about it.
My first effort at blocking the log storage failed so I'm hacking my way through python scripts while waiting for a more-detailed reply from the WMAgent experts on how exactly to specify it.
You'll need someone with more knowledge than me to answer about the local squid.
ID: 35056 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 35058 - Posted: 20 Apr 2018, 16:23:19 UTC - in response to Message 35056.  

I've had another go at suppressing the log writing to eoscms, and this time the JSON file does have my change. I'll let this batch run its course, as I don't want to have things go pear-shaped on a Friday night... I think it will run until sometime Sunday but it depends on how many cores crunch jobs between now and then.
ID: 35058 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 35065 - Posted: 21 Apr 2018, 16:01:44 UTC - in response to Message 35058.  

OK, something funny was happening to the last batch; I believe that jobs were failing for some reason and being requeued, so the Dashboard plot showed a large proportion of failed jobs. So, I aborted it to let the new batch start. It's early days yet but it seems to be behaving better. Any feedback on whether I've suppressed the log-writing to eoscms?
ID: 35065 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1504
Credit: 82,868,120
RAC: 78,988
Message 35066 - Posted: 21 Apr 2018, 17:53:39 UTC - in response to Message 35065.  

Any feedback on whether I've suppressed the log-writing to eoscms?

Sorry, not yet as my hosts got the last CMS very early this morning (most likely old batch) and did Theory and LHCb during the day.

Is it possible to see whether a fresh CMS WU is from the old batch or from the new batch?
Do you need some parts from the log that makes it easier for you?
ID: 35066 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 35067 - Posted: 21 Apr 2018, 18:45:22 UTC - in response to Message 35066.  

Any feedback on whether I've suppressed the log-writing to eoscms?

Sorry, not yet as my hosts got the last CMS very early this morning (most likely old batch) and did Theory and LHCb during the day.
Is it possible to see whether a fresh CMS WU is from the old batch or from the new batch?
Do you need some parts from the log that makes it easier for you?

Well, the old batch would have had "180417_120223_3057" in the filename string; the new one is "180420_181002_7017".
I did notice on my home PC that stageouts to eoscms had an authorization(sic) failure. Unfortunately that task ended just a while ago, but one job has run in the current task (trying to get X11 windows here from my work computers is too slow...). I noticed this:
INFO:root:StageOutMgr::__init__()
INFO:root:StageOutMgr::__init__(): Override: {'eos-lfn-prefix': ''}
INFO:root:=======StageOut Override: These are not the parameters you are looking for
INFO:root:==== Stageout configuration start ====
INFO:root:Local Stage Out Implementation to be used is: gfal2
Local Stage Out PNN to be used is T3_CH_CMSAtHome
Local Stage Out Catalog to be used is trivialcatalog_file:/cvmfs/cms.cern.ch/SITECONF/local/PhEDEx/storage.xml?protocol=https
Trivial File Catalog has been loaded:
	lfn-to-pfn: protocol=https path-match-re=/+store/(.*) result=https://data-bridge.cern.ch/myfed/cms-output/$1
	lfn-to-pfn: protocol=srmv2 path-match-re=/+store/(.*) result=https://data-bridge.cern.ch/myfed/cms-output/$1
	lfn-to-pfn: protocol=http path-match-re=/+store/(.*) result=http://data-bridge.cern.ch/myfed/cms-output/$1
There are 0 fallback stage out definitions.

INFO:root:==== Stageout configuration finish ====
INFO:root:==>Working on file: /store/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
INFO:root:===> Attempting Local Stage Out.
INFO:root:LFN to PFN match made:
LFN: /store/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
PFN: https://data-bridge.cern.ch/myfed/cms-output/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz

INFO:root:Creating output directory...
INFO:root:Running the stage out...
INFO:root:Command exited with status: 0
Output message: stdout: Sat Apr 21 19:14:04 BST 2018
Copying 164587 bytes file:///var/lib/condor/execute/dir_4266/job/WMTaskSpace/logArch1/logArchive.tar.gz => https://data-bridge.cern.ch/myfed/cms-output/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz
Bandwidth: 14200
gfal-copy exit status: 0
so it did try to write the log file to data-bridge rather than eoscms.
...and in https://data-bridge.cern.ch/myfed/cms-output/unmerged/logs/prod/2018/4/21/ireid_MonteCarlo_eff2_IDR_CMS_Home_180420_181002_7017/Production/0000/0/ I find:
Mode	Links	UID	GID	Size	Modified		Name
-rwxr-xr-x	0	0	0	160.7K	Sat, 21 Apr 2018 18:14:15 GMT	[Metalink]	5b032b5a-ffc8-45aa-a74f-94533cde62c7-231-0-logArchive.tar.gz

I think I might have found the magic sauce to overcome the latest hurdle put in our path. :-/
ID: 35067 · Report as offensive     Reply Quote
Profile MAGIC Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 960
Credit: 40,615,774
RAC: 5,560
Message 35068 - Posted: 21 Apr 2018, 18:51:51 UTC

ID: 35068 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1504
Credit: 82,868,120
RAC: 78,988
Message 35069 - Posted: 22 Apr 2018, 6:09:01 UTC

Not a fail but a long delay at the end of each subtask.

It happened tonight
that the long delay
has gone away.
Thanks to Ivan the knight.
ID: 35069 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Avatar

Send message
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 35072 - Posted: 22 Apr 2018, 9:39:52 UTC - in response to Message 35069.  

Thanks, but we might have picked up another problem -- now we're not successfully running merge jobs whereas before they were 100% successful. They are in "cool off" rather than "failure" so I guess they will be retried. These run on Laurence's T3_CH_CMSAtHome VM farm, not on your machines.
ID: 35072 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1504
Credit: 82,868,120
RAC: 78,988
Message 35394 - Posted: 28 May 2018, 17:56:56 UTC

The new CMS batch tries again to contact CERN via TCP port 1094 during stageout.
This time to eoscms-srv-b2.cern.ch (188.184.36.173).
If I correctly understand Ivan's comments starting here, this should not happen from outside the CERN network.

The communication is blocked and reported by my firewall as port and target system are not (yet?) mentioned in the FAQs.
ID: 35394 · Report as offensive     Reply Quote
Profile MAGIC Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 960
Credit: 40,615,774
RAC: 5,560
Message 35413 - Posted: 2 Jun 2018, 7:53:10 UTC
Last modified: 2 Jun 2018, 7:55:52 UTC

Grrrrrrrr.....once again.....

https://lhcathome.cern.ch/lhcathome/result.php?resultid=195573783

Only 3 Valids
Volunteer Mad Scientist For Life
ID: 35413 · Report as offensive     Reply Quote
Previous · 1 · 2 · 3 · Next

Message boards : CMS Application : Result stage-out failures


©2020 CERN