Message boards : CMS Application : Merge Jobs escaped?

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,906,714
RAC: 137,996
Message 28775 - Posted: 3 Feb 2017, 11:01:05 UTC

I am currently seeing dozens of downloads like the following while the CPU is nearly idle:

[03/Feb/2017:11:44:44 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-623-0-logArchive.tar.gz? HTTP/1.1" 200 162500 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:44:50 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-912-0-logArchive.tar.gz? HTTP/1.1" 200 164637 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:44:56 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-702-0-logArchive.tar.gz? HTTP/1.1" 200 163641 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:45:02 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0024/0/d852627a-e9e4-11e6-9115-02163e018309-122-0-logArchive.tar.gz? HTTP/1.1" 200 162555 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:45:08 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-620-0-logArchive.tar.gz? HTTP/1.1" 200 163602 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:45:14 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-532-0-logArchive.tar.gz? HTTP/1.1" 200 165181 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:45:20 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-512-0-logArchive.tar.gz? HTTP/1.1" 200 161051 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:45:26 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-613-0-logArchive.tar.gz? HTTP/1.1" 200 165700 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:45:32 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0024/0/d852627a-e9e4-11e6-9115-02163e018309-57-0-logArchive.tar.gz? HTTP/1.1" 200 162480 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:45:38 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-197-0-logArchive.tar.gz? HTTP/1.1" 200 163014 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:45:44 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-989-0-logArchive.tar.gz? HTTP/1.1" 200 162294 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:45:50 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-984-0-logArchive.tar.gz? HTTP/1.1" 200 161827 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:45:56 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-713-0-logArchive.tar.gz? HTTP/1.1" 200 166123 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:46:01 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/5c914ef8-e9d5-11e6-9115-02163e018309-606-0-logArchive.tar.gz? HTTP/1.1" 200 162029 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:11:46:08 +0100] "GET http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0024/0/d852627a-e9e4-11e6-9115-02163e018309-39-0-logArchive.tar.gz? HTTP/1.1" 200 165348 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT


They seem to belong to a merge job.
Shouldn't those jobs be kept inside the CERN network?
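
For anyone who wants to tally such requests in their own proxy log, here is a minimal Python sketch. It assumes the line layout seen in the excerpt above (bracketed timestamp, quoted request, status, size, two quoted fields, result code). The names `LINE_RE` and `summarize` are made up for illustration and are not part of any CMS or proxy tooling; the two sample lines are copied from the excerpt.

```python
import re

# Layout assumed from the excerpt above:
# [timestamp] "METHOD URL HTTP/x.y" status bytes "referer" "agent" RESULT
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d+) (?P<size>\d+) "[^"]*" "[^"]*" (?P<result>\S+)$'
)

def summarize(lines):
    """Return (number of logArchive GETs, total bytes transferred for them)."""
    count = 0
    total_bytes = 0
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not follow the assumed layout
        url = m.group("url").rstrip("?")  # the logged URLs end in a bare "?"
        if m.group("method") == "GET" and url.endswith("-logArchive.tar.gz"):
            count += 1
            total_bytes += int(m.group("size"))
    return count, total_bytes

PREFIX = ('http://vc-cms-output.cs3.cern.ch/unmerged/logs/prod/2017/2/3/'
          'ireid_MonteCarlo_eff_IDR_CMS_Home_170129_122247_3856/Production/0023/0/')
AGENT = '"gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29"'

sample = [
    '[03/Feb/2017:11:44:44 +0100] "GET ' + PREFIX +
    '5c914ef8-e9d5-11e6-9115-02163e018309-623-0-logArchive.tar.gz? HTTP/1.1" 200 162500 "-" ' +
    AGENT + ' TCP_MISS:HIER_DIRECT',
    '[03/Feb/2017:11:44:50 +0100] "GET ' + PREFIX +
    '5c914ef8-e9d5-11e6-9115-02163e018309-912-0-logArchive.tar.gz? HTTP/1.1" 200 164637 "-" ' +
    AGENT + ' TCP_MISS:HIER_DIRECT',
    'this line is not in the expected layout and is ignored',
]

print(summarize(sample))  # (2, 327137)
```

Feeding the function the lines of a real log file (for example `summarize(open(path))`, with `path` pointing at your own log) would show how many log archives a single task pulled and how much traffic that caused.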

computezrmle
Message 28776 - Posted: 3 Feb 2017, 12:52:15 UTC - in response to Message 28775.  

After the download of more than 500 files the job got stuck.
I noticed that the VM requested access to TCP port 1094.
This was previously not documented.

Any comments from the developers?

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 28777 - Posted: 3 Feb 2017, 14:25:33 UTC - in response to Message 28776.  
Last modified: 3 Feb 2017, 14:34:01 UTC

Yes, those are certainly unmerged jobs in the downloads, and no, they shouldn't be getting out. I'll make sure the crew is aware.

[Edit] Oh, hang on, those are log files! What's going on here? [/Edit]

ivan
Message 28779 - Posted: 3 Feb 2017, 15:38:45 UTC - in response to Message 28776.  
Last modified: 3 Feb 2017, 15:48:14 UTC

> After the download of more than 500 files the job got stuck.
> I noticed that the VM requested access to TCP port 1094.
> This was previously not documented.
>
> Any comments from the developers?

Port 1094 is for rootd access, directly accessing remote files from ROOT programmes (and others). It's used in CMSSW to access data (.root) files stored on remote systems.
This is all very strange.
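
To check from a volunteer host whether an outgoing TCP connection on a given port (here 1094) is possible at all, a small Python sketch like the following may help. The function name `tcp_port_open` is made up for this example, and the thread does not name a specific server, so any host you test against is your own choice.

```python
import socket

XROOTD_PORT = 1094  # the port discussed in this thread

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established.

    Any connection error (refused, timed out, name not resolvable)
    is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A call such as `tcp_port_open("some.host.example", XROOTD_PORT)` (host name is a placeholder) returns `False` when a local firewall drops or rejects the connection. Note that it cannot tell a firewall block apart from a server that is simply not listening.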

computezrmle
Message 28780 - Posted: 3 Feb 2017, 16:17:09 UTC - in response to Message 28779.  

> Port 1094 is for rootd access, directly accessing remote files from ROOT programmes (and others). It's used in CMSSW to access data (.root) files stored on remote systems.
> This is all very strange.

My VMs produce .root URLs as output of every job.
Normally they upload completely even though TCP port 1094 is closed, except for the first one below (status 0, TCP_MISS_ABORTED).
A few weeks ago that one would have been an error 151 (you already explained that).

[03/Feb/2017:08:32:56 +0100] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff_CMS_Home_IDR_v2-v11/00022/9AE1FACC-D5E9-E611-A20A-080027DA302A.root? HTTP/1.1" 0 75423829 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS_ABORTED:HIER_DIRECT
[03/Feb/2017:08:58:25 +0100] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff_CMS_Home_IDR_v2-v11/00022/001B67D8-D7E9-E611-9243-080027AFEF7C.root? HTTP/1.1" 200 68874340 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
[03/Feb/2017:10:06:12 +0100] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff_CMS_Home_IDR_v2-v11/00023/E2348D5F-E3E9-E611-A954-080027DA302A.root? HTTP/1.1" 200 75157412 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT

I would like to close TCP port 1094 if it is not necessary for normal tasks.
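
The incomplete upload in the excerpt above can also be picked out mechanically. The sketch below is only a heuristic, assuming that a status of 0 or a result code ending in `_ABORTED` marks a transfer that did not finish; `transfer_failed` is a made-up name, and the two sample lines are copied from the excerpt.

```python
def transfer_failed(line):
    """Return True if a proxy log line looks like an incomplete transfer.

    Heuristic (an assumption, not an official rule): HTTP status 0
    or a result code that ends in _ABORTED.
    """
    parts = line.split('"')
    if len(parts) < 7:
        raise ValueError("line does not match the assumed layout")
    status = int(parts[2].split()[0])             # field right after the request
    result_code = parts[6].strip().split(":")[0]  # e.g. TCP_MISS_ABORTED
    return status == 0 or result_code.endswith("_ABORTED")

BASE = ('http://vc-cms-output.cs3.cern.ch/unmerged/DMWM_Test/'
        'QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/'
        'MonteCarlo_eff_CMS_Home_IDR_v2-v11/00022/')
AGENT = '"gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29"'

aborted = ('[03/Feb/2017:08:32:56 +0100] "PUT ' + BASE +
           '9AE1FACC-D5E9-E611-A20A-080027DA302A.root? HTTP/1.1" 0 75423829 "-" ' +
           AGENT + ' TCP_MISS_ABORTED:HIER_DIRECT')
completed = ('[03/Feb/2017:08:58:25 +0100] "PUT ' + BASE +
             '001B67D8-D7E9-E611-9243-080027AFEF7C.root? HTTP/1.1" 200 68874340 "-" ' +
             AGENT + ' TCP_MISS:HIER_DIRECT')

print(transfer_failed(aborted), transfer_failed(completed))  # True False
```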

ivan
Message 28782 - Posted: 3 Feb 2017, 16:38:59 UTC - in response to Message 28780.  

> My VMs produce .root URLs as output of every job.
> Normally they upload completely although TCP port 1094 is closed - except the highlighted one below.
> That one would have been an error 151 some weeks before (you already explained that).

As far as I'm aware, we don't use (x)rootd to write files; we stage out with gfal-cp. rootd is usually used to read files which are to be further processed. It's the mechanism by which we can access our data files on remote systems, and by using xrootd "redirectors" we don't even need to know where they are.

> [03/Feb/2017:08:32:56 +0100] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff_CMS_Home_IDR_v2-v11/00022/9AE1FACC-D5E9-E611-A20A-080027DA302A.root? HTTP/1.1" 0 75423829 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS_ABORTED:HIER_DIRECT
> [03/Feb/2017:08:58:25 +0100] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff_CMS_Home_IDR_v2-v11/00022/001B67D8-D7E9-E611-9243-080027AFEF7C.root? HTTP/1.1" 200 68874340 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
> [03/Feb/2017:10:06:12 +0100] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff_CMS_Home_IDR_v2-v11/00023/E2348D5F-E3E9-E611-A954-080027DA302A.root? HTTP/1.1" 200 75157412 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
>
> I would like to close TCP port 1094 if it is not necessary for normal tasks.

As your examples show, the gfal/davs protocols use HTTP (I believe over port 80) so I can't think of a reason for 1094 to be open.
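
As a rough illustration of which port a given transfer URL implies, the Python sketch below maps URL schemes to their usual default ports. The table is an assumption based on common defaults (http 80, https 443, xrootd's `root://` 1094, with `dav`/`davs` treated as WebDAV over HTTP/HTTPS); a site can always configure other ports, and `effective_port` is a made-up helper name. The `root://` URLs in the usage lines are placeholders.

```python
from urllib.parse import urlsplit

# Usual default ports per scheme (assumed defaults, sites may differ).
DEFAULT_PORTS = {
    "http": 80,
    "dav": 80,
    "https": 443,
    "davs": 443,
    "root": 1094,  # xrootd
}

def effective_port(url):
    """Return the explicit port in the URL, else the scheme's usual default."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)

print(effective_port("http://vc-cms-output.cs3.cern.ch/unmerged/file.root"))  # 80
print(effective_port("root://redirector.example//store/file.root"))           # 1094
```

The stage-out URLs quoted in this thread all use the `http` scheme, so by this mapping they would go over port 80 and not 1094.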

ivan
Message 28810 - Posted: 7 Feb 2017, 17:10:27 UTC - in response to Message 28777.  

> Yes, those are certainly unmerged jobs in the downloads, and no, they shouldn't be getting out. I'll make sure the crew is aware.
>
> [Edit] Oh, hang on, those are log files! What's going on here? [/Edit]

Ah! I've just found out that as well as Merge jobs there are also LogCollect jobs. I'll ask Laurence to try to identify them so they can be kept within CERN as well.

computezrmle
Message 28811 - Posted: 7 Feb 2017, 18:55:09 UTC - in response to Message 28810.  

> ... to try to identify them so they can be kept within CERN as well.

+1

ivan
Message 28840 - Posted: 11 Feb 2017, 0:12:04 UTC - in response to Message 28811.  

OK, Laurence has just told me that LogCollect jobs are now being excluded from Volunteer machines as well as Merge jobs. From a scan of the Condor queue on Friday afternoon, I believe these are the only two categories of CMS_JobType apart from Production (which is the category Volunteers can process). If anyone spots another category, please let me know!
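
The rule described here can be pictured as an allowlist on CMS_JobType. The Python sketch below is purely illustrative: the record shape, the job ids and the "Unknown" category are hypothetical, and it is not the actual Condor configuration used by the project.

```python
# Job types named in this thread.
KNOWN_JOB_TYPES = {"Production", "Merge", "LogCollect"}
# Only Production is meant to run on volunteer machines.
VOLUNTEER_JOB_TYPES = {"Production"}

def allowed_on_volunteer(job):
    """job is a dict with a 'CMS_JobType' key (hypothetical record shape)."""
    return job.get("CMS_JobType") in VOLUNTEER_JOB_TYPES

queue = [
    {"id": 1, "CMS_JobType": "Production"},
    {"id": 2, "CMS_JobType": "Merge"},
    {"id": 3, "CMS_JobType": "LogCollect"},
    {"id": 4, "CMS_JobType": "Unknown"},  # hypothetical new category
]

volunteer_jobs = [j["id"] for j in queue if allowed_on_volunteer(j)]
unexpected = sorted({j["CMS_JobType"] for j in queue} - KNOWN_JOB_TYPES)

print(volunteer_jobs)  # [1]
print(unexpected)      # ['Unknown']
```

Using an allowlist instead of a list of excluded types means that a category nobody has spotted yet is kept off volunteer machines by default, and the `unexpected` list shows what would be worth reporting.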

computezrmle
Message 28842 - Posted: 11 Feb 2017, 8:49:35 UTC - in response to Message 28840.  

Thanks.