Message boards : CMS Application : Result stage-out failures

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 696
Credit: 5,575,006
RAC: 2,272
Message 34580 - Posted: 11 Mar 2018, 11:51:26 UTC

We appear to be having a problem with returning result files to the Data Bridge at CERN. Your jobs are running OK, but attempts to store the results are failing -- hence the big red spike you can see on the Dashboard plots (Jobs -> CMS Jobs).
I've opened a service ticket with CERN IT.
If you wish, you can stop running CMS tasks for now, or switch to another project, until the problem is solved.
ivan
Message 34591 - Posted: 12 Mar 2018, 13:49:01 UTC - in response to Message 34580.  
Last modified: 12 Mar 2018, 14:00:51 UTC

I think the Data Bridge problem is solved now; the results graph is starting to show successes again.
[Edit] Confirmed. Results are arriving again. You may resume processing CMS tasks. [/Edit]
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 960
Credit: 40,615,774
RAC: 5,560
Message 34604 - Posted: 13 Mar 2018, 1:55:46 UTC

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1504
Credit: 82,863,173
RAC: 78,944
Message 34787 - Posted: 29 Mar 2018, 7:07:49 UTC

Nothing but errors on the status page since last night.
Is anybody from the project team available to check this?
ivan
Message 34791 - Posted: 29 Mar 2018, 13:52:19 UTC - in response to Message 34787.  

> Nothing but errors on the status page since last night.
> Is anybody from the project team available to check this?

There was a problem with the Data Bridge last night. It has been corrected (and no, I know no more detail than that).
We're in a transitional stage at the moment, unfortunately with Easter looming, but the good bit of news I can impart is that as of a day or two ago we are writing merged result files into the T2_CH_CERN storage -- which means they are now available world-wide to anyone in CMS who wants to use them. *NOW* the challenge is to get workflows that suit our capabilities and limitations and produce results that end up in papers we can all point to with pride!
MAGIC Quantum Mechanic
Message 34793 - Posted: 29 Mar 2018, 19:32:03 UTC

I only had problems on March 27th

Only 5 tasks though and still had 9 Valids that day.

111 Valids and 13 Errors in the last week.
Volunteer Mad Scientist For Life
ivan
Message 34806 - Posted: 30 Mar 2018, 12:03:58 UTC - in response to Message 34793.  

Apparently there have been network problems, etc., at CERN today. We seem to have a few jobs running again now but the WMAgent is flagging a problem, so I've messaged the relevant maintainers. Whether they are able to respond on Good Friday remains to be seen.
ivan
Message 34815 - Posted: 30 Mar 2018, 19:08:59 UTC - in response to Message 34806.  
Last modified: 30 Mar 2018, 19:09:15 UTC

The failed component was restarted, and we're now showing green across the board. Thanks to the people who worked on it during a major holiday!
MAGIC Quantum Mechanic
Message 34821 - Posted: 30 Mar 2018, 23:02:45 UTC

&^%$$#@

I usually don't expect to see this when I check to see if I need more tasks
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10451775

I hope a reload here doesn't do that again. (it is like a jinx every time I think things are running ok)
Volunteer Mad Scientist For Life
rbpeake
Joined: 17 Sep 04
Posts: 76
Credit: 24,086,099
RAC: 0
Message 34825 - Posted: 31 Mar 2018, 2:33:13 UTC - in response to Message 34791.  


> We're in a transitional stage at the moment, unfortunately with Easter looming, but the good bit of news I can impart is that as of a day or two ago we are writing merged result files into the T2_CH_CERN storage -- which means they are now available world-wide to anyone in CMS who wants to use them. *NOW* the challenge is to get workflows that suit our capabilities and limitations and produce results that end up in papers we can all point to with pride!

How were the results used before this?
Thanks!
Regards,
Bob P.
ivan
Message 34831 - Posted: 31 Mar 2018, 10:15:53 UTC - in response to Message 34825.  


>> We're in a transitional stage at the moment, unfortunately with Easter looming, but the good bit of news I can impart is that as of a day or two ago we are writing merged result files into the T2_CH_CERN storage -- which means they are now available world-wide to anyone in CMS who wants to use them. *NOW* the challenge is to get workflows that suit our capabilities and limitations and produce results that end up in papers we can all point to with pride!
>
> How were the results used before this?
> Thanks!

Not widely, to be honest. To many people in CMS this is still "development", but we did run one workflow for over 6 months that couldn't get time on the GRID computers -- see this paper and also an overall view of LHC@home here.
computezrmle
Message 34918 - Posted: 8 Apr 2018, 18:07:36 UTC

Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file.
Instead a fresh subtask was started.

Any known problems?
ivan
Message 34921 - Posted: 8 Apr 2018, 19:05:39 UTC - in response to Message 34918.  

> Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file.
> Instead a fresh subtask was started.
>
> Any known problems?

No, but if you can identify the jobs I can ask Federica if she can find the logs.
NB: I have a meeting the next three days, I'll be away from my main computers and may not be able to respond as promptly as normal.
ivan
Message 34922 - Posted: 8 Apr 2018, 19:08:07 UTC - in response to Message 34921.  

>> Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file.
>> Instead a fresh subtask was started.
>>
>> Any known problems?
>
> No, but if you can identify the jobs I can ask Federica if she can find the logs.
> NB: I have a meeting the next three days, I'll be away from my main computers and may not be able to respond as promptly as normal.

[Edit] Oops, upspike on failed jobs. Investigating... [/Edit]
ivan
Message 34923 - Posted: 8 Apr 2018, 19:19:54 UTC - in response to Message 34922.  

>>> Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file.
>>> Instead a fresh subtask was started.
>>>
>>> Any known problems?
>>
>> No, but if you can identify the jobs I can ask Federica if she can find the logs.
>> NB: I have a meeting the next three days, I'll be away from my main computers and may not be able to respond as promptly as normal.
>
> [Edit] Oops, upspike on failed jobs. Investigating... [/Edit]

[Edit^2] Data bridge has gone to sleep again. E-mails sent, but it's Sunday night... [/Edit^2]
computezrmle
Message 34925 - Posted: 8 Apr 2018, 21:00:43 UTC - in response to Message 34923.  

> ... Data bridge has gone to sleep again. E-mails sent, but it's Sunday night...

Last upload was successful.
Thanks, Ivan.

[08/Apr/2018:22:50:34 +0200] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff2_CMS_Home_IDRv3bn-v11/00009/1042E151-643B-E811-8A64-080027B6AEF0.root? HTTP/1.1" 200 69732879 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
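
For anyone who wants to check their own proxy logs in the same way, here is a minimal Python sketch that pulls the method, URL, status and size out of a line like the one above. It assumes a Squid-style layout matching that single sample; the field names, the `parse_upload` helper and the rule that any 2xx reply to a PUT counts as a successful stage-out are illustrative assumptions, not anything defined by the project.

```python
import re

# Pattern written against the one sample line quoted above
# (a Squid-style access-log layout is assumed, not confirmed).
LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>[^\]]+)\]\s+'
    r'"(?P<method>[A-Z]+)\s+(?P<url>\S+)\s+HTTP/(?P<version>[\d.]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+)'
)


def parse_upload(line):
    """Return a dict describing one request, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["size"] = int(fields["size"])
    # Assumption: any 2xx answer to a PUT is treated as a successful upload.
    fields["ok"] = fields["method"] == "PUT" and 200 <= fields["status"] < 300
    return fields


sample = (
    '[08/Apr/2018:22:50:34 +0200] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/'
    'DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/'
    'MonteCarlo_eff2_CMS_Home_IDRv3bn-v11/00009/'
    '1042E151-643B-E811-8A64-080027B6AEF0.root? HTTP/1.1" 200 69732879 "-" '
    '"gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT'
)

result = parse_upload(sample)
print(result["ok"], result["status"], result["size"])  # True 200 69732879
```

Run over a whole log file, the same helper could be used to count failed PUTs per hour, which is roughly what the red spikes on the Dashboard reflect.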
ivan
Message 34926 - Posted: 8 Apr 2018, 21:52:16 UTC - in response to Message 34925.  

Yes, we have green again. Thanks to whoever responded to my mails. :-)
MAGIC Quantum Mechanic
Message 34927 - Posted: 8 Apr 2018, 23:41:32 UTC
Last modified: 8 Apr 2018, 23:44:37 UTC

Back to normal here
computezrmle
Message 34949 - Posted: 10 Apr 2018, 14:13:28 UTC

Although all of my hosts seem to be reporting CMS subtask results well at the moment, there is a growing red peak on one of the Dashboard graphs.
This may (or may not) be an indication of an upcoming problem.
I wonder whether anyone on the project team is aware of it, as Ivan mentioned he is currently away.
ivan
Message 34965 - Posted: 12 Apr 2018, 1:55:31 UTC - in response to Message 34949.  
Last modified: 12 Apr 2018, 1:59:47 UTC

I think that was during the ending of one batch and the starting of the next. We have some tuning/configuration to do after recent changes. At present you guys do the simulations and write small result files to Data Bridge. That's going well, with a small failure rate due to VMs not starting up properly after a sudden shutdown, etc. Then Laurence's cluster, now known as CMS site T3_CH_CMSAtHome, reads the small files from the (untrusted) Data Bridge, merges them into much larger (~2.5 Gb!) files and writes them into trusted storage on our CERN Tier2 site -- that's running at 100% success. However, the two other main post-production jobs are failing 100%: the one where CMSAtHome tries to delete the original result files on Data Bridge, and the one where it tries to read all the log files from Data Bridge and do a similar merge on them, writing the results to trusted storage as well. These failing jobs run towards the end of the total batch, i.e. within a short time span, which would explain a red peak at that point.
Neither affects the final result files being written into trusted storage; the Data Bridge deletes files older than 4 months by default anyhow, but not having the merged log files may be a problem if some unusual behaviour arises. We are working on this, but Easter and my being away in Scotland this week have slowed progress (when you have a meeting of 50 or 60 sysadmins trying to run their home systems from a resort hotel during a conference, the bandwidth contention on the WiFi links leads to molasses-like response...). There are some e-mails I have to read later today when I wake up (I got back from Pitlochry at 2315 last night) and then we'll try to formulate a strategy. You might see smaller batches (if you're monitoring Dashboard) for a while, because one debug strategy I see is to add diagnostic printouts to the relevant python scripts so that Federica can scan the logs for them, and fewer jobs per batch would reduce the turnaround latency.
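
To illustrate the merge step described above, here is a rough Python sketch of how small result files could be grouped into merge batches of a target size. It is not the code that runs on T3_CH_CMSAtHome: the greedy grouping rule, the `plan_merges` helper, the file names and the 70 MB file size are all hypothetical, and the target is taken from the "~2.5 Gb" figure read as 2.5 gigabytes.

```python
from typing import Iterable, List, Tuple

# Assumed target size for one merged file: the "~2.5 Gb" figure read as 2.5 GB.
TARGET_BYTES = int(2.5 * 1000**3)


def plan_merges(
    files: Iterable[Tuple[str, int]], target: int = TARGET_BYTES
) -> List[List[str]]:
    """Greedily group (name, size) pairs into batches of at most `target` bytes.

    A single file larger than `target` ends up in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical input: 100 unmerged result files of 70 MB each.
small_files = [(f"result_{i:03d}.root", 70_000_000) for i in range(100)]
plan = plan_merges(small_files)
print(len(plan), [len(batch) for batch in plan])  # 3 [35, 35, 30]
```

With these made-up numbers each merged file absorbs about 35 volunteer results, which gives a feel for why the post-production jobs only start piling up once most of a batch has been returned.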

[Edit] However something new has arisen in the last hour or so. Checking... [/Edit]
©2020 CERN