Thread 'Result stage-out failures'

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34580 - Posted: 11 Mar 2018, 11:51:26 UTC We appear to be having a problem with returning result files to the Data Bridge at CERN. Your jobs are running OK, but attempts to store the results are failing -- hence the big red spike you can see on the Dashboard plots (Jobs -> CMS Jobs). I've opened a service ticket with CERN IT. If you wish, you can stop running CMS tasks for now, or switch to another project, until the problem is solved. ID: 34580 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34591 - Posted: 12 Mar 2018, 13:49:01 UTC - in response to Message 34580. Last modified: 12 Mar 2018, 14:00:51 UTC I think the Data Bridge problem is solved now; the results graph is starting to show successes again. [Edit] Confirmed. Results are arriving again. You may resume processing CMS tasks. [/Edit] ID: 34591 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1318 Credit: 98,503,680 RAC: 107,468	Message 34604 - Posted: 13 Mar 2018, 1:55:46 UTC 25 Valids and only 3 errors so far ID: 34604 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 305,170,487 RAC: 116,091	Message 34787 - Posted: 29 Mar 2018, 7:07:49 UTC Nothing but errors on the status page since last night. Is anybody from the project team available to check this? ID: 34787 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34791 - Posted: 29 Mar 2018, 13:52:19 UTC - in response to Message 34787. Nothing but errors on the status page since last night. Is anybody from the project team available to check this? There was a problem with the Data Bridge last night. It has been corrected (and no, I know no more detail than that). We're in a transitional stage at the moment, unfortunately with Easter looming, but the good bit of news I can impart is that as of a day or two ago we are writing merged result files into the T2_CH_CERN storage -- which means they are now available world-wide to anyone in CMS who wants to use them. NOW the challenge is to get workflows that suit our capabilities and limitations and produce results that end up in papers we can all point to with pride! ID: 34791 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1318 Credit: 98,503,680 RAC: 107,468	Message 34793 - Posted: 29 Mar 2018, 19:32:03 UTC I only had problems on March 27th. Only 5 tasks, though, and I still had 9 Valids that day. 111 Valids and 13 Errors in the last week. Volunteer Mad Scientist For Life unbelievable are you trying to promote linux again? ID: 34793 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34806 - Posted: 30 Mar 2018, 12:03:58 UTC - in response to Message 34793. Apparently there have been network problems, etc., at CERN today. We seem to have a few jobs running again now but the WMAgent is flagging a problem, so I've messaged the relevant maintainers. Whether they are able to respond on Good Friday remains to be seen. ID: 34806 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34815 - Posted: 30 Mar 2018, 19:08:59 UTC - in response to Message 34806. Last modified: 30 Mar 2018, 19:09:15 UTC The failed component was restarted, and we're now showing green across the board. Thanks to the people who worked on it during a major holiday! ID: 34815 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1318 Credit: 98,503,680 RAC: 107,468	Message 34821 - Posted: 30 Mar 2018, 23:02:45 UTC &^%$$#@ I usually don't expect to see this when I check to see if I need more tasks https://lhcathome.cern.ch/lhcathome/results.php?hostid=10451775 I hope a reload here doesn't do that again. (it is like a jinx every time I think things are running ok) Volunteer Mad Scientist For Life unbelievable are you trying to promote linux again? ID: 34821 · Reply Quote

rbpeake Send message Joined: 17 Sep 04 Posts: 106 Credit: 36,685,648 RAC: 6,061	Message 34825 - Posted: 31 Mar 2018, 2:33:13 UTC - in response to Message 34791. We're in a transitional stage at the moment, unfortunately with Easter looming, but the good bit of news I can impart is that as of a day or two ago we are writing merged result files into the T2_CH_CERN storage -- which means they are now available world-wide to anyone in CMS who wants to use them. NOW the challenge is to get workflows that suit our capabilities and limitations and produce results that end up in papers we can all point to with pride! How were the results used before this? Thanks! Regards, Bob P. ID: 34825 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34831 - Posted: 31 Mar 2018, 10:15:53 UTC - in response to Message 34825. We're in a transitional stage at the moment, unfortunately with Easter looming, but the good bit of news I can impart is that as of a day or two ago we are writing merged result files into the T2_CH_CERN storage -- which means they are now available world-wide to anyone in CMS who wants to use them. NOW the challenge is to get workflows that suit our capabilities and limitations and produce results that end up in papers we can all point to with pride! How were the results used before this? Thanks! Not widely, to be honest. To many people in CMS this is still "development", but we did run one workflow for over 6 months that couldn't get time on the GRID computers -- see this paper and also an overall view of LHC@home here. ID: 34831 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 305,170,487 RAC: 116,091	Message 34918 - Posted: 8 Apr 2018, 18:07:36 UTC Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file. Instead a fresh subtask was started. Any known problems? ID: 34918 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34921 - Posted: 8 Apr 2018, 19:05:39 UTC - in response to Message 34918. Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file. Instead a fresh subtask was started. Any known problems? No, but if you can identify the jobs I can ask Federica if she can find the logs. NB: I have a meeting the next three days, I'll be away from my main computers and may not be able to respond as promptly as normal. ID: 34921 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34922 - Posted: 8 Apr 2018, 19:08:07 UTC - in response to Message 34921. Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file. Instead a fresh subtask was started. Any known problems? No, but if you can identify the jobs I can ask Federica if she can find the logs. NB: I have a meeting the next three days, I'll be away from my main computers and may not be able to respond as promptly as normal. [Edit] Oops, upspike on failed jobs. Investigating... [/Edit] ID: 34922 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34923 - Posted: 8 Apr 2018, 19:19:54 UTC - in response to Message 34922. Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file. Instead a fresh subtask was started. Any known problems? No, but if you can identify the jobs I can ask Federica if she can find the logs. NB: I have a meeting the next three days, I'll be away from my main computers and may not be able to respond as promptly as normal. [Edit] Oops, upspike on failed jobs. Investigating... [/Edit] [Edit^2] Data bridge has gone to sleep again. E-mails sent, but it's Sunday night... [/Edit^2] ID: 34923 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 305,170,487 RAC: 116,091	Message 34925 - Posted: 8 Apr 2018, 21:00:43 UTC - in response to Message 34923. ... Data bridge has gone to sleep again. E-mails sent, but it's Sunday night... Last upload was successful. Thanks, Ivan. [pre][08/Apr/2018:22:50:34 +0200] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff2_CMS_Home_IDRv3bn-v11/00009/1042E151-643B-E811-8A64-080027B6AEF0.root? HTTP/1.1" 200 69732879 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT[/pre] ID: 34925 · Reply Quote
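For anyone who wants to check their own proxy log for successful result uploads, the line quoted above can be picked apart with a short script. This is only a sketch: the field layout is inferred from that single quoted line, the name `parse_upload_line` is made up for the example, and treating any 2xx status on a PUT as "upload succeeded" is an assumption, not something the project has confirmed.

```python
import re
from datetime import datetime

# Assumed layout, inferred only from the one log line quoted in the post above.
LINE_RE = re.compile(
    r'\[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'(?P<cache>\S+)'
)


def parse_upload_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["when"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = int(rec["size"])  # byte count exactly as logged
    # Assumption: a PUT answered with a 2xx status means the upload succeeded.
    rec["ok"] = rec["method"] == "PUT" and 200 <= rec["status"] < 300
    return rec


# The line from the post, split across string literals only for readability.
SAMPLE = (
    '[08/Apr/2018:22:50:34 +0200] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/'
    'DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/'
    'MonteCarlo_eff2_CMS_Home_IDRv3bn-v11/00009/'
    '1042E151-643B-E811-8A64-080027B6AEF0.root? HTTP/1.1" 200 69732879 "-" '
    '"gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT'
)

if __name__ == "__main__":
    rec = parse_upload_line(SAMPLE)
    print(rec["when"].isoformat(), rec["method"], rec["status"], rec["size"], rec["ok"])
```

Feeding every line of a log through `parse_upload_line` and counting the records where `ok` is false would give a rough local view of the stage-out failures, independent of the Dashboard.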

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34926 - Posted: 8 Apr 2018, 21:52:16 UTC - in response to Message 34925. Yes, we have green again. Thanks to whoever responded to my mails. :-) ID: 34926 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1318 Credit: 98,503,680 RAC: 107,468	Message 34927 - Posted: 8 Apr 2018, 23:41:32 UTC Last modified: 8 Apr 2018, 23:44:37 UTC Back to normal here ID: 34927 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2760 Credit: 305,170,487 RAC: 116,091	Message 34949 - Posted: 10 Apr 2018, 14:13:28 UTC Although all of my hosts seem to report CMS subtask results quite well at the moment, there is a growing red peak in one of the Dashboard graphs. This may (or may not) be an indication of an upcoming problem. I wonder if anyone on the project team is aware of it, as Ivan mentioned he is currently absent. ID: 34949 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1163 Credit: 11,930,268 RAC: 7,617	Message 34965 - Posted: 12 Apr 2018, 1:55:31 UTC - in response to Message 34949. Last modified: 12 Apr 2018, 1:59:47 UTC I think that was during the ending of one batch and the starting of the next. We have some tuning/configuration to do after recent changes.
At present you guys do the simulations and write small result files to the Data Bridge. That's going well, with a small failure rate due to VMs not starting up properly after a sudden shutdown, etc. Then Laurence's cluster, now known as CMS site T3_CH_CMSAtHome, reads the small files from the (untrusted) Data Bridge, merges them into much larger (~2.5 GB!) files and writes them into trusted storage on our CERN Tier2 site -- that's running at 100% success.
However, the two other main post-production jobs are failing 100%: the one where CMSAtHome then tries to delete the original result files on the Data Bridge, and the one where it tries to read all the log files from the Data Bridge and do a similar merge on them, writing the results to trusted storage as well. These failing jobs run towards the end of the total batch, i.e. within a short time span, which would explain the peak. Neither affects the final result files being written into trusted storage; the Data Bridge deletes files older than 4 months by default anyhow, but not having the merged log files may be a problem if some unusual behaviour arises.
We are working on this, but Easter and my being away in Scotland this week have slowed progress (when you have a meeting of 50 or 60 sysadmins trying to run their home systems from a resort hotel during a conference, the bandwidth contention on the WiFi links leads to molasses-like response...). There are some e-mails I have to read later today when I wake up (I got back from Pitlochry at 2315 last night) and then we'll try to formulate a strategy.
You might see smaller batches (if you're monitoring Dashboard) for a while, because one debug strategy I see is to add diagnostic printouts to the relevant Python scripts so that Federica can scan the logs for them, and fewer jobs per batch would reduce the turnaround latency. [Edit] However, something new has arisen in the last hour or so. Checking... [/Edit] ID: 34965 · Reply Quote
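The debug strategy mentioned in the last post (tagged printouts in the Python scripts, which someone then picks out of the job logs) could look roughly like the sketch below. Everything here is hypothetical: the marker `CMSDIAG`, the helper names `diag` and `scan_log`, and the example field names are invented for illustration and are not taken from the actual CMS scripts.

```python
import sys

DIAG_TAG = "CMSDIAG"  # hypothetical marker, chosen only for this sketch


def diag(step, **fields):
    """Emit one tagged diagnostic line on stderr and return it."""
    details = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    line = f"{DIAG_TAG} step={step} {details}".rstrip()
    print(line, file=sys.stderr)
    return line


def scan_log(lines):
    """Yield only the tagged diagnostic lines from an iterable of log lines."""
    for line in lines:
        if DIAG_TAG in line:
            yield line.strip()


if __name__ == "__main__":
    # Pretend job log: ordinary output mixed with one tagged diagnostic line.
    log = [
        "starting post-production step",
        diag("cleanup", files=3, target="data-bridge"),
        "step finished with error",
    ]
    for hit in scan_log(log):
        print(hit)
```

A fixed, unusual marker keeps the scan to a simple substring match, so the same filtering could just as well be done with `grep` on the collected logs.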