Message boards : CMS Application : Result stage-out failures
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
We appear to be having a problem with returning result files to the Data Bridge at CERN. Your jobs are running OK, but attempts to store the results are failing -- hence the big red spike you can see on the Dashboard plots (Jobs -> CMS Jobs). I've opened a service ticket with CERN IT. If you wish, you can stop running CMS tasks for now, or switch to another project, until the problem is solved.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,891,056 RAC: 138,179
Nothing but errors on the status page since last night. Is anybody from the project team available to check this?
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
> Nothing but errors on the status page since last night.

There was a problem with the Data Bridge last night. It has been corrected (and no, I know no more detail than that). We're in a transitional stage at the moment, unfortunately with Easter looming, but the good bit of news I can impart is that as of a day or two ago we are writing merged result files into the T2_CH_CERN storage -- which means they are now available world-wide to anyone in CMS who wants to use them.

*NOW* the challenge is to get workflows that suit our capabilities and limitations, and produce results that end up in papers we can all point to with pride!
Joined: 24 Oct 04 Posts: 1114 Credit: 49,501,728 RAC: 4,157
I only had problems on March 27th. Only 5 tasks though, and I still had 9 Valids that day; 111 Valids and 13 Errors in the last week.

Volunteer Mad Scientist For Life
Joined: 24 Oct 04 Posts: 1114 Credit: 49,501,728 RAC: 4,157
&^%$$#@ I usually don't expect to see this when I check to see if I need more tasks: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10451775

I hope a reload here doesn't do that again. (It is like a jinx every time I think things are running OK.)

Volunteer Mad Scientist For Life
Joined: 17 Sep 04 Posts: 99 Credit: 30,618,118 RAC: 3,938
How were the results used before this? Thanks!

Regards, Bob P.
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
Not widely, to be honest. To many people in CMS this is still "development", but we did run one workflow for over 6 months that couldn't get time on the GRID computers -- see this paper, and also an overall view of LHC@home here.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,891,056 RAC: 138,179
Had a couple of CMS subtasks within the past few hours that finished but did not upload any result file. Instead a fresh subtask was started. Any known problems?
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
> Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file.

No, but if you can identify the jobs I can ask Federica if she can find the logs.

NB: I'm in a meeting for the next three days; I'll be away from my main computers and may not be able to respond as promptly as normal.
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
> Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file.

[Edit] Oops, upspike on failed jobs. Investigating... [/Edit]
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
> Had a couple of CMS subtasks within the past hours that were finished but did not upload any result file.

[Edit^2] Data bridge has gone to sleep again. E-mails sent, but it's Sunday night... [/Edit^2]
Joined: 15 Jun 08 Posts: 2386 Credit: 222,891,056 RAC: 138,179
> ... Data bridge has gone to sleep again. E-mails sent, but it's Sunday night...

Last upload was successful. Thanks, Ivan.

[08/Apr/2018:22:50:34 +0200] "PUT http://vc-cms-output.cs3.cern.ch/unmerged/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/GEN-SIM/MonteCarlo_eff2_CMS_Home_IDRv3bn-v11/00009/1042E151-643B-E811-8A64-080027B6AEF0.root? HTTP/1.1" 200 69732879 "-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT
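For anyone who wants to decode that access-log line: it follows the usual HTTP access-log shape (timestamp, request, status code, byte count, referer, user agent, cache result). A small parsing sketch, with field names that are my own labels rather than any official schema:

```python
import re

# The Data Bridge access-log line quoted above, verbatim.
LOG_LINE = ('[08/Apr/2018:22:50:34 +0200] "PUT http://vc-cms-output.cs3.cern.ch/'
            'unmerged/DMWM_Test/QCD_Pt-40toInf_fwdJet_bwdJet_Tune4C_2p76TeV-pythia8/'
            'GEN-SIM/MonteCarlo_eff2_CMS_Home_IDRv3bn-v11/00009/'
            '1042E151-643B-E811-8A64-080027B6AEF0.root? HTTP/1.1" 200 69732879 '
            '"-" "gfal2-util/1.3.2 gfal2/2.11.1 neon/0.0.29" TCP_MISS:HIER_DIRECT')

# Pattern for: [timestamp] "METHOD URL PROTO" status bytes "referer" "agent" cache
LOG_RE = re.compile(
    r'\[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'(?P<cache>\S+)')

def parse_upload(line):
    """Return a dict of the log fields, or None if the line doesn't match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    d = m.groupdict()
    d["status"] = int(d["status"])
    d["bytes"] = int(d["bytes"])
    return d

rec = parse_upload(LOG_LINE)
# Status 200 on the PUT means the ~66 MiB result file reached the Data Bridge.
print(rec["method"], rec["status"], rec["bytes"])
```

Status 200 with a non-zero byte count is what a successful stage-out looks like; the earlier failures would presumably have logged 4xx/5xx statuses instead.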
Joined: 24 Oct 04 Posts: 1114 Credit: 49,501,728 RAC: 4,157
Back to normal here.
Joined: 15 Jun 08 Posts: 2386 Credit: 222,891,056 RAC: 138,179
Although all of my hosts seem to report CMS subtask results quite well at the moment, there is a growing red peak in one of the dashboard graphics. This may (or may not) be an indication of an upcoming problem. I wonder if anyone on the project team is aware of it, as Ivan mentioned he is currently absent.
Joined: 29 Aug 05 Posts: 997 Credit: 6,264,307 RAC: 71
I think that was during the ending of one batch and the starting of the next. We have some tuning/configuration to do after recent changes.

At present you guys do the simulations and write small result files to Data Bridge. That's going well, with a small failure rate due to VMs not starting up properly after a sudden shutdown, etc. Then Laurence's cluster, now known as CMS site T3_CH_CMSAtHome, reads the small files from the (untrusted) Data Bridge, merges them into much larger (~2.5 GB!) files, and writes them into trusted storage on our CERN Tier-2 site -- that's running at 100% success.

However, the two other main post-production jobs are failing 100%: when CMSAtHome then tries to delete the original result files on Data Bridge, and when it also tries to read all the log files from Data Bridge, do a similar merge on them, and write the results to trusted storage as well. These failing jobs run towards the end of the total batch, i.e. within a short time span. Neither affects the final result files being written into trusted storage; the Data Bridge deletes files older than 4 months by default anyhow, but not having the merged log files may be a problem if some unusual behaviour arises.

We are working on this, but Easter and my being away in Scotland this week have slowed progress (when you have a meeting of 50 or 60 sysadmins trying to run their home systems from a resort hotel during a conference, the bandwidth contention on the WiFi links leads to molasses-like response...). There are some e-mails I have to read later today when I wake up (I got back from Pitlochry at 23:15 last night) and then we'll try to formulate a strategy. You might see smaller batches (if you're monitoring Dashboard) for a while, because one debug strategy I see is to add diagnostic printouts to the relevant Python scripts so that Federica can scan the logs for them, and fewer jobs per batch would reduce the turnaround latency.
[Edit] However, something new has arisen in the last hour or so. Checking... [/Edit]
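The post-production flow described above can be sketched as a toy model. Everything here is illustrative -- the function names, the in-memory "storage" dicts, and the file names are my own inventions, not the real CMS production tooling:

```python
# Toy model of the post-production described above: volunteers write small
# result files to the (untrusted) Data Bridge; T3_CH_CMSAtHome merges them
# into large files on trusted Tier-2 storage, then cleans up and merges logs.
# All names are hypothetical; the real system uses CMS production scripts.

def merge(parts):
    """Concatenate small files into one big file (stand-in for the real merge)."""
    return b"".join(parts)

def run_post_production(bridge, trusted):
    # Stage 1: read small result files from the bridge and write the merged
    # file to trusted storage. This is the step reported as 100% successful.
    trusted["merged.root"] = merge(
        [bridge[n] for n in sorted(bridge) if n.endswith(".root")])

    # Stage 2 (failing in the real system): delete the originals on the bridge.
    for name in [n for n in bridge if n.endswith(".root")]:
        del bridge[name]

    # Stage 3 (also failing): merge the job logs the same way.
    trusted["merged.log"] = merge(
        [bridge[n] for n in sorted(bridge) if n.endswith(".log")])

# Tiny in-memory stand-ins for the Data Bridge and Tier-2 storage.
bridge = {"a.root": b"A", "b.root": b"B", "a.log": b"la", "b.log": b"lb"}
trusted = {}
run_post_production(bridge, trusted)
print(sorted(trusted), sorted(bridge))
```

The ordering shows why the failures are not fatal to the physics output: the merged result file is written to trusted storage in stage 1, before the cleanup and log-merge stages run, so a stage-2 or stage-3 failure only leaves stale files and unmerged logs behind.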
©2024 CERN