1) Message boards : CMS Application : CMS Tasks Failing (Message 43359)
Posted 5 days ago by ivan
Post:
Is there a good recipe for disabling Hyper-V?

A recent comment posted by Microsoft:
https://docs.microsoft.com/en-us/troubleshoot/windows-client/application-management/virtualization-apps-not-work-with-hyper-v

Thanks. I've done most of those I think, but I'll go through it step-by-step.
2) Message boards : CMS Application : CMS Tasks Failing (Message 43356)
Posted 6 days ago by ivan
Post:
VT-x needs to be enabled in the BIOS of an Intel PC,
and Hyper-V in Windows needs to be DISABLED.
If other errors appear after a reboot, please report them.

Is there a good recipe for disabling Hyper-V? I tried several methods found on the Web, but I still get
Virtualization Virtualbox (6.1.12) installed, CPU does not have hardware virtualization support
in https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10653693
I thought it might have been because I had Windows Subsystem for Linux installed, but after I
removed that, Hyper-V still comes back every time I boot.
3) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 43343)
Posted 8 days ago by ivan
Post:
no jobs since last night; consequently, the tasks download queue was stopped

The workflow I injected yesterday is not sending jobs to the Condor server, and the previous one drained its queue. I've submitted a new workflow; if it also doesn't start then I'll raise a ticket with CERN IT.

It looks like they were blocked by a bug in the testbed system. They've been manually moved to "staged" and one is now showing as "running-open" so hopefully jobs will start flowing soon.
[Edit] Yes, I'm starting to see numbers in the "pending" and "running" columns. [/Edit]
4) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 43341)
Posted 8 days ago by ivan
Post:
no jobs since last night; consequently, the tasks download queue was stopped

The workflow I injected yesterday is not sending jobs to the Condor server, and the previous one drained its queue. I've submitted a new workflow; if it also doesn't start then I'll raise a ticket with CERN IT.
5) Message boards : CMS Application : New Version 50.00 (Message 43321)
Posted 12 days ago by ivan
Post:
You can see some data on job timings, etc., in the job graphs. I grabbed the graphs that I felt were most useful, but you can play around with the parameters if you like (in particular, if you click on the back-arrow within a plot, you can see a whole lot of other plots that you can view in full by clicking on the plot title and selecting "View" on the drop-down menu). Note that not all of these graphs are properly populated; CMS@Home is not a high priority for the monitoring crew.
My initial aim when this all started was to run jobs (or sub-tasks as some call them) that ran for 1-2 hours and returned up to 100 MB of results. This was mainly based on my connection at the time, which was 5-6 Mbps download and 1 Mbps upload, and the assumption that most people would only run one task at a time, or at least adjust the number of tasks to suit their connectivity. There has always been the problem of people being over-enthusiastic about their contribution and running into the sort of problem being discussed here. We also have to choose our tasks carefully; I could easily send you jobs that would tax a 100 Mbps link!
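To put rough numbers on that, here is a minimal back-of-envelope sketch in Python. It assumes decimal megabytes (1 MB = 8 Mb), ignores protocol overhead, and treats each running task as returning one result per job; the function names are purely illustrative and not part of any CMS@Home or BOINC code.

```python
def transfer_seconds(size_mb: float, link_mbps: float) -> float:
    """Seconds needed to move size_mb megabytes over a link of link_mbps megabits/s.

    Assumes 1 MB = 8 Mb and no protocol overhead (both simplifications).
    """
    return size_mb * 8 / link_mbps


def uplink_utilisation(n_tasks: int, size_mb: float,
                       link_mbps: float, job_hours: float) -> float:
    """Fraction of the uplink kept busy if n_tasks each upload one
    size_mb result every job_hours hours. A value above 1.0 means the
    link cannot keep up."""
    return n_tasks * transfer_seconds(size_mb, link_mbps) / (job_hours * 3600)


if __name__ == "__main__":
    # One 100 MB result over a 1 Mbps uplink
    print(f"upload time: {transfer_seconds(100, 1) / 60:.1f} min")
    # Uplink load for 1..6 concurrent tasks with 1-hour jobs
    for n in range(1, 7):
        print(n, f"{uplink_utilisation(n, 100, 1, 1.0):.2f}")
```

Under these assumptions a single 100 MB result occupies a 1 Mbps uplink for 800 seconds, and with 1-hour jobs the link saturates somewhere between four and five concurrent tasks, which is why the number of tasks needs to match the connection.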
6) Message boards : CMS Application : Grafana Errors (Message 43294)
Posted 23 days ago by ivan
Post:
We now have a new set of monit/grafana job graphs, because CMS has updated their monitoring. They still show the same things, but they are now available on this LHC@Home site as well as at LHC@Home-dev.
7) Message boards : CMS Application : Subtask Results don't upload (Message 43285)
Posted 24 days ago by ivan
Post:
The failure rate is again at 100% since 19:24 UTC.
Sorry Ivan.
This requires more investigation.

Yes, I've just noticed that. I've updated the incident ticket. Bummer... I haven't heard yet exactly what caused the initial problem.

Apparently an automatic update broke things again. It's working now, and updates have been disabled...
8) Message boards : CMS Application : Subtask Results don't upload (Message 43283)
Posted 25 days ago by ivan
Post:
The failure rate is again at 100% since 19:24 UTC.
Sorry Ivan.
This requires more investigation.

Yes, I've just noticed that. I've updated the incident ticket. Bummer... I haven't heard yet exactly what caused the initial problem.
9) Message boards : CMS Application : Subtask Results don't upload (Message 43279)
Posted 25 days ago by ivan
Post:
I have now opened a ticket on this with CERN IT.

There was a problem with the DataBridge which now seems to have been resolved. The failure rate has been rather more sensible since about 0730 GMT.
10) Message boards : CMS Application : Subtask Results don't upload (Message 43276)
Posted 26 days ago by ivan
Post:
I have now opened a ticket on this with CERN IT.
11) Message boards : CMS Application : Subtask Results don't upload (Message 43260)
Posted 28 days ago by ivan
Post:
We are definitely having problems at the moment -- it seemed to start around 2100 UTC Friday night. All stage-outs (uploads of data and logs) are failing, although this is not reflected in your user credits. I've been trying to find out what's wrong but given that it's a Sunday in August in Europe, there has yet to be any response. If you can, change to another project while we track this down, to minimise our failed traffic load.
12) Message boards : CMS Application : Subtask Results don't upload (Message 43259)
Posted 28 days ago by ivan
Post:
We are having an inordinate number of stage-out failures, both for result files and log files, it seems. I've messaged Laurence.
13) Message boards : CMS Application : Grafana Errors (Message 43240)
Posted 20 Aug 2020 by ivan
Post:
There was a site-wide monitoring problem on Tuesday that seemed to affect Grafana. Yesterday's problems were something different that seems to be fixed now.
14) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 43239)
Posted 20 Aug 2020 by ivan
Post:
Monitors are making sense now. I have a bad network connection so it's hard to keep right up-to-date.
15) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 43238)
Posted 20 Aug 2020 by ivan
Post:
One of the capables restarted the agent. The monitor still insists there is a problem, but other indications are that jobs are available again. I'm keeping an eye on it as much as I can.
16) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 43237)
Posted 20 Aug 2020 by ivan
Post:
OK, looks like an upgrade incompatibility. If you can follow it:
https://github.com/dmwm/WMCore/issues/9876
17) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 43236)
Posted 20 Aug 2020 by ivan
Post:
Actually, on looking more closely, I may have been a bit too harsh on the WMCore developers. This suggests the problem pre-dates the intervention:
agent last updated: 2020/8/18 (Tue) 16:01:14 UTC : 41 h 51 m
There was a general failure of monitoring software across the CERN network on Tuesday; it may have played some part in the problem.
18) Message boards : CMS Application : EXIT_NO_SUB_TASKS (Message 43235)
Posted 20 Aug 2020 by ivan
Post:
Yes, sorry, we seem to be having a WMAgent problem again. It might be related to an "intervention" (i.e. code update) that occurred yesterday -- unfortunately I only ever get notified of these post facto. Our agent needs some manual tweaks when it's restarted and these may need to be re-applied. I mailed several people who can do a kickstart on the agent last night, but no response yet. It's August; I know at least one of them is on holiday.

At least the automatic stoppage of the queue worked. The WMStats agent monitor reckoned that jobs were still running, but from this and other indications I now realise that this was false; it just took me a while to look at the other indicators. Sorry for the delay; I await developments.
19) Message boards : News : Interruption to CMS@Home, Wednesday 15th July (Message 43067)
Posted 15 Jul 2020 by ivan
Post:
I should have added: We do stop issuing tasks when we detect that there are no new jobs available -- a problem that we've only just solved was that jobs were available but unbeknownst to our software they were flagged as not to run on volunteer machines. This led to the sort of scenario I described earlier, as BOINC thought jobs were available and kept serving up tasks.
Notwithstanding that, remember that our tasks run for more than 12 hours (usually less than 18), running several jobs consecutively. If we run out of jobs mid-task, that can lead to a task being flagged as failed. This is why I try to give as much warning as possible of an upcoming outage, so that tasks can be left to finish up before jobs run out.
20) Message boards : CMS Application : CMS Tasks Failing (Message 43065)
Posted 15 Jul 2020 by ivan
Post:
We are running again.

©2020 CERN