Message boards : CMS Application : New CMS job graphs
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 1,358
Message 39510 - Posted: 7 Aug 2019, 14:03:07 UTC

I'm trying to work out how to let everyone see the new Dashboard (Grafana) job plots. If you don't have CERN credentials, it appears that you can obtain limited credentials if you belong to one of a specified list of organisations or have an account with certain public services such as Facebook or Google.
Try to access the plots via my test page: https://www.brunel.ac.uk/~eesridr/cms_job.php. If you get a Grafana log-in page, select the CERN SSO option (single sign-on) and see if you can create your own permissions.
Let me know if it works. If it's successful, I'll pass it on to Laurence to put on our web-site.
ID: 39510 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,948,365
RAC: 81,119
Message 39511 - Posted: 7 Aug 2019, 14:50:45 UTC - in response to Message 39510.  

Let me know if it works.

Thanks.
I used a Google account and it works as described.

Among lots of other information it shows how many jobs are completed/currently running.
Does it also show how many are waiting in the queue (pending seems to have a different meaning)?
ID: 39511 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,948,365
RAC: 81,119
Message 39513 - Posted: 7 Aug 2019, 19:20:29 UTC

The given links to Grafana set a fixed timeframe, e.g.
from 2019-07-31 14:55:16
to 2019-08-07 14:43:16
from=1564577716023&to=1565181796023



The following part of the links might be changed to show (e.g.) the last week until "now":
from=now-7d&to=now
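
For anyone who wants to rewrite such links by script, here is a minimal Python sketch. The dashboard URL is a made-up placeholder, and `use_relative_range` and `epoch_ms_to_utc` are illustrative names, not part of Grafana. The only things taken from the links above are the `from`/`to` query parameters, the epoch values (treated here as milliseconds) and the `now-7d` relative form.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def epoch_ms_to_utc(ms):
    """Convert an epoch timestamp in milliseconds to a UTC datetime (whole seconds)."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)


def use_relative_range(url, span="7d"):
    """Replace any fixed from/to parameters with a range ending at 'now'."""
    parts = urlsplit(url)
    # Keep every query parameter except the old time range.
    params = [(key, value)
              for key, value in parse_qsl(parts.query, keep_blank_values=True)
              if key not in ("from", "to")]
    params += [("from", f"now-{span}"), ("to", "now")]
    return urlunsplit(parts._replace(query=urlencode(params)))


# Hypothetical dashboard link carrying the fixed range quoted above.
fixed = ("https://grafana.example.org/d/abc/cms-jobs"
         "?orgId=1&from=1564577716023&to=1565181796023")

print(epoch_ms_to_utc(1564577716023))  # 2019-07-31 12:55:16+00:00
print(use_relative_range(fixed))
# https://grafana.example.org/d/abc/cms-jobs?orgId=1&from=now-7d&to=now
```

The converted value comes out two hours earlier than the 14:55:16 quoted above, which suggests that time was shown in a local zone two hours ahead of UTC.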
ID: 39513 · Report as offensive     Reply Quote
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 950
Credit: 40,397,195
RAC: 4,165
Message 39515 - Posted: 7 Aug 2019, 19:56:32 UTC
Last modified: 7 Aug 2019, 20:09:23 UTC

(I think some SETI critters have taken over my Hughes satellite, so the speed is like 1995 dial-up right now.)
You would think that after almost 60 years we would have a satellite system a bit faster than this snail from space that Hughes sells as its top-of-the-line Gen5.


Well imagine that....
ID: 39515 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 1,358
Message 39518 - Posted: 8 Aug 2019, 8:23:03 UTC - in response to Message 39513.  

The given links to Grafana set a fixed timeframe, e.g.
from 2019-07-31 14:55:16
to 2019-08-07 14:43:16
from=1564577716023&to=1565181796023



The following part of the links might be changed to show (e.g.) the last week until "now":
from=now-7d&to=now

Ah, thanks for catching that. I'll change it later -- have to go arrange the payment of my next six months' rent just now...
ID: 39518 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 1,358
Message 39522 - Posted: 8 Aug 2019, 13:05:40 UTC - in response to Message 39511.  

Among lots of other information it shows how many jobs are completed/currently running.
Does it also show how many are waiting in the queue (pending seems to have a different meaning)?

I'll have to dig around to see if that's available -- as you say, Grafana seems to have a different definition of "pending" to what I see in WMStats.

ID: 39522 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 1,358
Message 39523 - Posted: 8 Aug 2019, 13:11:03 UTC - in response to Message 39518.  

Time range changed to last 7 days.
ID: 39523 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,948,365
RAC: 81,119
Message 39548 - Posted: 9 Aug 2019, 18:38:39 UTC

Looking at the different Grafana graphs here, I'm a bit confused.

# of completed jobs seems to be stable at a bit more than 50/h.
# of running jobs seems to be stable around 150/h.
Failure rate average is far less than 10%.

What happened to the other roughly 100 jobs/h?
ID: 39548 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 1,358
Message 39558 - Posted: 10 Aug 2019, 8:30:23 UTC - in response to Message 39548.  

Looking at the different Grafana graphs here, I'm a bit confused.

# of completed jobs seems to be stable at a bit more than 50/h.
# of running jobs seems to be stable around 150/h.
Failure rate average is far less than 10%.

What happened to the other roughly 100 jobs/h?

I changed the job duration a couple of weeks ago, from 40,000 to 100,000 events, as the average CPU time was under 1 hour (I couldn't see this easily until the new Dashboard was introduced). Hopefully this will improve efficiency -- more CPU for the same amount of "downtime" during jobs. You'll see the average CPU time is now pushing 1.5 hours (with lots of variation from hour to hour). I'm still looking for a "queued" or "pending" graph that matches what I see on WMStats.
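
The efficiency argument can be put in numbers with a small sketch. The fixed per-job overhead of 15 minutes is purely an assumed figure for illustration (the real "downtime" per job is not given here), and `cpu_efficiency` is a hypothetical helper name.

```python
def cpu_efficiency(cpu_hours, overhead_hours):
    """Fraction of a job's wall time spent on CPU, assuming a fixed overhead per job."""
    return cpu_hours / (cpu_hours + overhead_hours)


OVERHEAD = 0.25  # assumed: 15 minutes of non-CPU time per job (illustrative only)

# Roughly the average CPU times before and after the change in job length.
for cpu in (1.0, 1.5):
    print(f"{cpu:.1f} h CPU -> efficiency {cpu_efficiency(cpu, OVERHEAD):.0%}")
```

Whatever the true overhead is, as long as it stays the same per job, a longer job spends a larger share of its wall time on CPU.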
ID: 39558 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,948,365
RAC: 81,119
Message 39562 - Posted: 10 Aug 2019, 12:42:03 UTC

I agree: The "Average CPU time" graph shows values between 1 and 2 h.
That's not what I meant.

The "Running jobs" graph shows a (more or less) stable rate of 130-178 jobs/h between 2019-08-08 0:00 and 2019-08-10 12:00.
If all of them needed 2 h to complete, I would expect the "Completed jobs" graph to ramp up until roughly the same rate of 130-178 jobs/h is reached between 2019-08-08 2:00 and 2019-08-10 14:00.
Unfortunately, "Completed jobs" shows only 38-78 jobs/h.
The timeframe of 2.5 d seems long enough to cover short-term delays.
Unlike BOINC tasks, Condor jobs are usually not buffered at the clients.
So, why is the "Completed jobs" rate far below the "Running jobs" rate?
Do I misunderstand the definitions, e.g. "running" or "completed"?
ID: 39562 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 678
Credit: 5,525,579
RAC: 1,358
Message 39564 - Posted: 10 Aug 2019, 17:32:04 UTC - in response to Message 39562.  

I agree: The "Average CPU time" graph shows values between 1 and 2 h.
That's not what I meant.

The "Running jobs" graph shows a (more or less) stable rate of 130-178 jobs/h between 2019-08-08 0:00 and 2019-08-10 12:00.
If all of them needed 2 h to complete, I would expect the "Completed jobs" graph to ramp up until roughly the same rate of 130-178 jobs/h is reached between 2019-08-08 2:00 and 2019-08-10 14:00.
Unfortunately, "Completed jobs" shows only 38-78 jobs/h.
The timeframe of 2.5 d seems long enough to cover short-term delays.
Unlike BOINC tasks, Condor jobs are usually not buffered at the clients.
So, why is the "Completed jobs" rate far below the "Running jobs" rate?
Do I misunderstand the definitions, e.g. "running" or "completed"?

I think you might, slightly. You should also be able to access the Grafana summary page, which gives definitions for the graphs (under the "i for information" icon at the top left). For running jobs it says, "Total number of running jobs in a given time bucket," while for completed jobs it's, "Number of jobs that reached completion in a given time bucket." So, if a job runs for two hours it will count as running in up to three [one-hour] time buckets, but it will only count as completed in one bucket. Basically, the number of completed jobs per bucket is the number of running jobs per bucket divided by the average time per job, measured in buckets. You can see this more starkly if you change to the 12-minute binning: the running jobs still stay around 150 (per time bin), but the number of jobs completed per time bin drops to about 12. Conversely, if you change to one-day binning the running jobs stay the same (less for incomplete days) but the completed jobs go up to nearly 1,500/day.
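
To make the bucket counting concrete, here is a minimal Python sketch. It assumes that a job counts as "running" in every bucket its run time touches and as "completed" only in the bucket where it ends, which is my reading of the definitions quoted above rather than Grafana's actual query. `bucket_counts` is an illustrative name.

```python
from collections import Counter


def bucket_counts(jobs, width):
    """Count running and completed jobs per time bucket.

    jobs  -- list of (start, end) times in hours
    width -- bucket width in hours
    A job ending exactly on a bucket boundary is attributed to the
    bucket that starts at that boundary.
    """
    running = Counter()
    completed = Counter()
    for start, end in jobs:
        first = int(start // width)
        last = int(end // width)
        for bucket in range(first, last + 1):
            running[bucket] += 1   # counted once in every bucket it touches
        completed[last] += 1       # counted only where it finishes
    return dict(running), dict(completed)


# A single two-hour job that starts half an hour into bucket 0.
running, completed = bucket_counts([(0.5, 2.5)], width=1.0)
print(running)    # {0: 1, 1: 1, 2: 1}
print(completed)  # {2: 1}
```

With narrower buckets the same job is counted as running in more buckets while still completing in only one, so the completed count per bin falls as the bins shrink.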
ID: 39564 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1481
Credit: 79,948,365
RAC: 81,119
Message 39566 - Posted: 10 Aug 2019, 18:26:12 UTC - in response to Message 39564.  

... if a job runs for two hours it will count as running in up to three [one-hour] time buckets.

Yes.
I think exactly this was my mistake.
I simply ignored the fact that a single job counts multiple times.

Thanks for explaining.
ID: 39566 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 442
Credit: 23,652,021
RAC: 14,291
Message 41820 - Posted: 5 Mar 2020, 12:12:07 UTC

I think that the CMS job graphs should be made available to all, so that you don't need to log in to Grafana (CERN). The SSO login option does not work for Google or Windows Live; it just gives the error 'There was a problem accessing the site.', so I cannot see any of the graphs.
ID: 41820 · Report as offensive     Reply Quote

©2020 CERN