Message boards : LHCb Application : LHCb VMs have longer runtimes

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1428
Credit: 73,041,871
RAC: 106,128
Message 34061 - Posted: 24 Jan 2018, 12:38:35 UTC

Since yesterday evening, the average runtimes of LHCb VMs have been much longer than in the previous weeks.
Are we crunching a different type of job, or is this a result of the server work?
marmot
Joined: 5 Nov 15
Posts: 127
Credit: 6,221,829
RAC: 0
Message 34434 - Posted: 21 Feb 2018, 1:27:28 UTC - in response to Message 34061.  

Since yesterday evening, the average runtimes of LHCb VMs have been much longer than in the previous weeks.
Are we crunching a different type of job, or is this a result of the server work?



I switched one machine to LHCb, and 50 of the 75 WUs typically ran for 2000-4000 seconds and paid about 3 credits each.
That is a lot of bandwidth used for such short runs.

They completed two CONDOR jobs in those 2000 seconds, for example:
Condor JobID: 24211.257
Condor JobID: 24211.462

The other 25 of the 75 WUs, which survived for more than 15,000 seconds (usually ~43,000 seconds), completed between 20 and 60 CONDOR jobs. (This was after I resolved the issues with my WiFi router.)

Are your WUs with extremely long run times performing 80, 120, or 150 CONDOR jobs?

Is the WU's algorithm for deciding how many CONDOR jobs to run before shutting down the VM not working correctly in all environments?
computezrmle
Message 34438 - Posted: 21 Feb 2018, 7:11:00 UTC - in response to Message 34434.  

The majority of LHCb WUs run only 2 short subjobs and then shut down (runtimes: 950-1300 s).
This results in a CPU efficiency of less than 25% but lots of network traffic to set up the VM, which in my eyes is a waste of resources.
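
To make the efficiency figure concrete, here is a minimal sketch of how it could be estimated. It assumes "CPU efficiency" means CPU time divided by wall-clock time times the number of cores given to the VM. The function name and the sample numbers are illustrative only and are not taken from a real task.

```python
def cpu_efficiency(cpu_seconds, wall_seconds, cores=1):
    """Return CPU time as a fraction of the core time that was available."""
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    return cpu_seconds / (wall_seconds * cores)


# Hypothetical single-core task: 1200 s of wall time, 280 s of CPU time.
print(f"{cpu_efficiency(280, 1200):.1%}")
```

A short task spends most of its wall time booting the VM and fetching data, so the ratio stays low until enough subjobs have run to amortise that fixed cost.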

As this situation has not changed for months and nobody from the project team seems interested in explaining it, I run LHCb only occasionally to see if anything has changed.

The longer runtimes I mentioned in my earlier post occurred only during a short period some weeks ago.
Jim1348

Joined: 15 Nov 14
Posts: 416
Credit: 11,880,818
RAC: 2,982
Message 34441 - Posted: 21 Feb 2018, 10:24:29 UTC - in response to Message 34434.  

Are your WUs with extremely long run times performing 80, 120, or 150 CONDOR jobs?

I don't know about the number, but on 10 February I got one LHCb task that ran for 4 hours with 89% CPU usage, and another one that ran for 2 hours with 84% usage. There was also one on that date that ran for 1 hour, but with only 25% CPU usage. All the others were even shorter, around 18 to 25 minutes, with low CPU usage in the range of 25 to 30% on my machine (Ryzen 1700 on Lubuntu 17.10.1). So it seems that the CPU usage is poor until the run time exceeds some minimum of more than one hour, whatever it is.
computezrmle
Message 34443 - Posted: 21 Feb 2018, 10:36:19 UTC - in response to Message 34441.  

Are your WUs with extremely long run times performing 80, 120, or 150 CONDOR jobs?

I don't know about the number ...

Count them:
grep -c 'Job finished in slot' stderr.txt

;-)
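
For anyone who prefers a script, the same count can be sketched in Python. This is only an illustration: the marker string and the file name stderr.txt come from the grep command above, while the function name and the assumption that the file sits in the current directory are mine.

```python
from pathlib import Path

MARKER = "Job finished in slot"  # marker string taken from the grep command


def count_finished_jobs(log_text, marker=MARKER):
    """Count the lines that contain the marker, like `grep -c` does."""
    return sum(1 for line in log_text.splitlines() if marker in line)


if __name__ == "__main__":
    log_path = Path("stderr.txt")  # assumed to be in the current directory
    if log_path.exists():
        print(count_finished_jobs(log_path.read_text(errors="replace")))
    else:
        print(f"{log_path} not found")
```

Like `grep -c`, this counts matching lines rather than individual matches, so a line that contains the marker twice is still counted once.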
Erich56

Joined: 18 Dec 15
Posts: 1260
Credit: 22,998,247
RAC: 2,880
Message 34445 - Posted: 21 Feb 2018, 11:25:55 UTC - in response to Message 34434.  

A lot of bandwidth used for such short runs.
While CMS was down over last weekend, I switched to LHCb for several days. On one of my machines, 8 tasks were running concurrently, resulting in a bandwidth usage of roughly 60 GB per 24 hours.
For me that is no problem, as I have a fast flat-rate connection. For some others, though, this huge usage may not be so nice.
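
To put that figure in perspective, here is a rough back-of-the-envelope sketch. It assumes 1 GB = 10**9 bytes and that the traffic is spread evenly over the day and over the 8 concurrent tasks, which real tasks will not do exactly. The function name is illustrative.

```python
def average_mbit_per_s(gigabytes, hours):
    """Average throughput in Mbit/s, assuming 1 GB = 10**9 bytes."""
    bits = gigabytes * 1e9 * 8
    return bits / (hours * 3600) / 1e6


total_gb, hours, tasks = 60, 24, 8  # figures from the post above
print(f"{average_mbit_per_s(total_gb, hours):.2f} Mbit/s sustained on average")
print(f"{total_gb / tasks:.1f} GB per concurrently running task per day")
```

On a metered or capped connection, the daily total matters more than the average rate, so the per-task figure is the one to multiply by the number of tasks you plan to run.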
Jim1348

Message 34447 - Posted: 21 Feb 2018, 15:08:38 UTC - in response to Message 34443.  

Count them:
grep -c 'Job finished in slot' stderr.txt

Whatever slot it was in appears to be long gone.



©2020 CERN