LHCb VMs have longer runtimes

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609	Message 34061 - Posted: 24 Jan 2018, 12:38:35 UTC Since yesterday evening, the average runtimes of LHCb VMs have been much longer than in the preceding weeks. Are we crunching a different type of job, or is this a result of the server work?

marmot Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0	Message 34434 - Posted: 21 Feb 2018, 1:27:28 UTC - in response to Message 34061. Quote: "Since yesterday evening, the average runtimes of LHCb VMs have been much longer than in the preceding weeks. Are we crunching a different type of job, or is this a result of the server work?" I switched one machine to LHCb, and 50 of the 75 WUs typically ran for 2000-4000 seconds and paid about 3 credits each. A lot of bandwidth used for such short runs. They did two CONDOR jobs in those 2000 seconds. For example, Condor JobID: 24211.257 and Condor JobID: 24211.462 were completed. The other 25 of the 75 WUs, which survived over 15,000 seconds (usually ~43,000 seconds), completed between 20 and 60 CONDOR jobs. (This was after I got the issues with the WiFi router resolved.) Are your extremely long-runtime WUs performing 80, 120, 150 CONDOR jobs? Is the WU's algorithm for deciding the number of CONDOR jobs to run before shutting down the VM not working correctly in all environments?

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609	Message 34438 - Posted: 21 Feb 2018, 7:11:00 UTC - in response to Message 34434. The majority of LHCb WUs run only 2 short subjobs and then shut down (runtimes: 950-1300 s). This results in a CPU efficiency of less than 25% but lots of network traffic to set up the VM, which in my eyes is a waste of resources. As this situation has not changed for months and nobody from the project team seems interested in explaining it, I run LHCb only occasionally to see if there are any changes. The longer runtimes I mentioned in my earlier post occurred only during a short period some weeks ago.

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 34441 - Posted: 21 Feb 2018, 10:24:29 UTC - in response to Message 34434. Quote: "Are your extremely long-runtime WUs performing 80, 120, 150 CONDOR jobs?" I don't know about the number, but on 10 February I got one LHCb task that ran for 4 hours with 89% CPU usage, and another one that ran for 2 hours with 84% usage. There was also one on that date that ran for 1 hour, but with only 25% CPU usage. All the others are even shorter, around 18 to 25 minutes, with low CPU usage in the range of 25 to 30% on my machine (Ryzen 1700 on Lubuntu 17.10.1). So it seems that CPU usage is poor until the runtime exceeds some minimum of more than one hour, whatever it is.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 34,609	Message 34443 - Posted: 21 Feb 2018, 10:36:19 UTC - in response to Message 34441. Quote: "Are your extremely long-runtime WUs performing 80, 120, 150 CONDOR jobs?" "I don't know about the number ..." Count them: grep -c 'Job finished in slot' stderr.txt ;-)
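
The count suggested above can be wrapped in a small function so it is easy to repeat for several logs. This is only a sketch: the function name count_finished_jobs is made up for illustration, and it assumes that every finished Condor job writes exactly one line containing 'Job finished in slot' to the task's stderr.txt, as the grep in the post implies.

```shell
#!/bin/sh
# Count finished Condor jobs in one task log.
# Assumption: one 'Job finished in slot' line per finished job
# (the search string is taken from the post above).
# count_finished_jobs is a hypothetical helper name.
count_finished_jobs() {
    # grep -c prints the number of matching lines. It exits non-zero
    # when nothing matches, so '|| true' keeps the exit status clean.
    grep -c 'Job finished in slot' "$1" || true
}

# Usage, run from the directory that holds the log:
#   count_finished_jobs stderr.txt
```

A task that shut down after two subjobs should report 2 here, while a long-running one should report a correspondingly larger number.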

Erich56 Joined: 18 Dec 15 Posts: 1821 Credit: 118,940,965 RAC: 22,176	Message 34445 - Posted: 21 Feb 2018, 11:25:55 UTC - in response to Message 34434. Quote: "A lot of bandwidth used for such short runs." While CMS was down last weekend, I switched to LHCb for several days. On one of my machines, 8 tasks were running concurrently, resulting in a bandwidth usage of roughly 60 GB per 24 hours. That is no problem for me, as I have a fast flat-rate connection. For some others, though, this huge usage may not be so nice.
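
To put the figure above into perspective, the daily volume can be converted into an average line rate. This is a rough sketch only: the helper name avg_mbit_per_s is hypothetical, and it assumes decimal gigabytes (1 GB = 10^9 bytes) and traffic spread evenly over the day, which real tasks will not do.

```shell
#!/bin/sh
# Average network rate implied by a transfer volume over a time span.
# Assumptions: 1 GB = 10^9 bytes, traffic spread evenly over the period.
# avg_mbit_per_s is a hypothetical helper name.
avg_mbit_per_s() {
    # $1 = gigabytes transferred, $2 = hours
    # GB * 8 * 1000 gives megabits; hours * 3600 gives seconds.
    awk -v gb="$1" -v h="$2" 'BEGIN { printf "%.2f\n", gb * 8 * 1000 / (h * 3600) }'
}

avg_mbit_per_s 60 24   # 60 GB in 24 hours, as reported above
```

For 60 GB in 24 hours this works out to an average of about 5.6 Mbit/s, or roughly 7.5 GB per day for each of the 8 concurrent tasks.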

Jim1348 Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0	Message 34447 - Posted: 21 Feb 2018, 15:08:38 UTC - in response to Message 34443. Quote: "Count them: grep -c 'Job finished in slot' stderr.txt" Whatever slot it was in appears to be long gone.

LHC@home