Message boards : ATLAS application : Processor Time Locks Up Elapsed Time Continues to Climb
Message board moderation

To post messages, you must log in.

Previous · 1 · 2

AuthorMessage
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 43182 - Posted: 4 Aug 2020, 2:16:32 UTC - in response to Message 43179.  

Sorry, yes.
ID: 43182 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43199 - Posted: 8 Aug 2020, 21:44:36 UTC - in response to Message 43181.  

Sorry for the delayed response I was out of town most of the week



I'm not seeing what you describe, but here are the few lines of the VBox.log


00:00:10.586190 IEM: rdmsr(0x4e) -> #GP(0)
00:00:10.720435 APIC0: Switched mode to x2APIC
00:00:10.828486 PIT: mode=0 count=0x10000 (65536) - 18.20 Hz (ch=0)
00:00:10.863375 APIC1: Switched mode to x2APIC
00:00:10.863391 GIM: KVM: VCPU 1: Enabled system-time struct. at 0x000000014bf84040 - u32TscScale=0x9b26c76a i8TscShift=-1 uVersion=2 fFlags=0x1 uTsc=0x4d5156a93 uVirtNanoTS=0x176e06bcd
00:00:10.864397 IEM: rdmsr(0x4e) -> #GP(0)
00:00:12.276698 IEM: wrmsr(0xc90,0x0`000fffff) -> #GP(0)
00:00:13.344427 PIIX3 ATA: Ctl#0: RESET, DevSel=0 AIOIf=0 CmdIf0=0xc4 (-1 usec ago) CmdIf1=0x00 (-1 usec ago)
00:00:13.344508 PIIX3 ATA: Ctl#0: finished processing RESET
00:00:13.344924 PIIX3 ATA: Ctl#1: RESET, DevSel=0 AIOIf=0 CmdIf0=0xa1 (-1 usec ago) CmdIf1=0x00 (-1 usec ago)
00:00:13.344994 PIIX3 ATA: Ctl#1: finished processing RESET
00:00:15.339531 NAT: Link up
00:00:17.175622 VMMDev: Guest Additions information report: Version 5.2.32 r132073 '5.2.32'
00:00:17.175672 VMMDev: Guest Additions information report: Interface = 0x00010004 osType = 0x00053100 (Linux >= 2.6, 64-bit)
00:00:17.175813 VMMDev: Guest Additions capability report: (0x0 -> 0x0) seamless: no, hostWindowMapping: no, graphics: no
00:00:17.176096 VMMDev: Guest reported fixed hypervisor window at 00001800000 LB 0x2400000 (rc=VINF_SUCCESS)
00:00:17.176121 VMMDev: vmmDevReqHandler_HeartbeatConfigure: No change (fHeartbeatActive=false)
00:00:17.176141 VMMDev: Heartbeat flatline timer set to trigger after 4 000 000 000 ns
00:00:17.176199 VMMDev: Guest Log: vgdrvHeartbeatInit: Setting up heartbeat to trigger every 2000 milliseconds
00:00:17.176806 VMMDev: Guest Log: vboxguest: misc device minor 58, IRQ 20, I/O port d020, MMIO at 00000000f0400000 (size 0x400000)
00:00:17.576133 Display::i_handleDisplayResize: uScreenId=0 pvVRAM=000000000ac60000 w=800 h=600 bpp=32 cbLine=0xC80 flags=0x1 origin=0,0
00:00:21.852605 NAT: IPv6 not supported
00:00:22.141405 NAT: DHCP offered IP address 10.0.2.15
00:00:22.141752 NAT: DHCP offered IP address 10.0.2.15
00:00:24.308484 VMMDev: Guest Log: Checking CVMFS...
00:00:35.698186 VMMDev: Guest Log: VBoxService 5.2.32 r132073 (verbosity: 0) linux.amd64 (Jul 12 2019 10:32:28) release log
00:00:35.698208 VMMDev: Guest Log: 00:00:00.000211 main Log opened 2020-08-08T12:40:50.916026000Z
00:00:35.698272 VMMDev: Guest Log: 00:00:00.000313 main OS Product: Linux
00:00:35.698310 VMMDev: Guest Log: 00:00:00.000354 main OS Release: 3.10.0-957.27.2.el7.x86_64
00:00:35.698344 VMMDev: Guest Log: 00:00:00.000389 main OS Version: #1 SMP Mon Jul 29 17:46:05 UTC 2019
00:00:35.698383 VMMDev: Guest Log: 00:00:00.000424 main Executable: /opt/VBoxGuestAdditions-5.2.32/sbin/VBoxService
00:00:35.698391 VMMDev: Guest Log: 00:00:00.000426 main Process ID: 1490
00:00:35.698396 VMMDev: Guest Log: 00:00:00.000427 main Package type: LINUX_64BITS_GENERIC
00:00:35.703679 VMMDev: Guest Log: 00:00:00.005729 main 5.2.32 r132073 started. Verbose level = 0
00:00:35.704600 Guest Control: GUEST_MSG_REPORT_FEATURES: 0x1, 0x8000000000000000
00:00:45.706543 VMMDev: Guest Log: 00:00:10.008554 timesync vgsvcTimeSyncWorker: Radical guest time change: 25 211 221 087 000ns (GuestNow=1 596 915 662 144 209 000 ns GuestLast=1 596 890 450 923 122 000 ns fSetTimeLastLoop=true )
00:10:11.861076 Display::i_handleDisplayResize: uScreenId=0 pvVRAM=000000000ac60000 w=800 h=600 bpp=0 cbLine=0xC80 flags=0x5 origin=0,0
ID: 43199 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 43200 - Posted: 9 Aug 2020, 6:24:21 UTC

Your Guest-Edition is 5.2.32 and your Virtualbox is 6.0.14.
Rosetta, Milkyway (GPU) running well, I think WCG also.
Please upgrade Guest-Edition to 6.0.14.
It can be this combination of RAM (16 GByte) and other projects using.
You have Atlas and Theory finishing correct, but only a few.
A overload of Boinc is possible, to have this Errors in LHC.
ID: 43200 · Report as offensive     Reply Quote
Henry Nebrensky

Send message
Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 43201 - Posted: 9 Aug 2020, 12:16:31 UTC - in response to Message 43199.  

I'm not seeing what you describe, but here are the few lines of the VBox.log...

A look some of the failed Theory tasks reveals they all report something like
2020-08-03 22:57:53 (13876): Guest Log: Probing /cvmfs/alice.cern.ch... Failed!
suggesting the VM is seeing network issues. Have you worked through Yeti's checklist on that machine?
ID: 43201 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43225 - Posted: 17 Aug 2020, 6:46:00 UTC - in response to Message 43200.  

Thanks, Guys


Guest Edition?

Not familiar with that term

Already at VBox 6.0.14 with Extension Pack 6.0.14


I'll review Yeti's checklist one more time
ID: 43225 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43227 - Posted: 17 Aug 2020, 23:58:39 UTC
Last modified: 18 Aug 2020, 0:01:13 UTC

Double checked Yeti's check list - all good

Called my ISP and asked if there was anything they could do to reduce latency

They couldn't access the modem control panel from their end, and I couldn't access it from within my home network

They also said it was over 7 years old and offered to send me a new one, which should be here WED afternoon sometime

Let me get it hooked up and set "accept new tasks" here and lets see what happens



Thanks for all the responses so far

Really appreciated
ID: 43227 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43241 - Posted: 20 Aug 2020, 21:29:09 UTC

Well got the replacement modem, got it installed and worked with a Supervisor to get it back to a measures speed of the old one. When it was first installed, it was about half! (the first guy was an absolute idiot who didn't listen to a thing I told him - Yeah, Indian sub-continent))

First WU completed

The next two I had to abort
of note, BOTH of those two were aborted by other users, as well

I'm currently at max bandwidth, for my ISP this area (small neighborhood separated from the rest of the city by a river) so may have to shop for another carrier

(Yeah I was told "we are upgrading our service throughout the country and surveys are underway even now HaHaHa)


Guess I need to shop around and see if anyone else has better bandwidth, here



Looks like for the meantime, I either don't run Atlas, or do and accept a success rate around 1 in 5 or 1 in 8 though that's not really fair to the project or other crunchers



<sigh>
ID: 43241 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,889,766
RAC: 138,323
Message 43243 - Posted: 21 Aug 2020, 8:46:45 UTC - in response to Message 43241.  

Could you post some values from your network and your internet connection?


LAN

Cable or wi-fi?
Bandwidth, e.g. cable 100 Mbit/s or 1 Gbit/s



Internet

Typical sync between your (new) modem and your ISP?
Download bandwidth: xxxxx Mbit/s
Upload bandwidth: yyyyy Mbit/s
Ping to cms-frontier.openhtc.io: xx ms

(if cms-frontier.openhtc.io doesn't reply you may try Google's DNS "ping 8.8.8.8")
ID: 43243 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43244 - Posted: 21 Aug 2020, 16:09:42 UTC - in response to Message 43243.  

copper to the house, WiFi throughout Contract is 24Mb Modem/Router is a Two Wire
Speed from desktop to the router from 56 to over 120 Mbps, currently 72, which is about average My to-do list for the day includes making a few phone calls to other ISPs

The supervisor I talked to suggested trying copper from router to the desktop to see if if made any difference was going to get some the next-time I was out and about If it helps, I've then got to figure out how to permanently run it frim the Kitchen (middle of the house) to my den

C:\WINDOWS\system32>ping 8.8.8.8

Pinging 8.8.8.8 with 32 bytes of data:
Reply from 8.8.8.8: bytes=32 time=25ms TTL=117
Reply from 8.8.8.8: bytes=32 time=31ms TTL=117
Reply from 8.8.8.8: bytes=32 time=25ms TTL=117
Reply from 8.8.8.8: bytes=32 time=24ms TTL=117







from fast.com
Your Internet speed is
16
Mbps


Latency
Unloaded
25 ms

Loaded
125 ms


Upload
Speed
4.2 Mbps

ID: 43244 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,889,766
RAC: 138,323
Message 43252 - Posted: 22 Aug 2020, 10:38:26 UTC - in response to Message 43244.  

copper to the house, WiFi throughout Contract is 24Mb Modem/Router is a Two Wire
Speed from desktop to the router from 56 to over 120 Mbps, currently 72, which is about average My to-do list for the day includes making a few phone calls to other ISPs

The supervisor I talked to suggested trying copper from router to the desktop to see if if made any difference was going to get some the next-time I was out and about If it helps, I've then got to figure out how to permanently run it frim the Kitchen (middle of the house) to my den

C:\WINDOWS\system32>ping 8.8.8.8

Pinging 8.8.8.8 with 32 bytes of data:
Reply from 8.8.8.8: bytes=32 time=25ms TTL=117
Reply from 8.8.8.8: bytes=32 time=31ms TTL=117
Reply from 8.8.8.8: bytes=32 time=25ms TTL=117
Reply from 8.8.8.8: bytes=32 time=24ms TTL=117







from fast.com
Your Internet speed is
16
Mbps


Latency
Unloaded
25 ms

Loaded
125 ms


Upload
Speed
4.2 Mbps

The most relevant values are:
Internet speed (down): 16 Mbps
Internet speed (up): 4.2 Mbps
Latency (unloaded): 25 ms

Not among the top values but enough to run a few ATLAS tasks.
You may calculate per task:
430 MB initial download -> 215 s
150 MB result upload -> 300 s

While these transfers are in progress they slow down all other internet traffic, even uploads slow down downloads and vice versa.


The unloaded latency is not too bad but becomes very bad as soon as your wi-fi is under load (25 ms -> 117 ms).
This influences the setup phase of each task since then roughly 2000 small and medium sized files are to be downloaded (most likely while the result upload of the previous task is still in progress).


I agree with the suggestion from your ISP's supervisor:
Copper from the router to the desktop would be much more stable and reliable.

In addition you should ensure that only very few tasks are in the same phase at the same time.
ID: 43252 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43255 - Posted: 22 Aug 2020, 16:20:15 UTC - in response to Message 43252.  

Thanks for the response

I'll get some CAT6 today or tomorrow

I only run 1 LHC task at time, but have BOINC set to use 10 of my 12 "processors" (Hex core/MT)


The last two ATLAS tasks ran to completion, waiting for another at this point

Curiously, one of them kind of lugged along getting 1-2 sec of CPU time per 20 seconds of Elapsed, then at about 8:30 took off and ran to completion!
ID: 43255 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,889,766
RAC: 138,323
Message 43256 - Posted: 22 Aug 2020, 17:43:57 UTC - in response to Message 43255.  

The scientific app runs inside a VM and is therefore hidden.
Hence, the BOINC client doesn't really show you what's going on and can't calculate reliable time estimations.
2020-08-21 15:01:51 (15880): Adding VirtualBox Guest Additions to VM.

This line from one of your logfiles shows that VirtualBox Guest Additions are installed on your computer.
They allow you to look directly into the VM - at least at some of the linux consoles there.

At BOINC Manager's task view highlight the task you want to examine.
Then on the left side click at "show virtual console".
A new window opens which can be controlled via the keys "ALT + F1" ... "ALT + Fn".

ALT + F2
ATLAS event monitoring
Shows/Calculates a couple of numbers based on the scientific app's original logfiles as soon as they become available.
Be patient. This can take a while.

ALT + F3
The linux "top" process
ID: 43256 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43257 - Posted: 23 Aug 2020, 6:37:05 UTC - in response to Message 43256.  

So when the task "locks up" and gets no more CPU time, as verified by Windows Resource Manager, what am I looking for?


Current WU is at 13:48:24 and crunching happily along
ID: 43257 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,889,766
RAC: 138,323
Message 43258 - Posted: 23 Aug 2020, 7:46:37 UTC - in response to Message 43257.  

So when the task "locks up" and gets no more CPU time, as verified by Windows Resource Manager, what am I looking for?

Each ATLAS task has 3 main phases.
1. setup
2. calculation
3. stage-out


Although you use a 2-core setup phases (1.) and (3.) run on a single core, sometimes not even on a full core, e.g. while downloads are in progress or during disk I/O.
The 2nd core remains allocated by the VM but runs idle.

An idle core can also be seen at the end of phase (2.) due to the following reason.
Each task transfers 200 events to hits.
Each transfer is done by a separate worker thread and needs between a few seconds and far more than 1200 seconds.
Now guess 199 events are finished and the last one needs 1200 s.
This will also leave 1 core of your VM idle.


Phase (2.) can be identified at console 3 (ALT + F3) that shows 2*) "athena.py" processes each running at nearly 100% CPU.
Console 2 (ALT + F2) shows how many events are done and tries to estimate the time left.
This estimation is based on statistical values since nobody knows how much time the last event really needs.



*) n in case of an n-core setup
ID: 43258 · Report as offensive     Reply Quote
keputnam

Send message
Joined: 27 Sep 04
Posts: 102
Credit: 7,086,947
RAC: 1,340
Message 43263 - Posted: 24 Aug 2020, 5:45:12 UTC

Well, running copper seems to have reduced latency considerably

Your Internet speed is
29Mbps

Latency
Unloaded
23 ms

Loaded
35 ms

Upload
Speed
4.6 Mbps



Finished one WU after installing the CAT6, and the next is now an hour and a half into execution, crunching along quite nicely



Thanks for all the assistance, guys

ID: 43263 · Report as offensive     Reply Quote
Previous · 1 · 2

Message boards : ATLAS application : Processor Time Locks Up Elapsed Time Continues to Climb


©2024 CERN