Questions and Answers : Windows : Dual CPU Xeon Windows 10 - configuration for LHC?

Bradders
Joined: 3 Jan 17
Posts: 13
Credit: 497,855
RAC: 0
Message 42932 - Posted: 1 Jul 2020, 0:59:24 UTC

I previously had issues running large WUs, in part because 8 cores (2 physical CPUs x 4 cores) need more than 8 GB of RAM between them. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4281#30435
At the time I followed Yeti's Checklist 3, but only got so far before the lack of RAM became the limit, so I disconnected from LHC. https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161#29359

Since then (2017) I have rebuilt the PC on a clean install of Windows 10 Pro, and I now have 32 GB. But just now the PC spat out a bunch of Theory Simulation v300.06 (vbox64_theory) windows_x86_64 WUs after only 7.8 s. Note: it seems that everyone who has tried one of those WUs has finished with an error.

I'll work through the rest of checklist 3 to see if I can get it working again. Is there a more recent checklist to configure a Windows PC for LHC?

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,029
RAC: 3,972
Message 42942 - Posted: 1 Jul 2020, 19:48:10 UTC - in response to Message 42932.  

Try updating VirtualBox (and the Extension Pack) to a newer version and see if that helps, since the version you are running is from last year.

https://www.virtualbox.org/wiki/Download_Old_Builds_6_1

Bradders
Joined: 3 Jan 17
Posts: 13
Credit: 497,855
RAC: 0
Message 42944 - Posted: 1 Jul 2020, 23:50:34 UTC - in response to Message 42942.  

Thanks
Since those failed jobs I've installed VB 6.1.10 (latest version) and I just installed its extension pack.
Awaiting a job from LHC...

Bradders
Joined: 3 Jan 17
Posts: 13
Credit: 497,855
RAC: 0
Message 42949 - Posted: 3 Jul 2020, 2:34:07 UTC - in response to Message 42944.  

I now have 8 Theory jobs running in parallel. They have run for about 1 day and have about 9 days left to run. (I've just marked other projects as "no new work", which might speed things up a bit.)
Memory used is 16 GB of 32 GB.
So far so good.

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,029
RAC: 3,972
Message 42960 - Posted: 3 Jul 2020, 22:57:35 UTC - in response to Message 42949.  

Yeah, Theory tasks don't use much memory, so 8 will run easily in 16 GB; I have had as many as 30 running in 28 GB of RAM.

Are those still running? It has been 20 more hours since you said they had run for about a day.

That "9 days" estimate is not how long they will actually take; BOINC just shows that for the longer ones.

You can look at them in VB Manager to see whether all 8 are running. Right-click any of the running tasks and choose "Show Log" to see what it is doing.
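
For anyone who prefers the command line to the VB Manager window, the same "are all 8 running?" check can be sketched in Python. This is only a sketch: it assumes `VBoxManage` is on the PATH and that `VBoxManage list runningvms` prints one `"name" {uuid}` pair per line. The sample VM name below is made up.

```python
import re
import subprocess

# Each line of `VBoxManage list runningvms` is expected to look like:
#   "vm name" {uuid}
VM_LINE = re.compile(r'^"(?P<name>.*)" \{(?P<uuid>[0-9a-fA-F-]+)\}$')


def parse_running_vms(output):
    """Return (name, uuid) pairs parsed from `VBoxManage list runningvms` output."""
    vms = []
    for line in output.splitlines():
        match = VM_LINE.match(line.strip())
        if match:
            vms.append((match.group("name"), match.group("uuid")))
    return vms


def list_running_vms(vboxmanage="VBoxManage"):
    """Ask VirtualBox which VMs are running (assumes VBoxManage is on the PATH)."""
    result = subprocess.run(
        [vboxmanage, "list", "runningvms"],
        capture_output=True, text=True, check=True,
    )
    return parse_running_vms(result.stdout)


# Hypothetical sample output; real BOINC VM names will differ.
sample = '"boinc_example_vm" {01234567-89ab-cdef-0123-456789abcdef}\n'
print(parse_running_vms(sample))
```

Calling `len(list_running_vms())` on the host should then give the number of VMs VirtualBox considers running, which says nothing yet about whether the job inside each VM is healthy.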

Bradders
Joined: 3 Jan 17
Posts: 13
Credit: 497,855
RAC: 0
Message 42978 - Posted: 8 Jul 2020, 4:48:59 UTC - in response to Message 42960.  

I opened VB and looked at the logs. I have no idea what to look for.
Task Manager was at 100% across the 8 cores.
BOINC showed 8 WUs ticking along. (The expected time was 11 days.)
I had to restart that PC, so I suspended BOINC first and waited for the VB jobs to 'suspend' (I can't remember the exact term).

After restart, the VB Manager says that the 8 WU are running, but there is not much CPU action; just a blip across all 8 cores every 20s or so.
Task Manager shows BOINC tasks (12), VB Manager and about 20 VB Headless Frontend tasks, but not a lot of action.

Did I kill the delicate flowers?

Bradders
Joined: 3 Jan 17
Posts: 13
Credit: 497,855
RAC: 0
Message 42979 - Posted: 8 Jul 2020, 4:57:23 UTC - in response to Message 42932.  

Sample log output from one of the WU. (The other 7 WU don't have the last two lines in their VBox.log.n logs, but otherwise look similar.):
VBox.log
00:00:48.722666 Display::i_handleDisplayResize: uScreenId=0 pvVRAM=0000000007d10000 w=800 h=600 bpp=32 cbLine=0xC80 flags=0x1 origin=0,0
00:00:48.723117 Changing the VM state from 'LOADING' to 'SUSPENDED'
00:00:48.723224 Changing the VM state from 'SUSPENDED' to 'RESUMING'
00:00:48.723471 NAT: Link down
00:00:48.723504 Changing the VM state from 'RESUMING' to 'RUNNING'
00:00:48.723583 Console: Machine state changed to 'Running'
00:00:51.905008 NAT: Link up
00:00:51.910992 NAT: DNS#0: 192.168.178.1
00:00:53.514000 NAT: IPv6 not supported
03:39:25.016189 NAT: DHCP offered IP address 10.0.2.15
13:33:26.537610 NAT: DHCP offered IP address 10.0.2.15
23:41:27.358912 NAT: DHCP offered IP address 10.0.2.15
34:04:21.029286 NAT: DHCP offered IP address 10.0.2.15
43:18:55.491941 NAT: DHCP offered IP address 10.0.2.15

VBoxHardening.log
1ca4.1448: supR3HardenedWinVerifyCacheScheduleImports: Import todo: #21 'ws2_32.dll'.
1ca4.1448: supR3HardenedWinVerifyCacheScheduleImports: Import todo: #23 'nsi.dll'.
1ca4.1448: supHardenedWinVerifyImageByHandle: -> 0 (\Device\HarddiskVolume2\Windows\System32\dnsapi.dll)
1ca4.1448: supR3HardenedWinVerifyCacheInsert: \Device\HarddiskVolume2\Windows\System32\dnsapi.dll
1ca4.1448: supR3HardenedDllNotificationCallback: load 00007ffab5500000 LB 0x000cb000 C:\Windows\SYSTEM32\DNSAPI.dll [fFlags=0x0]
1ca4.1448: supR3HardenedScreenImage/LdrLoadDll: cache hit (VINF_SUCCESS) on \Device\HarddiskVolume2\Windows\System32\dnsapi.dll [avoiding WinVerifyTrust]
1ca4.295c: '\Device\HarddiskVolume2\Windows\System32\tzres.dll' has no imports
1ca4.295c: supHardenedWinVerifyImageByHandle: -> 22900 (\Device\HarddiskVolume2\Windows\System32\tzres.dll)
1ca4.295c: supR3HardenedWinVerifyCacheInsert: \Device\HarddiskVolume2\Windows\System32\tzres.dll
1ca4.295c: supR3HardenedMonitor_NtCreateSection: NtMapViewOfSection failed on 0000000000000ddc (hFile=0000000000000dfc) with 0xc0000022 -> STATUS_TRUST_FAILURE
1ca4.295c: supR3HardenedScreenImage/NtCreateSection: cache hit (Unknown Status 22900 (0x5974)) on \Device\HarddiskVolume2\Windows\System32\tzres.dll [avoiding WinVerifyTrust]
1ca4.295c: supR3HardenedMonitor_NtCreateSection: NtMapViewOfSection failed on 0000000000000dfc (hFile=0000000000000ddc) with 0xc0000022 -> STATUS_TRUST_FAILURE

VBox.log.1
00:02:56.742450 GIM: KVM: Resetting MSRs
00:02:56.743897 Changing the VM state from 'DESTROYING' to 'TERMINATED'
00:02:56.746486 Console: Machine state changed to 'Saved'
00:02:57.548213 GUI: Passing request to close Runtime UI from machine-logic to UI session.
00:02:57.548373 GUI: UIMediumEnumerator: Medium-enumeration finished!

VBox.log.2
00:04:56.067672 GIM: KVM: Resetting MSRs
00:04:56.069116 Changing the VM state from 'DESTROYING' to 'TERMINATED'
00:04:56.071432 Console: Machine state changed to 'Saved'
00:04:56.967180 GUI: Passing request to close Runtime UI from machine-logic to UI session.
00:04:56.967888 GUI: UIMediumEnumerator: Medium-enumeration finished!

VBox.log.3
00:23:28.762098 GIM: KVM: Resetting MSRs
00:23:28.763427 Changing the VM state from 'DESTROYING' to 'TERMINATED'
00:23:28.765871 Console: Machine state changed to 'Saved'
00:23:29.832781 GUI: Passing request to close Runtime UI from machine-logic to UI session.
00:23:29.842720 GUI: UIMediumEnumerator: Medium-enumeration finished!
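
A minimal sketch for anyone else unsure what to look for in these logs: the lines starting with "Changing the VM state" show what the hypervisor did with the VM. The helper below pulls out only those lines. The function name is made up for illustration, and the VM state alone does not tell you whether the job inside the VM is working.

```python
import re

# Matches lines such as:
#   00:00:48.723117 Changing the VM state from 'LOADING' to 'SUSPENDED'
STATE_CHANGE = re.compile(
    r"^(?P<ts>\d+:\d{2}:\d{2}\.\d+) Changing the VM state from "
    r"'(?P<old>[A-Z_]+)' to '(?P<new>[A-Z_]+)'"
)


def vm_state_changes(log_text):
    """Return (timestamp, old_state, new_state) tuples found in a VBox.log text."""
    changes = []
    for line in log_text.splitlines():
        match = STATE_CHANGE.match(line.strip())
        if match:
            changes.append((match.group("ts"), match.group("old"), match.group("new")))
    return changes


# Lines copied from the VBox.log excerpt above.
sample_log = """\
00:00:48.723117 Changing the VM state from 'LOADING' to 'SUSPENDED'
00:00:48.723224 Changing the VM state from 'SUSPENDED' to 'RESUMING'
00:00:48.723471 NAT: Link down
00:00:48.723504 Changing the VM state from 'RESUMING' to 'RUNNING'
"""

changes = vm_state_changes(sample_log)
for ts, old, new in changes:
    print(f"{ts}: {old} -> {new}")
print("last state:", changes[-1][2])
```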

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,029
RAC: 3,972
Message 42980 - Posted: 8 Jul 2020, 6:49:09 UTC - in response to Message 42978.  
Last modified: 8 Jul 2020, 6:54:25 UTC

Bradders wrote:
> I opened VB and looked at the logs. I have no idea what to look for.
> Task Manager was at 100% across the 8 cores.
> After restart, the VB Manager says that the 8 WU are running, but there is not much CPU action; just a blip across all 8 cores every 20s or so.
> Did I kill the delicate flowers?


Yes, VB tasks can wilt like a weed once in a while (more often if you don't have a fast ISP).

These Theory tasks use your RAM but not much CPU, and when you check Task Manager the CPU load you see is mainly what it takes to run your PC and the OS.

So what to check there is whether about 16 GB of RAM is still in use while the 8 tasks are running on your PC.

And you know how they can be if you ever have to reboot (especially when Windows 10 does an update reboot without asking you first).

I have been running these VB tasks daily for 10 years now, so I get to have that fun several times every week.

I just caught one of my PCs doing an update reboot today, so I lost 3 of the 8 running tasks, and one was at about 30 hours, so I wasn't very happy. (When that happens to you, the best thing to do after the reboot, as soon as you have entered your PIN, is to suspend the tasks ASAP.)

But you did your reboot the correct way (checking the VB Manager first).
Now that they are running again, the best check is in BOINC Manager: click on one of the running tasks and open its VM Console (on the left side of the window). If the task is working, it should be back to normal.

If not, it will just say it FAILED, and the bad thing is that these tasks are notorious for continuing to run until you notice and abort them.

So check them one at a time and see whether each has Failed or still shows this at the bottom of the Console page:

[INFO] ===> [runRivet] Time/date XXX [boinc pp jets 7000 10 - event generatorXXX
If it shows [runRivet], the time/date and the particular event generator (most likely pythia), you know it is working.

If instead a task shows FAILED on the Console, it should just be aborted, since it will never finish as Valid and you would just be wasting your time.


BTW, I only run about 8 Theory tasks here each day, plus hundreds of the test version on the test site. Lately I have been getting the short ones and some of the rare long ones, and I finished a couple of Valids at over 60 hours each, so watch for those; you will never get a Valid one that runs much longer than that. The 9-day estimate is just a wild BOINC guess that never comes true (same with the 11-day estimate).

Good luck

Bradders
Joined: 3 Jan 17
Posts: 13
Credit: 497,855
RAC: 0
Message 42982 - Posted: 9 Jul 2020, 6:01:40 UTC - in response to Message 42980.  

Sadly, they were all in that 'probing failed' state. I shut them all down and updated BOINC to clear them out.
My Account shows run time and CPU times. Example: Run Time 474,926.15 CPU Time 10,107.56

I'll keep working through the configuration list, and I'll post here if/when I get another LHC job.
Thanks for your guidance.
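
As a rough sanity check on the run time and CPU time figures quoted above, the ratio of the two shows how little of the elapsed time the task actually spent computing. A minimal sketch, assuming both values are in seconds (as BOINC usually reports them):

```python
def cpu_efficiency(run_time_s, cpu_time_s):
    """Fraction of the elapsed run time that was spent on the CPU."""
    if run_time_s <= 0:
        raise ValueError("run time must be positive")
    return cpu_time_s / run_time_s


# Figures from the example above (assumed to be seconds).
run_time_s = 474_926.15
cpu_time_s = 10_107.56

efficiency = cpu_efficiency(run_time_s, cpu_time_s)
print(f"CPU efficiency: {efficiency:.1%}")  # prints "CPU efficiency: 2.1%"
```

A ratio this low fits a task that sat idle after a failed probe rather than one that was crunching the whole time.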

Bradders
Joined: 3 Jan 17
Posts: 13
Credit: 497,855
RAC: 0
Message 42994 - Posted: 10 Jul 2020, 2:42:48 UTC - in response to Message 42982.  

And another 8 WUs have just started, and all have the same 'probing functions' errors.
I'll disable LHC until I can get it configured.

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,144,248
RAC: 105,364
Message 42995 - Posted: 10 Jul 2020, 3:57:00 UTC - in response to Message 42994.  

You can cancel your Theory tasks, since they have been running unsuccessfully for so long.

This is a problem in your configuration for Theory tasks.
Please set your preferences for Theory to ONE task using ONE CPU.
If that task also has problems, you can go back to VirtualBox 6.0.22 or 6.1.8 (with the Extension Pack).
Do you have an app_config.xml? If not, that's OK.
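
If you would rather enforce the one-task limit locally than through the website preferences, an app_config.xml can express it. The sketch below only generates the file's text. It assumes the application is named "Theory" on your client (check the name your client reports) and takes the plan class "vbox64_theory" from the task name quoted earlier in this thread. `max_concurrent` and `avg_ncpus` are standard app_config.xml elements.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom


def build_app_config(app_name, plan_class, max_concurrent, ncpus):
    """Return the text of a BOINC app_config.xml limiting one application."""
    root = ET.Element("app_config")

    # Limit how many tasks of this application run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # Limit how many CPUs each task of this plan class is given.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)

    raw = ET.tostring(root, encoding="unicode")
    return minidom.parseString(raw).toprettyxml(indent="  ")


# "Theory" is an assumed application name; adjust it to match your client.
xml_text = build_app_config("Theory", "vbox64_theory", max_concurrent=1, ncpus=1)
print(xml_text)
```

The resulting text would go into a file named app_config.xml in the LHC@home project folder of the BOINC data directory, after which the client has to re-read its config files.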

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,955,814
RAC: 136,935
Message 42996 - Posted: 10 Jul 2020, 6:05:02 UTC - in response to Message 42994.  

2020-07-02 11:44:43 (3204): Guest Log: 03:44:44 CEST +02:00 2020-07-02: cranky: [INFO] Checking CVMFS.
2020-07-02 11:45:28 (3204): Guest Log: Probing /cvmfs/sft.cern.ch... Failed!
2020-07-02 11:45:28 (3204): Guest Log: 03:45:29 CEST +02:00 2020-07-02: cranky: [ERROR] 'cvmfs_config probe sft.cern.ch' failed.

CVMFS communicates via HTTP.
Since all of your tasks fail at the same point it looks like those HTTP packets (or others, e.g. DNS) are either blocked or the response is always delayed.
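
To spot this failure without reading every task log by hand, the log text can be scanned for the probe lines. A minimal sketch: the "Failed!" pattern is taken from the lines above, the "OK" counterpart is an assumption about what a successful probe prints, and the function names are made up for illustration.

```python
import re

# "Probing <repo>... OK" is assumed to be the success counterpart of the
# "Probing <repo>... Failed!" line quoted above.
PROBE_LINE = re.compile(r"Probing (?P<repo>/cvmfs/\S+?)\.\.\. (?P<result>OK|Failed!)")


def probe_results(log_text):
    """Map each probed CVMFS repository to True (OK) or False (Failed!)."""
    return {
        m.group("repo"): m.group("result") == "OK"
        for m in PROBE_LINE.finditer(log_text)
    }


def task_status(log_text):
    """Rough classification of a task log: 'failed', 'running' or 'unknown'."""
    if not all(probe_results(log_text).values()):
        return "failed"
    if "[runRivet]" in log_text:
        return "running"
    return "unknown"


# Lines copied from the log excerpt above.
sample = """\
2020-07-02 11:44:43 (3204): Guest Log: 03:44:44 CEST +02:00 2020-07-02: cranky: [INFO] Checking CVMFS.
2020-07-02 11:45:28 (3204): Guest Log: Probing /cvmfs/sft.cern.ch... Failed!
2020-07-02 11:45:28 (3204): Guest Log: 03:45:29 CEST +02:00 2020-07-02: cranky: [ERROR] 'cvmfs_config probe sft.cern.ch' failed.
"""

print(probe_results(sample))
print(task_status(sample))
```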

You may check:
- Do you run a firewall (or a "security" app) that blocks network packets from BOINC, VirtualBox or the user that runs LHC tasks?
- Is your computer connected to your LAN via wi-fi (less good) or LAN cable (better)?
- Could you tell us some specs of your internet connection; download/upload bandwidth, latency (ping)?
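
For the last point, a rough way to get HTTP timing figures from the host is sketched below. It is not a CVMFS client and does not run inside the VM; it only times plain HTTP GETs from Python, and the URL to test is left for you to choose. The numbers in the example are invented.

```python
import statistics
import time
import urllib.request


def time_http_get(url, timeout_s=10.0):
    """Return the wall-clock time in milliseconds for one HTTP GET of `url`."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        response.read()
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms):
    """Median and worst-case value of a list of timings in milliseconds."""
    return {"median_ms": statistics.median(samples_ms), "max_ms": max(samples_ms)}


def measure(url, repeats=5):
    """Time several GETs in a row; raises on DNS or connection errors."""
    return summarize([time_http_get(url) for _ in range(repeats)])


# Example with made-up timings (no network access needed):
print(summarize([42.0, 55.5, 48.0, 300.0, 47.5]))
# To measure for real, call measure() with a URL of your choice.
```

A large gap between the median and the maximum hints at the delayed responses described above, while an exception from `measure()` points towards blocked DNS or HTTP traffic.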

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,029
RAC: 3,972
Message 43014 - Posted: 10 Jul 2020, 21:00:20 UTC
Last modified: 10 Jul 2020, 21:18:17 UTC

When that happens it is always the internet speed, especially when trying to start several (8) at a time.
With Windows 10 you can watch Task Manager - Performance - Internet Speed when you start a single Theory task.

If it is not close to 3 Mbps or better, you will have that same problem every time.
And it is even worse if you try running a CMS task.

Unless you have the fastest internet, there is no way that 8 tasks will all start up and make it to runRivet.
At my fastest I have 40-50 Mbps and I STILL cannot have 8 of these start at the same time without a failure at one of the 4 *probes*.

I have run thousands of these and I actually watch EVERY one start from start to runRivet on the Console so I know in the first 3 minutes if it will run or Fail.

Example: [screenshot not preserved]


In a perfect world these tasks would abort themselves within 5 minutes of failing, instead of continuing to run for days for no reason until the user aborts them.

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,029
RAC: 3,972
Message 43019 - Posted: 11 Jul 2020, 2:56:23 UTC
Last modified: 11 Jul 2020, 3:02:31 UTC

(that server sure is picky about even giving us any Theory tasks right now)

Another thing I should add: if you want to run Theory VB tasks and don't have a really fast ISP, you have to be careful and even watch them to make sure they are actually going to run.

As I mentioned, when my ISP (satellite) throttles my speed down, like right now, it will not run any faster than 3.1 Mbps.
So when I am starting up my daily batch of tasks, and since I have 10 PCs running, I can only have one connected to the internet at a time (you can suspend tasks on one PC if you want to start some on another).

I just finished my new batch, BUT with only 3.1 Mbps I can only start one task and run it until it is beyond cranky:[INFO] ===> [runRivet] on the VM Console. Since I am loading and starting 8 cores of work (and many more), I have to suspend the one that is running to start up the next one, or I will end up with a Failed task.

That can happen in the first 10 seconds, or after 3 minutes when one of the "probes" doesn't get the OK. Many times it fails at about the 3-minute mark if it doesn't get the OK at the alice.cern.ch probe (best when you get that far by 2.5 minutes).
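
The one-at-a-time start described above could be scripted roughly as follows. This is a simplified sketch, not what I actually run: it only builds a plan of `boinccmd --task <url> <name> suspend|resume` commands, uses a fixed delay in place of watching the Console for [runRivet], and does not suspend already-started tasks. The task names are hypothetical and the 180-second delay is just the 3-minute mark mentioned above.

```python
PROJECT_URL = "https://lhcathome.cern.ch/lhcathome/"


def staggered_plan(task_names, delay_s, project_url=PROJECT_URL):
    """Return (start_offset_seconds, argv) pairs that start tasks one by one.

    All tasks except the first are suspended at t=0, then resumed one at a
    time, `delay_s` seconds apart.
    """
    plan = []
    waiting = task_names[1:]
    for name in waiting:
        plan.append((0, ["boinccmd", "--task", project_url, name, "suspend"]))
    for position, name in enumerate(waiting, start=1):
        plan.append(
            (position * delay_s, ["boinccmd", "--task", project_url, name, "resume"])
        )
    return plan


# Hypothetical task names; use the names shown in your BOINC Manager.
tasks = ["Theory_example_0", "Theory_example_1", "Theory_example_2"]
for offset, argv in staggered_plan(tasks, delay_s=180):
    print(f"t+{offset:>4}s: {' '.join(argv)}")
```

Each planned command could then be run with `subprocess.run(argv)` at its offset, assuming `boinccmd` is on the PATH and allowed to talk to the local client.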

So with a slow connection, don't expect even these easier Theory tasks to run the way SixTrack tasks do (those are not internet dependent), and it is even harder with CMS (if CMS tasks are actually working on Windows when you try them).

Now if you live in Europe, internet is not as much of a problem, since you are close enough to Geneva.
But in the western hemisphere that data has to leave your modem, go up to a satellite and back to servers and back to a satellite until it makes it to CERN and back, and it isn't at the speed of light (especially since we have hundreds of millions of people here on the WWW 24/7 watching videos and using up internet speed).

And Australia is a long trip to Geneva too, since the world isn't flat.

And this has nothing to do with downloading new tasks or VDIs; that is a whole different story.

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,955,814
RAC: 136,935
Message 43021 - Posted: 11 Jul 2020, 7:26:44 UTC - in response to Message 43019.  

It doesn't really matter whether you are located in Europe, North America, Australia or elsewhere (except maybe Antarctica).
The VMs have been reconfigured long ago to use Cloudflare's CDN.
Since then the majority of all requests (all in case of CVMFS!) are answered by a Cloudflare proxy close to the point where your ISP is connected to the internet.

The bottleneck is an overloaded link between your modem/router and your ISP's backbone.
An overloaded link is not always a matter of low bandwidth. In the case of a satellite link it's most likely caused by high latency.
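
A back-of-the-envelope sketch of why latency can matter more than bandwidth for many small sequential requests. All numbers below are hypothetical (request count, request size, round-trip times and link speeds), and the model ignores TCP details such as slow start.

```python
def transfer_time_s(size_bytes, bandwidth_bit_s, rtt_s, round_trips=1):
    """Time for one request: waiting for round trips plus sending the payload."""
    return round_trips * rtt_s + size_bytes * 8 / bandwidth_bit_s


def sequential_requests_s(count, size_bytes, bandwidth_bit_s, rtt_s):
    """Total time for `count` small requests issued one after another."""
    return count * transfer_time_s(size_bytes, bandwidth_bit_s, rtt_s)


# Hypothetical workload: 50 requests of 20 kB each.
cable = sequential_requests_s(50, 20_000, bandwidth_bit_s=3_000_000, rtt_s=0.030)
satellite = sequential_requests_s(50, 20_000, bandwidth_bit_s=50_000_000, rtt_s=0.600)

print(f"3 Mbit/s link, 30 ms RTT:   {cable:.1f} s")
print(f"50 Mbit/s link, 600 ms RTT: {satellite:.1f} s")
```

In this toy model the much faster satellite link still takes several times longer, because almost all of the time is spent waiting for round trips.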

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,144,248
RAC: 105,364
Message 43023 - Posted: 11 Jul 2020, 7:47:20 UTC
Last modified: 11 Jul 2020, 7:48:32 UTC

After limiting the download rate in BOINC Manager (preferences -> Network settings -> download rate) to 3100 KB/sec:
11.07.2020 08:40:11 | | max download rate: 3174400 bytes/sec
Explanation from the BOINC documentation:
Network
Usage limits
Limit download rate to N KB/second: Limit the download rate of file transfers.

Two Theory tasks are now running concurrently and successfully. They started 10 seconds apart after the download.
They can be monitored in the web browser (log file) via "Show graphics" on the left side of BOINC Manager.
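
The units are easy to mix up here, so a small conversion of the logged value may help. A minimal sketch, assuming BOINC counts 1 KB as 1024 bytes (which is what the logged 3174400 bytes/sec suggests) and using decimal megabits:

```python
def describe_rate(bytes_per_s):
    """Express a rate given in bytes/sec as KB/s (1024-based) and Mbit/s (decimal)."""
    return {
        "KB/s": bytes_per_s / 1024,
        "Mbit/s": bytes_per_s * 8 / 1_000_000,
    }


# Value from the BOINC event log line above.
rate = describe_rate(3_174_400)
print(f"{rate['KB/s']:.0f} KB/s = {rate['Mbit/s']:.1f} Mbit/s")
```

So the logged limit corresponds to 3100 KB/s, which is roughly 25.4 Mbit/s.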

computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,955,814
RAC: 136,935
Message 43027 - Posted: 11 Jul 2020, 8:34:26 UTC - in response to Message 43023.  
Last modified: 11 Jul 2020, 8:41:40 UTC

Disadvantage:
BOINC keeps its own bandwidth low even if the network switches back to a higher speed, e.g. "monthly ISP bonus".


This option may be useful if your nominal bandwidth is higher than the setting you enter here.
In that case you can guarantee a minimum bandwidth shareable by all non BOINC processes using the network.

Example:
Nominal bw: 5000 kbit/s
BOINC limit: 3000 kbit/s
free bw: 5000-3000 kbit/s => 2000 kbit/s


As soon as the slowest network section gets saturated packets queue up in front of the bottleneck and are (usually) processed in the order they arrive at the bottleneck.
This causes additional delays and a process waiting for a response may run into a timeout.
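
A minimal sketch of both points: the free-bandwidth arithmetic from the example, and how requests queued in arrival order at a bottleneck can run past a timeout. The request sizes and the 10-second timeout are hypothetical, and the model is a plain FIFO queue without any protocol overhead.

```python
def free_bandwidth_kbit_s(nominal_kbit_s, boinc_limit_kbit_s):
    """Bandwidth left for non-BOINC traffic when BOINC is capped."""
    return max(nominal_kbit_s - boinc_limit_kbit_s, 0)


def fifo_completion_times_s(request_sizes_bytes, bandwidth_bit_s):
    """Completion time of each request when they are served strictly in order."""
    completions = []
    elapsed = 0.0
    for size in request_sizes_bytes:
        elapsed += size * 8 / bandwidth_bit_s
        completions.append(elapsed)
    return completions


# Figures from the example above.
print(free_bandwidth_kbit_s(5000, 3000), "kbit/s free")

# Hypothetical: 8 VMs each request 1 MB at the same moment over a 3000 kbit/s link.
completions = fifo_completion_times_s([1_000_000] * 8, bandwidth_bit_s=3_000_000)
timeout_s = 10.0  # hypothetical timeout
late = [t for t in completions if t > timeout_s]
print(f"{len(late)} of {len(completions)} requests finish after the {timeout_s:.0f} s timeout")
```

In this toy case the first three requests finish in time and the other five would hit the timeout, even though each one alone needs under 3 seconds.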


<edit>
I'm not sure if this option limits the network traffic from VMs or just the traffic from the BOINC client itself.
</edit>

Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,029
RAC: 3,972
Message 43028 - Posted: 11 Jul 2020, 8:46:37 UTC - in response to Message 43021.  
Last modified: 11 Jul 2020, 8:50:54 UTC



computezrmle wrote:
> The bottleneck is an overloaded link between your modem/router and your ISP's backbone.
> Overloaded is not always related to low bandwidth. In case of a satellite link its most likely caused by high latencies.


I know you don't understand microwave communications, and latency is about as far as you have got.

As I told you before, I used to own a company and was its microwave communications engineer.
And if you did know how they work, you would know they don't use the same server in the same area all the time; in fact they move all over the country.

And if you had watched the Console thousands of times like I have, you would have seen the clock running: if a task doesn't connect in the first 10 seconds it ends up as Failed; then the clock keeps running, and if the task doesn't start running before that time is up it also fails; and then there is the final page with the 4 "probes", where it also fails if they don't complete in less than 3 minutes.

This is why I have made snapshots of all of this over and over and posted them.

So YES it does make a difference but you don't know that because you have no experience doing that.

And the world isn't flat.

But go ahead. I have spent more than enough time here over 16 years, running VB tasks since the very first ones that we got to work on Windows, and I was the one who had more Valids at T4T than anyone else.

computezrmle wrote:
> Disadvantage:
> BOINC keeps its own bandwidth low even if the network switches back to a higher speed, e.g. "monthly ISP bonus".


You are also wrong there, but I won't waste any more time on this other than to say that when I have that high speed I can start 4 at a time and they do not fail, BUT if I don't have that speed... well, I already told you that part.

Say what you must, but I won't waste any more time here and won't see a word if you decide to continue this.

maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,144,248
RAC: 105,364
Message 43041 - Posted: 13 Jul 2020, 9:28:12 UTC - in response to Message 43027.  

computezrmle wrote:
> Disadvantage:
> BOINC keeps its own bandwidth low even if the network switches back to a higher speed, e.g. "monthly ISP bonus".
>
> This option may be useful if your nominal bandwidth is higher than the setting you enter here.
> In that case you can guarantee a minimum bandwidth shareable by all non BOINC processes using the network.
>
> Example:
> Nominal bw: 5000 kbit/s
> BOINC limit: 3000 kbit/s
> free bw: 5000-3000 kbit/s => 2000 kbit/s
>
> As soon as the slowest network section gets saturated packets queue up in front of the bottleneck and are (usually) processed in the order they arrive at the bottleneck.
> This causes additional delays and a process waiting for a response may run into a timeout.
>
> <edit>
> I'm not sure if this option limits the network traffic from VMs or just the traffic from the BOINC client itself.
> </edit>

maeax wrote:
> After limiting the download rate in BOINC Manager (preferences -> Network settings -> download rate) to 3100 KB/sec:
> 11.07.2020 08:40:11 | | max download rate: 3174400 bytes/sec
> Explanation from the BOINC documentation:
> Network
> Usage limits
> Limit download rate to N KB/second: Limit the download rate of file transfers.
>
> Two Theory tasks are now running concurrently and successfully. They started 10 seconds apart after the download.
> They can be monitored in the web browser (log file) via "Show graphics" on the left side of BOINC Manager.


I have checked this with Windows Task Manager: the download rate shown there is limited to 3100 KB/s.

Henry Nebrensky
Joined: 13 Jul 05
Posts: 165
Credit: 14,925,288
RAC: 34
Message 43043 - Posted: 13 Jul 2020, 11:25:27 UTC - in response to Message 43041.  

maeax wrote:
> After limiting the download rate in BOINC Manager (preferences -> Network settings -> download rate) to 3100 KB/sec:
> ...
> I have checked this with Windows Task Manager: the download rate shown there is limited to 3100 KB/s.
Does this limit also apply to CVMFS downloads from within the VM, or only to traffic for the Boinc executable itself?