Message boards : CMS Application : CMS Tasks Failing

maeax

Joined: 2 May 07
Posts: 2259
Credit: 175,149,420
RAC: 69,319
Message 43355 - Posted: 14 Sep 2020, 11:33:20 UTC - in response to Message 43354.  

VT-x needs to be enabled in the BIOS of an Intel PC,
and Hyper-V in Windows needs to be DISABLED.
If other errors appear after a reboot, please report them.
ID: 43355 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1071
Credit: 8,276,900
RAC: 7,955
Message 43356 - Posted: 14 Sep 2020, 12:16:55 UTC - in response to Message 43355.  

VT-x needs to be enabled in the BIOS of an Intel PC,
and Hyper-V in Windows needs to be DISABLED.
If other errors appear after a reboot, please report them.

Is there a good recipe for disabling Hyper-V? I tried several methods found on the Web, but still I get
Virtualization Virtualbox (6.1.12) installed, CPU does not have hardware virtualization support
in https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10653693
I thought it might have been because I had Windows Subsystem for Linux installed, but after I
removed that, Hyper-V still comes back every time I boot.
ID: 43356 · Report as offensive     Reply Quote
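A quick way to see whether a hypervisor is still active after a reboot is to look at the output of the standard Windows command `systeminfo`. The following is only a minimal sketch: it assumes an English-language Windows, where `systeminfo` prints the line "A hypervisor has been detected" in its Hyper-V Requirements section; the function and constant names are made up for illustration.

```python
import subprocess

# Assumption: English-language Windows prints this phrase in `systeminfo`
# output when a hypervisor (for example Hyper-V) is running.
HYPERVISOR_MARKER = "A hypervisor has been detected"


def hypervisor_detected(systeminfo_text: str) -> bool:
    """Return True if the given systeminfo output reports a running hypervisor."""
    return HYPERVISOR_MARKER.lower() in systeminfo_text.lower()


if __name__ == "__main__":
    # Runs only on Windows, where `systeminfo` is available.
    output = subprocess.run(
        ["systeminfo"], capture_output=True, text=True, check=True
    ).stdout
    if hypervisor_detected(output):
        print("A hypervisor is still active; VirtualBox may not get VT-x.")
    else:
        print("No hypervisor reported by systeminfo.")
```

If the marker is still reported after every reboot, something is re-enabling the hypervisor at boot, which would match the symptom described above.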
maeax

Joined: 2 May 07
Posts: 2259
Credit: 175,149,420
RAC: 69,319
Message 43357 - Posted: 14 Sep 2020, 13:16:17 UTC - in response to Message 43356.  

I have only one Intel PC (an HP), and there Hyper-V is not enabled in Windows Features; all my other PCs are AMD (SVM for virtualization in the BIOS).
I have no idea why Hyper-V is enabled again after a reboot, sorry.
ID: 43357 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2572
Credit: 259,299,547
RAC: 107,057
Message 43358 - Posted: 15 Sep 2020, 7:09:16 UTC - in response to Message 43356.  

Is there a good recipe for disabling Hyper-V?

A recent comment posted by Microsoft:
https://docs.microsoft.com/en-us/troubleshoot/windows-client/application-management/virtualization-apps-not-work-with-hyper-v
ID: 43358 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1071
Credit: 8,276,900
RAC: 7,955
Message 43359 - Posted: 15 Sep 2020, 12:46:26 UTC - in response to Message 43358.  

Is there a good recipe for disabling Hyper-V?

A recent comment posted by Microsoft:
https://docs.microsoft.com/en-us/troubleshoot/windows-client/application-management/virtualization-apps-not-work-with-hyper-v

Thanks. I've done most of those I think, but I'll go through it step-by-step.
ID: 43359 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 43806 - Posted: 8 Dec 2020, 18:05:44 UTC

After a long time I got a CMS task, which ended in failure. Condor ended in 10656 s. Is that right?
Tullio
ID: 43806 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1071
Credit: 8,276,900
RAC: 7,955
Message 43844 - Posted: 11 Dec 2020, 9:10:33 UTC - in response to Message 43806.  

After a long time I got a CMS task, which ended in failure. Condor ended in 10656 s. Is that right?
Tullio

CPU time seems credible for running one job, but unfortunately there's not enough information in your log file to say why it then failed. I'm seeing suspicions of network problems overall, but nothing concrete to put my finger on just yet.
ID: 43844 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 43851 - Posted: 11 Dec 2020, 12:33:54 UTC
Last modified: 11 Dec 2020, 12:34:36 UTC

I am running Atlas, Theory and SixTrack on this same CPU, plus QuChemPedIA@home, all using VirtualBox except SixTrack, and all run well. QuChem uses VirtualBox because it is a Linux project; there I am faster than most Linux CPUs, even those with 128 processors. I am using a 6-processor Intel i5 9400F CPU. I am ranked 51 in RAC.
Tullio
ID: 43851 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 43878 - Posted: 12 Dec 2020, 15:42:18 UTC

Another CMS task failed with exactly the same message. I am now running 4 Theory tasks. I had to remove the McAfee antivirus program to run Atlas tasks and am now using Windows Defender.
Tullio
ID: 43878 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44011 - Posted: 27 Dec 2020, 13:19:19 UTC

Condor ended after 10637 seconds. Atlas and Theory tasks all complete. QuChemPedIA@home tasks, using VirtualBox, run perfectly. I am number 50 in the RAC ranking, although my Intel i5 CPU is far inferior to the Intel i7 and AMD Ryzen Threadripper CPUs running Linux. They take longer without using VirtualBox.
Tullio.
ID: 44011 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2259
Credit: 175,149,420
RAC: 69,319
Message 44037 - Posted: 1 Jan 2021, 11:52:46 UTC - in response to Message 44011.  

Tullio,
can you ping vocms0267.cern.ch from a shell?
This Condor server uses port 9618 while a CMS task is running.
ID: 44037 · Report as offensive     Reply Quote
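Note that `ping` uses ICMP, so a successful ping only shows that the host answers; it does not show that TCP port 9618 is reachable. A minimal sketch of a TCP check is given below. The host and port are the ones named in the post above; the function name `tcp_port_open` and the 5-second timeout are illustrative choices, and a successful connect says nothing about whether the Condor service itself is healthy.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


if __name__ == "__main__":
    # Host and port as given in the post above.
    host, port = "vocms0267.cern.ch", 9618
    state = "reachable" if tcp_port_open(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

Running it on the host shows what the host's firewall allows; running it from a shell inside a Linux VM is closer to what the CMS VM sees.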
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44045 - Posted: 2 Jan 2021, 6:24:24 UTC - in response to Message 44037.  
Last modified: 2 Jan 2021, 6:30:42 UTC

Pinged it from a Linux virtual machine on a Windows 10 host.
27 packets transmitted, 0 packet loss.
Thanks anyway
Tullio
wifi at 5 GHz
ID: 44045 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2572
Credit: 259,299,547
RAC: 107,057
Message 44048 - Posted: 2 Jan 2021, 8:23:58 UTC - in response to Message 44045.  

At the end of a subtask calculation CMS uploads a 120 MB result file directly from inside the VM to vc-cms-output.s3.cern.ch.
Besides that, CMS reports the status to:
vc-cms-output.s3.cern.ch port 443
WMAgent port 4080
HTCondor port 9618


From the logs and the total runtime it can be seen that the errors always happen at that point and since the VM doesn't get a 2nd subtask the whole task fails after a couple of attempts.
Of course, this is a nasty behaviour, but ATM we have to deal with it.

The important thing is to find out whether the result upload, the reporting or the request for fresh work fails.
Unfortunately the logfile doesn't tell us any details.


Wi-Fi might be a factor.
It's nice to know that your Wi-Fi is running at 5 GHz, but this doesn't tell us anything about the connection stability and net data rates.
A cable connection should be used whenever possible.

The upload of the 120 MB result file should be visible in the network monitoring, either on the host or at the internet router.
If this upload fails, corresponding error messages appear at the VM consoles - you may look at ALT-F4, ALT-F5 ... and post them here.


Another factor could be a malware protection suite that is configured to firewall some of the communication packets but lets packets pass that are used for the VM's basic network tests.
ID: 44048 · Report as offensive     Reply Quote
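To get a feeling for how long the 120 MB upload occupies the uplink, here is a rough back-of-the-envelope sketch. It assumes 1 MB = 10^6 bytes and 1 Mbit = 10^6 bits and ignores protocol overhead and retransmissions, so real uploads will take longer. The uplink rates in the loop are hypothetical examples, not measurements from this thread.

```python
def upload_seconds(size_mb: float, uplink_mbit_per_s: float) -> float:
    """Ideal transfer time in seconds for size_mb megabytes at the given uplink rate."""
    if uplink_mbit_per_s <= 0:
        raise ValueError("uplink rate must be positive")
    # 8 bits per byte: megabytes -> megabits, then divide by the rate.
    return size_mb * 8 / uplink_mbit_per_s


if __name__ == "__main__":
    RESULT_SIZE_MB = 120  # result file size stated in the post above
    for rate in (1, 5, 10, 50):  # hypothetical uplink rates in Mbit/s
        seconds = upload_seconds(RESULT_SIZE_MB, rate)
        print(f"{rate:>3} Mbit/s -> {seconds:6.0f} s")
```

At a net 1 Mbit/s the upload needs 16 minutes even in the ideal case, so an unstable connection has plenty of time to break it.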
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44049 - Posted: 2 Jan 2021, 12:18:07 UTC - in response to Message 44048.  

I am completing Atlas tasks on 2 CPUs and QuChem tasks where I am in the top 50 users against many Linux hosts with up to 128 processors. My tasks using VirtualBox are faster than most Linux hosts, with rare exceptions. They don't use VirtualBox.
Tullio
ID: 44049 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2572
Credit: 259,299,547
RAC: 107,057
Message 44050 - Posted: 2 Jan 2021, 12:57:33 UTC - in response to Message 44049.  

...and QuChem tasks where I am in the top 50 users against many Linux hosts with up to 128 processors.

That's very nice.

Nonetheless, I don't know the network requirements of QuChem. Hence I can't compare them to CMS.

Even ATLAS has other requirements than CMS, especially regarding the server side job distribution systems.
ATLAS contacts Panda while CMS contacts HTCondor and WMAgent.
ID: 44050 · Report as offensive     Reply Quote
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44051 - Posted: 3 Jan 2021, 7:24:40 UTC - in response to Message 44050.  

OK. But I also complete Theory tasks, which I have run since the project was called Test4Theory@home, at the invitation of Ben Segal, who sent me a handwritten letter and a polo shirt. Happy New Year, Ben!
Tullio
ID: 44051 · Report as offensive     Reply Quote
maeax

Joined: 2 May 07
Posts: 2259
Credit: 175,149,420
RAC: 69,319
Message 44052 - Posted: 3 Jan 2021, 8:23:01 UTC - in response to Message 44051.  
Last modified: 3 Jan 2021, 8:23:23 UTC

+1
Found this thread about HTCondor from 7 years ago:
https://lists.cs.wisc.edu/archive/htcondor-users/2013-January/msg00139.shtml
ID: 44052 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2572
Credit: 259,299,547
RAC: 107,057
Message 44053 - Posted: 3 Jan 2021, 9:34:03 UTC - in response to Message 44052.  

One of my VMs got its 1st CMS subtask this morning at 5:34:40 UTC and successfully finished it at 8:50:01 UTC.
This is a runtime of >3:15.

A few seconds after the result upload the same VM got its 2nd subtask.
Hence, I doubt the VMs are affected by a 7-year-old issue that might have been a disk space error on a Windows machine rather than a bug inside a VM running Linux.


This should not initiate an OS war; it's just that, independent of the host OS, CMS always runs on the same Linux VM image.
In addition, inside this VM the scientific apps are encapsulated a second time in a Singularity container.
ID: 44053 · Report as offensive     Reply Quote
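The runtime quoted above can be checked with a few lines of arithmetic. This is just a sketch; the helper name `elapsed` is made up, and it assumes both clock readings fall on the same day.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Time between two same-day clock readings given as H:MM:SS."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


if __name__ == "__main__":
    # Start and end times as quoted in the post above.
    runtime = elapsed("5:34:40", "8:50:01")
    print(runtime, int(runtime.total_seconds()))
```

This gives 3:15:21, i.e. 11721 seconds, for one successful subtask. For comparison, the 10656 s reported for a failing task earlier in the thread is just under three hours.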
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44054 - Posted: 3 Jan 2021, 10:37:52 UTC

Frankly, I don't know much about containers. I used to run Test4Theory@home tasks on a Linux host. Now I run QuChem Linux tasks on a Windows 10 host using a wrapper. They all run well and that satisfies me.
Tullio
ID: 44054 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1838
Credit: 123,410,699
RAC: 144,119
Message 44365 - Posted: 22 Feb 2021, 9:25:27 UTC

I noticed only now that since last night, all CMS tasks fail after a few minutes with

"-152 (0xFFFFFF68) ERR_NETOPEN"
2021-02-22 08:45:28 (5768): Guest Log: [ERROR] Could not connect to Condor server on port 9618
2021-02-22 08:45:28 (5768): Guest Log: [INFO] Shutting Down.

What's the problem?
ID: 44365 · Report as offensive     Reply Quote
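When a task log is long, the guest errors can be pulled out automatically. The sketch below assumes every relevant line has exactly the shape of the two lines quoted above (timestamp, process id in parentheses, "Guest Log:", a level in square brackets, then the message); other line formats are simply skipped. All names in it are illustrative.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches lines shaped like the ones quoted in the post above, e.g.
# 2021-02-22 08:45:28 (5768): Guest Log: [ERROR] Could not connect ...
GUEST_LOG = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\((?P<pid>\d+)\): Guest Log: \[(?P<level>[A-Z]+)\] (?P<text>.*)$"
)


class GuestLogEntry(NamedTuple):
    stamp: str
    pid: int
    level: str
    text: str


def parse_guest_log(line: str) -> Optional[GuestLogEntry]:
    """Parse one log line; return None if it is not a Guest Log line."""
    match = GUEST_LOG.match(line.strip())
    if match is None:
        return None
    return GuestLogEntry(
        match["stamp"], int(match["pid"]), match["level"], match["text"]
    )


def guest_errors(lines: Iterable[str]) -> List[GuestLogEntry]:
    """Return only the entries whose level is ERROR."""
    entries = (parse_guest_log(line) for line in lines)
    return [e for e in entries if e is not None and e.level == "ERROR"]


if __name__ == "__main__":
    sample = [
        "2021-02-22 08:45:28 (5768): Guest Log: [ERROR] Could not connect to Condor server on port 9618",
        "2021-02-22 08:45:28 (5768): Guest Log: [INFO] Shutting Down.",
    ]
    for entry in guest_errors(sample):
        print(entry.stamp, entry.text)
```

For the two lines above this reports only the failed connection to port 9618, which is the part worth posting.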

©2025 CERN