Message boards : CMS Application : CMS Tasks Failing
anarchic teapot
Joined: 15 Feb 06
Posts: 67
Credit: 460,896
RAC: 0
Message 43354 - Posted: 14 Sep 2020, 9:35:28 UTC

I tried switching off CMS tasks in my Preferences...

It doesn't work.

In desperation I've stopped all work from the project. Thankfully, LHC *does* respect that (a number of projects don't).

However, if it's any help, I can confirm that CMS is broken even with a clean install on a brand new computer.
sQuonk
Plague of Mice
Intel Core i3-9100 CPU@3.60 GHz, but it's doing its bit just the same.
ID: 43354
maeax
Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 43355 - Posted: 14 Sep 2020, 11:33:20 UTC - in response to Message 43354.  

VT-x needs to be enabled in the BIOS of an Intel PC,
and Hyper-V in Windows needs to be DISABLED.
If you still get errors after a reboot, please report them.
ID: 43355
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 298
Message 43356 - Posted: 14 Sep 2020, 12:16:55 UTC - in response to Message 43355.  

VT-x needs to be enabled in the BIOS of an Intel PC,
and Hyper-V in Windows needs to be DISABLED.
If you still get errors after a reboot, please report them.

Is there a good recipe for disabling Hyper-V? I tried several methods found on the Web, but still I get
Virtualization Virtualbox (6.1.12) installed, CPU does not have hardware virtualization support
in https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10653693
I thought it might have been because I had Windows Subsystem for Linux installed, but after I
removed that, Hyper-V still comes back every time I boot.
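
One way to check whether a hypervisor is still active after a reboot is to look at the output of the Windows `systeminfo` command. A minimal sketch in Python, assuming an English-language Windows on which `systeminfo` prints "A hypervisor has been detected" under "Hyper-V Requirements" while a hypervisor is running; the function name is made up for illustration:

```python
import subprocess
import sys

# Phrase printed by English-language `systeminfo` when a hypervisor is already
# running (assumption: other locales use different wording).
HYPERVISOR_MARKER = "a hypervisor has been detected"


def hypervisor_detected(systeminfo_text: str) -> bool:
    """Return True if the given `systeminfo` output reports a running hypervisor."""
    return HYPERVISOR_MARKER in systeminfo_text.lower()


if sys.platform == "win32":
    # Only meaningful on Windows; `systeminfo` can take several seconds.
    output = subprocess.run(
        ["systeminfo"], capture_output=True, text=True, check=False
    ).stdout
    if hypervisor_detected(output):
        print("A hypervisor is active; VirtualBox may not see VT-x.")
    else:
        print("No active hypervisor reported.")
```

If the marker still shows up after every reboot, something is re-enabling the hypervisor at boot time, which would match the symptom described above.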
ID: 43356
maeax
Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 43357 - Posted: 14 Sep 2020, 13:16:17 UTC - in response to Message 43356.  

I have only one Intel PC (an HP), and on it Hyper-V is not enabled in Windows Features; all my other PCs are AMD (SVM for virtualization in the BIOS).
I have no idea why Hyper-V is enabled again after a reboot, sorry.
ID: 43357
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 43358 - Posted: 15 Sep 2020, 7:09:16 UTC - in response to Message 43356.  

Is there a good recipe for disabling Hyper-V?

A recent comment posted by Microsoft:
https://docs.microsoft.com/en-us/troubleshoot/windows-client/application-management/virtualization-apps-not-work-with-hyper-v
ID: 43358
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 298
Message 43359 - Posted: 15 Sep 2020, 12:46:26 UTC - in response to Message 43358.  

Is there a good recipe for disabling Hyper-V?

A recent comment posted by Microsoft:
https://docs.microsoft.com/en-us/troubleshoot/windows-client/application-management/virtualization-apps-not-work-with-hyper-v

Thanks. I've done most of those I think, but I'll go through it step-by-step.
ID: 43359
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 43806 - Posted: 8 Dec 2020, 18:05:44 UTC

After a long time I got a CMS task, which ended in failure. Condor ended in 10656 s. Is that right?
Tullio
ID: 43806
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 298
Message 43844 - Posted: 11 Dec 2020, 9:10:33 UTC - in response to Message 43806.  

After a long time I got a CMS task, which ended in failure. Condor ended in 10656 s. Is that right?
Tullio

CPU time seems credible for running one job, but unfortunately there's not enough information in your log file to say why it then failed. I'm seeing suspicions of network problems overall, but nothing concrete to put my finger on just yet.
ID: 43844
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 43851 - Posted: 11 Dec 2020, 12:33:54 UTC
Last modified: 11 Dec 2020, 12:34:36 UTC

I am running ATLAS, Theory and SixTrack on this same CPU, plus QuChemPedIA@home. All use VirtualBox except SixTrack, and all run well. QuChem uses VirtualBox because it is a Linux project. I am faster than most Linux CPUs, even those with 128 processors. I am using a 6-processor Intel i5 9400F CPU. I am ranked 51 in RAC.
Tullio
ID: 43851
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 43878 - Posted: 12 Dec 2020, 15:42:18 UTC

Another CMS task failed exactly with the same message. I am now running 4 Theory tasks. I had to remove the McAfee antivirus program to run Atlas tasks and am now using Windows Defender.
Tullio
ID: 43878
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44011 - Posted: 27 Dec 2020, 13:19:19 UTC

Condor ended after 10637 seconds. ATLAS and Theory tasks all complete. QuChemPedIA@home, using VirtualBox, runs perfectly. I am number 50 in the RAC ranking, although my Intel i5 CPU is far inferior to the Intel i7 and AMD Ryzen Threadripper CPUs running Linux. They take longer, even though they don't use VirtualBox.
Tullio.
ID: 44011
maeax
Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 44037 - Posted: 1 Jan 2021, 11:52:46 UTC - in response to Message 44011.  

Tullio,
can you ping vocms0267.cern.ch from a shell?
This Condor server uses port 9618 while a CMS task is running.
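
Note that ping only tests ICMP reachability, not whether TCP port 9618 can actually be reached. A minimal sketch in Python of a TCP connection test, using the host name and port given above; the helper name `tcp_port_open` is hypothetical:

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False


if __name__ == "__main__":
    host, port = "vocms0267.cern.ch", 9618
    state = "reachable" if tcp_port_open(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state}")
```

A successful connection only shows that the port is open from the machine where the script runs; it does not prove that the traffic from inside the VM gets through as well.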
ID: 44037
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44045 - Posted: 2 Jan 2021, 6:24:24 UTC - in response to Message 44037.  
Last modified: 2 Jan 2021, 6:30:42 UTC

Pinged it from a Linux virtual machine on a Windows 10 host.
27 packets transmitted, 0 packet loss.
Thanks anyway
Tullio
wifi at 5 GHz
ID: 44045
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 44048 - Posted: 2 Jan 2021, 8:23:58 UTC - in response to Message 44045.  

At the end of a subtask calculation CMS uploads a 120 MB result file directly from inside the VM to vc-cms-output.s3.cern.ch.
Besides that, CMS reports the status to:
vc-cms-output.s3.cern.ch port 443
WMAgent port 4080
HTCondor port 9618


From the logs and the total runtime it can be seen that the errors always happen at that point and since the VM doesn't get a 2nd subtask the whole task fails after a couple of attempts.
Of course, this is a nasty behaviour, but ATM we have to deal with it.

The important thing is to find out whether the result upload, the reporting or the request for fresh work fails.
Unfortunately the logfile doesn't tell us any details.


Wi-Fi might be a factor.
It's nice to know that your Wi-Fi is running at 5 GHz, but this doesn't tell us anything about the connection stability or the net data rates.
A cable connection should be used whenever possible.
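
To see why the net data rate matters, here is a rough back-of-the-envelope sketch in Python of how long a 120 MB upload takes at different sustained upstream rates. The rates are illustrative assumptions, not measurements, and protocol overhead is ignored:

```python
def upload_seconds(size_mb: float, uplink_mbit_per_s: float) -> float:
    """Ideal transfer time for size_mb megabytes at a sustained uplink rate in Mbit/s."""
    if uplink_mbit_per_s <= 0:
        raise ValueError("uplink rate must be positive")
    # 1 byte = 8 bits, so megabytes * 8 = megabits.
    return size_mb * 8 / uplink_mbit_per_s


if __name__ == "__main__":
    # Hypothetical upstream rates; 120 MB is the result size mentioned above.
    for rate in (1, 5, 20):
        print(f"{rate:>3} Mbit/s: {upload_seconds(120, rate):.0f} s")
```

At a sustained 1 Mbit/s the upload needs at least 960 s (16 minutes), so a connection that drops within that window would interrupt it.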

The upload of the 120 MB result file should be visible in the network monitoring, either on the host or at the internet router.
If this upload fails, corresponding error messages appear on the VM consoles; you can look at ALT-F4, ALT-F5 ... and post them here.


Another factor could be a malware protection suite that is configured to firewall some of the communication packets but lets packets pass that are used for the VM's basic network tests.
ID: 44048
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44049 - Posted: 2 Jan 2021, 12:18:07 UTC - in response to Message 44048.  

I am completing ATLAS tasks on 2 CPUs and QuChem tasks where I am in the top 50 users against many Linux hosts with up to 128 processors. My tasks using VirtualBox are faster than those of most Linux hosts, with rare exceptions, and those hosts don't use VirtualBox.
Tullio
ID: 44049
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 44050 - Posted: 2 Jan 2021, 12:57:33 UTC - in response to Message 44049.  

...and QuChem tasks where I am in the top 50 users against many Linux hosts with up to 128 processors.

That's very nice.

Nonetheless, I don't know the network requirements of QuChem. Hence I can't compare them to CMS.

Even ATLAS has other requirements than CMS, especially regarding the server side job distribution systems.
ATLAS contacts Panda while CMS contacts HTCondor and WMAgent.
ID: 44050
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44051 - Posted: 3 Jan 2021, 7:24:40 UTC - in response to Message 44050.  

OK. But I also complete Theory tasks, which I have been running since the project was called Test4Theory@home, at the invitation of Ben Segal, who sent me a handwritten letter and a polo shirt. Happy New Year, Ben!
Tullio
ID: 44051
maeax
Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 456
Message 44052 - Posted: 3 Jan 2021, 8:23:01 UTC - in response to Message 44051.  
Last modified: 3 Jan 2021, 8:23:23 UTC

+1
Found this thread about HTCondor from 7 years ago:
https://lists.cs.wisc.edu/archive/htcondor-users/2013-January/msg00139.shtml
ID: 44052
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2541
Credit: 254,608,838
RAC: 34,609
Message 44053 - Posted: 3 Jan 2021, 9:34:03 UTC - in response to Message 44052.  

One of my VMs got its 1st CMS subtask this morning at 5:34:40 UTC and successfully finished it at 8:50:01 UTC.
This is a runtime of >3:15.
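
The exact figure can be checked with a short Python sketch; the two timestamps are the ones quoted above, and both are assumed to fall on the same UTC day:

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Time between two same-day clock times given as H:MM:SS."""
    fmt = "%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


runtime = elapsed("5:34:40", "8:50:01")
print(runtime, int(runtime.total_seconds()))  # 3:15:21 11721
```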

A few seconds after the result upload the same VM got its 2nd subtask.
Hence, I doubt the VMs are affected by a 7-year-old issue that might have been a disk space error on a Windows machine rather than a bug inside a VM running Linux.


This should not start an OS war; it's just that, independent of the host OS, CMS always runs on the same Linux VM image.
In addition, inside this VM the scientific apps are encapsulated a second time in a Singularity container.
ID: 44053
tullio
Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 44054 - Posted: 3 Jan 2021, 10:37:52 UTC

Frankly, I don't know much about containers. I used to run Test4Theory@home tasks on a Linux host. Now I run QuChem Linux tasks on a Windows 10 host using a wrapper. They all run well and that satisfies me.
Tullio
ID: 44054


©2024 CERN