Message boards : CMS Application : CMS Tasks Failing

maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 307
Message 32865 - Posted: 21 Oct 2017, 6:10:43 UTC

Spring and Autumn are therefore a good time.
We volunteers would be happy if this were solved.
Ivan, we are staying with you!
ID: 32865
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 32866 - Posted: 21 Oct 2017, 7:01:09 UTC

All Atlas tasks complete and validate both on my Windows 10 PC and the SUN Linux box. All other LHC tasks fail, save SixTrack. What is the difference between Atlas tasks and other VM tasks?
Tullio
ID: 32866
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 32870 - Posted: 21 Oct 2017, 22:02:40 UTC - in response to Message 32865.  

Spring and Autumn are therefore a good time.
We volunteers would be happy if this were solved.
Ivan, we are staying with you!

Thanks, I need the feedback sometimes. Especially as I need to get this into real CMS production before I retire. I'm contracted until 31/3/18; there might be funds until the end of next year. After that I may be forced back to Byron Bay. :-( (The health service is free here, not so in Oz.)
ID: 32870
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32871 - Posted: 22 Oct 2017, 4:34:17 UTC - in response to Message 32870.  

... I retire ...

:-( :-( :-(
ID: 32871
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1177
Credit: 54,887,670
RAC: 3,877
Message 32872 - Posted: 22 Oct 2017, 7:25:48 UTC - in response to Message 32870.  
Last modified: 22 Oct 2017, 7:26:43 UTC

Spring and Autumn are therefore a good time.
We volunteers would be happy if this were solved.
Ivan, we are staying with you!

Thanks, I need the feedback sometimes. Especially as I need to get this into real CMS production before I retire. I'm contracted until 31/3/18; there might be funds until the end of next year. After that I may be forced back to Byron Bay. :-( (The health service is free here, not so in Oz.)


Moving to New South Wales?
Australia is too far away for me to ever travel.
Is this *semi-retired*?? (that is what I have always claimed myself)

But I have lived here since 1971, so most of my travel is on the big screen via satellite (which ends up costing as much as actually traveling).

I just might have to run a few hundred CMS for you since I was just leaving those to you over at -dev

I do like that contract date since I will hit that dreaded *60* mark in January .....and start my 14th year here in a couple days (10-24-2004)

Over here most people move to live by the ocean or for some hot reason down to Arizona when they retire........but I have lived by the water pretty much all 60 years so Byron Bay sounds like the place for that Ivan (humid subtropical climate)

It will be a warmer December down there.
And it looks like not a lot of people there (I wish some where I live would move there)

Here it is rain until June and occasional wind just enough to knock out the power and shut down my fleet of computers.....and everything else (yeah you would think I would have got a generator by now)

I only have an ale on a rare occasion, but switching from Samuel Smith's to a Foster's would be impossible (OK, it does look like they have hundreds of microbrewers just like we have here).
I would take care of that 20-core Xeon for you, of course.

But let me know if you want me to run some CMS multi-cores
Volunteer Mad Scientist For Life
ID: 32872
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 307
Message 32873 - Posted: 22 Oct 2017, 9:59:17 UTC - in response to Message 32872.  
Last modified: 22 Oct 2017, 9:59:35 UTC

But let me know if you want me to run some CMS multi-cores

Magic,

you can set your preferences in -dev to get both multicore tasks (Theory and CMS).
In production this is running very well for me (single-core only).
ID: 32873
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 32874 - Posted: 22 Oct 2017, 13:49:38 UTC - in response to Message 32872.  

Spring and Autumn are therefore a good time.
We volunteers would be happy if this were solved.
Ivan, we are staying with you!

Thanks, I need the feedback sometimes. Especially as I need to get this into real CMS production before I retire. I'm contracted until 31/3/18; there might be funds until the end of next year. After that I may be forced back to Byron Bay. :-( (The health service is free here, not so in Oz.)


Moving to New South Wales?
Australia is too far away for me to ever travel.
Is this *semi-retired*?? (that is what I have always claimed myself)

Yeah, I was born there, 66 years ago. In the bush. In a banana shed!
Not sure about semi-retired, but I'd go mental being fully retired. Perhaps write that SF novel that's been in my head for decades?

But I have lived here since 1971, so most of my travel is on the big screen via satellite (which ends up costing as much as actually traveling).

I do like that contract date since I will hit that dreaded *60* mark in January .....and start my 14th year here in a couple days (10-24-2004)

I realised recently that this is the longest I've worked at one place, the longest I've worked in one country (17 years UK, 13 years CH, 14 years Oz including Uni, and 4 years CA), and the longest I've lived at one address. :-(

Over here most people move to live by the ocean or for some hot reason down to Arizona when they retire........but I have lived by the water pretty much all 60 years so Byron Bay sounds like the place for that Ivan (humid subtropical climate)

It will be a warmer December down there.
And it looks like not a lot of people there (I wish some where I live would move there)

Here it is rain until June and occasional wind just enough to knock out the power and shut down my fleet of computers.....and everything else (yeah you would think I would have got a generator by now)

I know what you mean about the rain, I lived for four years in Vancouver BC. I once asked a chap I was sharing a Whistler chair-lift with where he was from, "Everwet, Washington!" Surprisingly, Expo '86 was completely dry except for the last day -- which was also the day I was riding my motorcycle from Coos Bay (IIRC) to YVR on my way back from a conference in SF. I was a bit bedraggled at the border crossing...
On the other hand, Byron Bay gets 100 inches per year, mostly as summer thunderstorms and the rain-depression aftermaths of cyclones (Pacific hurricanes) that have run out of steam on their way south down the Queensland coast. The humidity gets me ever since I moved away to Uni in Canberra -- I ended up not going home for Christmas while I was doing my PhD; 42 C and 20% humidity was preferable to 32 C and 90% humidity.
ID: 32874
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1061
Credit: 7,737,455
RAC: 201
Message 32876 - Posted: 22 Oct 2017, 15:41:15 UTC - in response to Message 32872.  

I would take care of that 20-core Xeon for you, of course.

Unfortunately, that machine and its twin (they each take up a half-width slot in a 2U rack chassis) belong to work, not to me. They both run SETI@Home in the background as a "stress test". :-) So far, everyone seems happy with that. Running CMS@Home on brphab and another 12-core machine is easier to justify, of course, given my involvement with this project.
ID: 32876
marmot
Joined: 5 Nov 15
Posts: 144
Credit: 6,301,268
RAC: 0
Message 32892 - Posted: 24 Oct 2017, 0:03:54 UTC - in response to Message 32866.  

All Atlas tasks complete and validate both on my Windows 10 PC and the SUN Linux box. All other LHC tasks fail, save SixTrack. What is the difference between Atlas tasks and other VM tasks?
Tullio


If you watch each project's VM progress until it starts crunching, ATLAS seems to be on a different development path, while CMS, Theory and LHCb seem to use near-identical scripts and setups.
ID: 32892
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 32894 - Posted: 24 Oct 2017, 7:38:25 UTC - in response to Message 32892.  


If you watch each project's VM progress until it starts crunching, ATLAS seems to be on a different development path, while CMS, Theory and LHCb seem to use near-identical scripts and setups.

That is why I am running only Atlas@home and SixTrack, when available. I abort all the rest.
Tullio
ID: 32894
maeax

Joined: 2 May 07
Posts: 2244
Credit: 173,902,375
RAC: 307
Message 32897 - Posted: 24 Oct 2017, 11:00:55 UTC

I have CMS and Theory selected in preferences, and they are running very well on a server, without problems!
ID: 32897
marmot
Joined: 5 Nov 15
Posts: 144
Credit: 6,301,268
RAC: 0
Message 32907 - Posted: 26 Oct 2017, 5:28:31 UTC

For the last couple of weeks I've been stress testing LHC@Home work on this server by overloading BOINC with high-disk-usage tasks (Pogs, Rosetta) running alongside the various VM-based LHC WU types. CMS has the highest failure rate.
"Unable to mount root" is the typical failure.

If Theory, CMS, LHCb and ATLAS are starved of CPU cycles by competing applications that use most of the available threads (simulating game playing, etc.), CMS again has the highest failure rate on this server.

The only way to get 12 CMS tasks to eventually all suspend successfully on the 3-drive hardware RAID0 partition dedicated to BOINC is to accept the first suspend failures and then defragment the partition, which by then is 95%+ fragmented.
Subsequent suspension attempts will almost always succeed once the drive is defragmented.
This says something about how BOINC moves the large WU VDIs into the slots when a whole batch of WUs is started. Maybe there is a BOINC setting to force it to move one slot at a time instead of multi-tasking that procedure? (This project is certainly designed for SSDs, which are not in my budget.)

Could the project be modified so that WUs use a linked clone of the base VDI for each slot, saving on space, fragmentation and setup time?
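
To put a rough number on the space argument, here is a minimal Python sketch of the arithmetic. It assumes a linked clone shares one read-only base image and adds a per-slot differencing image, which is how VirtualBox linked clones generally work; it says nothing about how the project's wrapper actually sets up its slots. The function names are made up for illustration, and the image sizes are inputs the caller must measure, since the thread gives none.

```python
def full_copy_bytes(n_slots: int, base_vdi_bytes: int) -> int:
    """Disk space used when every slot gets its own full copy of the base VDI."""
    return n_slots * base_vdi_bytes


def linked_clone_bytes(n_slots: int, base_vdi_bytes: int, diff_bytes_per_slot: int) -> int:
    """Disk space used when the slots share one base VDI and each slot
    only stores its own differencing image (the linked-clone assumption)."""
    return base_vdi_bytes + n_slots * diff_bytes_per_slot


def bytes_saved(n_slots: int, base_vdi_bytes: int, diff_bytes_per_slot: int) -> int:
    """Difference between the two layouts; negative if linked clones use more."""
    return full_copy_bytes(n_slots, base_vdi_bytes) - linked_clone_bytes(
        n_slots, base_vdi_bytes, diff_bytes_per_slot
    )
```

With 12 slots, as in the test above, the saving grows with the size of the base image and shrinks as the per-slot differencing images fill up during a run.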
ID: 32907
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32913 - Posted: 27 Oct 2017, 15:56:40 UTC

Again, the same problem with a task this morning: it errored out after a few minutes:

2017-10-27 09:29:48 (6820): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-27 09:30:08 (6820): VM Completion File Detected.
2017-10-27 09:30:08 (6820): VM Completion Message: Could not connect to Condor server on port 9618
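
To tell a local network or firewall problem from a server-side one, the connection test in the log above can be reproduced by hand. The following is a minimal Python sketch, not the code the VM actually runs: it only checks whether a TCP connection to port 9618 can be opened. The function name can_connect is made up for illustration, and the 20-second default timeout is an assumption based on the gap between the two log timestamps. The thread does not name the Condor server host, so the caller has to supply it.

```python
import socket


def can_connect(host: str, port: int = 9618, timeout: float = 20.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This only proves that the port is reachable; it does not perform any
    HTCondor authentication or protocol handshake.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return False
```

A False result from the affected machine does not say on which side the problem lies; repeating the check from a second network is what separates a local block from a server outage.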
ID: 32913
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32920 - Posted: 28 Oct 2017, 20:10:07 UTC
Last modified: 28 Oct 2017, 20:10:25 UTC

Next, probably Condor-related, problem this afternoon:

2017-10-28 14:12:41 (3616): Guest Log: [DEBUG] HTCondor ping
2017-10-28 14:14:21 (3616): Guest Log: [DEBUG] 1
2017-10-28 14:14:21 (3616): Guest Log: [DEBUG] DC_NOP failed!
2017-10-28 14:14:21 (3616): Guest Log: SECMAN:2007:Failed to end classad message.
2017-10-28 14:14:21 (3616): Guest Log: 10/28/17 14:13:23 recognized DC_NOP as command name, using command 60011.
2017-10-28 14:14:21 (3616): Guest Log: 10/28/17 14:15:03 SECMAN: no classad from server, failing
2017-10-28 14:14:21 (3616): Guest Log: [ERROR] Could not ping HTCondor.
2017-10-28 14:14:21 (3616): Guest Log: [INFO] Shutting Down.


what is this all about?
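
For anyone tallying these failures across many tasks, a small script can sort stderr output into categories. This is a minimal Python sketch: the patterns are copied from the log lines quoted in this thread, while the category names and the function name classify are made up for illustration. It does not explain the cause, it only groups results that show the same symptom.

```python
import re

# Patterns taken from log lines quoted in this thread.
# The category names (dict keys) are hypothetical labels for this sketch.
FAILURE_PATTERNS = {
    "condor_connect": re.compile(r"Could not connect to Condor server on port \d+"),
    "htcondor_ping": re.compile(r"Could not ping HTCondor"),
    "secman": re.compile(r"SECMAN"),
}


def classify(stderr_text: str) -> list[str]:
    """Return the names of all categories whose pattern occurs in the text,
    in the order they are listed in FAILURE_PATTERNS."""
    return [
        name
        for name, pattern in FAILURE_PATTERNS.items()
        if pattern.search(stderr_text)
    ]
```

Run over the log above, this reports both the HTCondor ping failure and the SECMAN messages, which suggests treating them as one symptom rather than two separate errors.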
ID: 32920
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32929 - Posted: 30 Oct 2017, 6:49:51 UTC

This morning I noticed a CMS task which had been running for more than 16 hours, but the Windows Task Manager showed no CPU usage.

I terminated the task, and from stderr I could see that total CPU usage was only 75 secs.
For more information see here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=163118321

What looks very strange to me is the "Hypervisor System log" in the lower part of the stderr - I had never seen anything like this before.

Can anyone tell me what was going wrong?
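
A task like this can be spotted earlier by comparing CPU time with elapsed time. The sketch below is a minimal Python example using the figures from this post (75 seconds of CPU in roughly 16 hours). The function names and the 5% threshold are assumptions chosen for illustration; they are not BOINC settings.

```python
def cpu_ratio(cpu_seconds: float, elapsed_seconds: float) -> float:
    """Fraction of the elapsed wall-clock time that was spent on the CPU."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return cpu_seconds / elapsed_seconds


def looks_stalled(cpu_seconds: float, elapsed_seconds: float,
                  min_ratio: float = 0.05) -> bool:
    """Flag a task whose CPU usage is below `min_ratio` of its run time.

    min_ratio is a hypothetical threshold; a single-core task that is
    actually crunching would normally sit far above it.
    """
    return cpu_ratio(cpu_seconds, elapsed_seconds) < min_ratio
```

For the task above, cpu_ratio(75, 16 * 3600) is about 0.0013, i.e. roughly 0.13% CPU usage, so it would be flagged long before the 16-hour mark.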
ID: 32929
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 32930 - Posted: 30 Oct 2017, 7:36:52 UTC

They all fail on my Windows 10 PC while Atlas tasks run perfectly on the same machine. Nobody has given me any explanation.
Tullio
ID: 32930
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32931 - Posted: 30 Oct 2017, 8:12:10 UTC

Something seems strange with the CMS tasks lately.

Two days ago, for example, I had a few cases where tasks errored out after several hours, with

-1073740791 (0xC0000409) STATUS_STACK_BUFFER_OVERRUN

shown under "final status".

More information on such a task here:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=163021052
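
The two numbers in the final status are the same 32-bit value, once read as a signed integer and once as unsigned hex. A minimal Python sketch of the conversion, useful when searching for an exit code that is only shown in decimal (the function names are made up for illustration):

```python
def to_ntstatus_hex(exit_code: int) -> str:
    """Format a (possibly negative) 32-bit exit code as an unsigned hex string."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def to_signed32(value: int) -> int:
    """Interpret the low 32 bits of `value` as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value
```

Here to_ntstatus_hex(-1073740791) gives 0xC0000409, the STATUS_STACK_BUFFER_OVERRUN code shown above, and to_signed32(0xC0000409) gives back -1073740791.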
ID: 32931
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1422
Credit: 9,484,585
RAC: 852
Message 32932 - Posted: 30 Oct 2017, 8:49:37 UTC - in response to Message 32929.  

...
Can anyone tell me what was going wrong?

After Requesting an X509 credential from LHC@home
there is no fast benchmark and no HTCondor ping,
so somehow the request did not get an answer.
The job should have been aborted.
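
The same check can be scripted: look for the expected startup steps in order and report the first one that never shows up. This is a minimal Python sketch. "HTCondor ping" matches a log line quoted earlier in this thread; the other two marker strings are assumptions based on the wording above and may not match the real log text exactly. The function name is made up for illustration.

```python
# Expected startup steps, in order. Only "HTCondor ping" is confirmed by a
# log line quoted in this thread; the other two strings are assumptions.
MILESTONES = [
    "Requesting an X509 credential",
    "fast benchmark",
    "HTCondor ping",
]


def first_missing_milestone(log_text: str, milestones=MILESTONES):
    """Return the first milestone not found (in order) in the log, or None
    if every milestone appears after the previous one."""
    position = 0
    for marker in milestones:
        index = log_text.find(marker, position)
        if index == -1:
            return marker
        position = index + len(marker)
    return None
```

A log that stops after the credential request would return "fast benchmark", which is the situation described above: the request went out and nothing came back.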
ID: 32932
Erich56

Joined: 18 Dec 15
Posts: 1821
Credit: 118,983,773
RAC: 18,214
Message 32938 - Posted: 31 Oct 2017, 6:30:54 UTC - in response to Message 32864.  

Hm, so it seems that there may indeed be some kind of problem with the Condor server, one that occurs not too often, but once in a while.


Ivan wrote:

There is, still, a very big [authentication] problem with the Condor server. However, Volunteer jobs should not be communicating with it.
tl;dr: what communicates with Condor is the log-merge processes, and these should only run on CMS resources. If they try to run on Volunteer hosts, we really need to look into it.
We are trying to solve these remaining problems, but the scattered and disparate nature of the people who need to be involved is a drawback. Northern hemisphere summer was a problem, due to holidays. I'd like it to be fixed soon but, you know, winter and Christmas...


Once more, yesterday evening I had this problem:

2017-10-30 21:11:32 (3808): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-30 21:11:52 (3808): VM Completion File Detected.
2017-10-30 21:11:52 (3808): VM Completion Message: Could not connect to Condor server on port 9618

So, obviously there is still some kind of issue with the Condor server.
ID: 32938
marmot
Joined: 5 Nov 15
Posts: 144
Credit: 6,301,268
RAC: 0
Message 32944 - Posted: 31 Oct 2017, 7:33:13 UTC - in response to Message 32913.  

Again, the same problem with a task this morning: it errored out after a few minutes:

2017-10-27 09:29:48 (6820): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-10-27 09:30:08 (6820): VM Completion File Detected.
2017-10-27 09:30:08 (6820): VM Completion Message: Could not connect to Condor server on port 9618


This error and the other error you posted are also occurring in Theory sims.
I posted the variety of errors from my last 3 days of Theory runs in your other thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4496&postid=32939#32939
ID: 32944