CMS Tasks Failing
Author | Message |
---|---|
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 748 |
Spring and Autumn is therefore a good time. We volunteers would be happy if this were solved. Ivan, we are staying with you! |
Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
All Atlas tasks complete and validate both on my Windows 10 PC and the SUN Linux box. All other LHC tasks fail, save SixTrack. What is the difference between Atlas tasks and other VM tasks? Tullio |
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 489 |
"Spring and Autumn is therefore a good time." Thanks, I need the feedback sometimes. Especially as I need to get this into real CMS production before I retire. I'm contracted until 31/3/18; there might be funds until the end of next year. After that I may be forced back to Byron Bay. :-( (The health service is free here, not so in Oz.) |
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985 |
... I retire ... :-( :-( :-( |
Joined: 24 Oct 04 Posts: 1174 Credit: 54,887,670 RAC: 9,455 |
"Spring and Autumn is therefore a good time." Moving to New South Wales? Australia is too far away for me to ever travel. Is this *semi-retired*?? (That is what I have always claimed myself.) But I have lived here since 1971, so most of my travel is on the bigscreen via satellite (which ends up costing as much as actually traveling). I just might have to run a few hundred CMS for you, since I was just leaving those to you over at -dev. I do like that contract date, since I will hit that dreaded *60* mark in January ... and start my 14th year here in a couple of days (10-24-2004). Over here most people move to live by the ocean, or for some hot reason down to Arizona, when they retire ... but I have lived by the water pretty much all 60 years, so Byron Bay sounds like the place for that, Ivan (humid subtropical climate). It will be warmer in December down there. And it looks like there are not a lot of people there (I wish some where I live would move there). Here it is rain until June and occasional wind, just enough to knock out the power and shut down my fleet of computers ... and everything else (yeah, you would think I would have got a generator by now). I only have an ale on a rare occasion, but switching from Samuel Smith's to a Foster's would be impossible (OK, it does look like they have hundreds of micro brewers, just like we have here). I would take care of that 20-core Xeon for you, of course. But let me know if you want me to run some CMS multi-cores. Volunteer Mad Scientist For Life |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 748 |
"But let me know if you want me to run some CMS multi-cores" Magic, you can set your preferences in -dev to get both multicore tasks (Theory and CMS). This is running very well for me in production (single-core only). |
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 489 |
"Spring and Autumn is therefore a good time." Yeah, I was born there, 66 years ago. In the bush. In a banana shed! Not sure about semi-retired, but I'd go mental being fully retired. Perhaps write that SF novel that's been in my head for decades? "But I have lived here since 1971 so most of my travel is on the bigscreen via satellite (which ends up costing as much as actually traveling)" I realised recently that this is the longest I've worked at one place, the longest I've worked in one country (17 years UK, 13 years CH, 14 years Oz including Uni, and 4 years CA), and the longest I've lived at one address. :-( "Over here most people move to live by the ocean or for some hot reason down to Arizona when they retire........but I have lived by the water pretty much all 60 years so Byron Bay sounds like the place for that Ivan (humid subtropical climate)" I know what you mean about the rain; I lived for four years in Vancouver BC. I once asked a chap I was sharing a Whistler chair-lift with where he was from: "Everwet, Washington!" Surprisingly, Expo '86 was completely dry except for the last day -- which was also the day I was riding my motorcycle from Coos Bay (IIRC) to YVR on my way back from a conference in SF. I was a bit bedraggled at the border crossing... On the other hand, Byron Bay gets 100 inches per year, mostly as summer thunderstorms and the rain-depression aftermaths of cyclones (Pacific hurricanes) that have run out of steam on their way south down the Queensland coast. The humidity has got to me ever since I moved away to Uni in Canberra -- I ended up not going home for Christmas while I was doing my PhD; 42 C and 20% humidity was preferable to 32 C and 90% humidity. |
Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 489 |
"I would take care of that 20-core Xeon for you of course" Unfortunately, that machine and its twin (they each take up a half-width slot in a 2U rack chassis) belong to work, not to me. They both run SETI@Home in the background as a "stress test". :-) So far, everyone seems happy with that. Running CMS@Home on brphab and another 12-core machine is easier to justify, of course, given my involvement with this project. |
Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
"All Atlas tasks complete and validate both on my Windows 10 PC and the SUN Linux box. All other LHC tasks fail, save SixTrack. What is the difference between Atlas tasks and other VM tasks?" If you watch the VMs of each as they progress until they start crunching, ATLAS seems to be on a different development path, while CMS, Theory and LHCb seem to use near-identical scripts and setups. |
Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
That is why I am running only Atlas@home and SixTrack, when available. I abort all the rest. Tullio |
Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 748 |
I have CMS and Theory selected in preferences, and they are running very well on a server without problems! |
Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
For the last couple of weeks I've been stress testing LHC@Home work on this server by overloading BOINC with high-disk-usage tasks (Pogs, Rosetta) running alongside the various VM-based LHC WU types, and CMS has the highest failure rate. "Unable to mount root" is the typical failure. If Theory, CMS, LHCb and ATLAS are starved for CPU cycles by running competing applications that use most of the available threads (simulating game playing, etc.), again, CMS has the highest failure rate on this server. The only way to get 12 CMS tasks to eventually all suspend successfully on the hardware RAID0 3-drive partition dedicated to BOINC is to accept the first suspend failures and defragment the partition, which will be 95%+ fragmented. Subsequent suspension attempts will almost always succeed once the drive is defragmented. This says something about how BOINC moves the large WU VDIs into the slots when a whole batch of WUs is started. Maybe there is a BOINC setting to force it to move one slot at a time instead of multi-tasking that procedure? (This project is certainly designed for SSD drives, which are not in my budget.) Could the project be modified so that WUs use a linked clone of the base VDI for each slot, saving on space, fragmentation and setup time? |
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985 |
again, same problem with a task this morning: it errored out after a few minutes: 2017-10-27 09:29:48 (6820): Guest Log: [DEBUG] Testing connection to Condor server on port 9618 2017-10-27 09:30:08 (6820): VM Completion File Detected. 2017-10-27 09:30:08 (6820): VM Completion Message: Could not connect to Condor server on port 9618 |
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985 |
Next, a probably Condor-related problem this afternoon: 2017-10-28 14:12:41 (3616): Guest Log: [DEBUG] HTCondor ping 2017-10-28 14:14:21 (3616): Guest Log: [DEBUG] 1 2017-10-28 14:14:21 (3616): Guest Log: [DEBUG] DC_NOP failed! 2017-10-28 14:14:21 (3616): Guest Log: SECMAN:2007:Failed to end classad message. 2017-10-28 14:14:21 (3616): Guest Log: 10/28/17 14:13:23 recognized DC_NOP as command name, using command 60011. 2017-10-28 14:14:21 (3616): Guest Log: 10/28/17 14:15:03 SECMAN: no classad from server, failing 2017-10-28 14:14:21 (3616): Guest Log: [ERROR] Could not ping HTCondor. 2017-10-28 14:14:21 (3616): Guest Log: [INFO] Shutting Down. What is this all about? |
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985 |
This morning I noticed a CMS task which had been running for more than 16 hours, but the Windows Task Manager showed no CPU usage. I terminated the task, and from stderr I could see that total CPU usage was only 75 secs. For more information see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163118321 The "Hypervisor System log" in the lower part of the stderr looks very strange to me; I had never seen anything like it before. Can anyone tell me what went wrong? |
Joined: 19 Feb 08 Posts: 708 Credit: 4,336,250 RAC: 0 |
They all fail on my Windows 10 PC while Atlas tasks run perfectly on the same machine. Nobody has given me any explanation. Tullio |
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985 |
Something seems strange with the CMS tasks lately. Two days ago, for example, I had a few cases where tasks errored out after several hours, with -1073740791 (0xC0000409) STATUS_STACK_BUFFER_OVERRUN shown under "final status". More information on such a task here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=163021052 |
Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 2,078 |
After "Requesting an X509 credential from LHC@home" there is no fast benchmark and no HTCondor ping, so somehow the request did not get an answer. The job should have been aborted. |
Joined: 18 Dec 15 Posts: 1821 Credit: 118,921,996 RAC: 33,985 |
Hm, so it seems that there may indeed be some kind of problem with the Condor server, one that does not occur too often, but once in a while. Once more, yesterday evening I had this problem: 2017-10-30 21:11:32 (3808): Guest Log: [DEBUG] Testing connection to Condor server on port 9618 2017-10-30 21:11:52 (3808): VM Completion File Detected. 2017-10-30 21:11:52 (3808): VM Completion Message: Could not connect to Condor server on port 9618 So, obviously there is still some kind of issue with the Condor server. |
Joined: 5 Nov 15 Posts: 144 Credit: 6,301,268 RAC: 0 |
"again, same problem with a task this morning: it errored out after a few minutes:" This error and the other error you posted are also occurring in Theory sims. I posted the variety of errors from my last 3 days of Theory runs in your other thread: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4496&postid=32939#32939 |
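
For anyone who wants to triage these stderr excerpts in bulk, here is a rough Python sketch, not anything the project provides. It splits a flattened stderr dump on the "YYYY-MM-DD hh:mm:ss (pid):" prefix seen in the logs posted in this thread, flags the two Condor failure messages quoted here, and converts a signed exit code to the hex form shown under "final status". The function names and the labels `condor_connect` and `condor_ping` are made up for illustration, and the signature list only covers the strings posted in this thread.

```python
import re

# Timestamp + PID prefix as it appears in the stderr excerpts in this thread,
# e.g. "2017-10-27 09:30:08 (6820): VM Completion File Detected."
ENTRY_START = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): ")

# Substrings copied from the posted logs; the labels are hypothetical.
SIGNATURES = {
    "Could not connect to Condor server": "condor_connect",
    "Could not ping HTCondor": "condor_ping",
    "DC_NOP failed": "condor_ping",
}


def split_entries(text):
    """Split a flattened stderr dump into (timestamp, pid, message) tuples."""
    matches = list(ENTRY_START.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        # Each message runs until the next timestamp prefix (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        entries.append((m.group(1), int(m.group(2)), text[m.end():end].strip()))
    return entries


def classify(text):
    """Return the sorted, distinct failure labels found in a stderr dump."""
    labels = set()
    for _, _, message in split_entries(text):
        for needle, label in SIGNATURES.items():
            if needle in message:
                labels.add(label)
    return sorted(labels)


def exit_code_to_hex(code):
    """Render a signed 32-bit exit code as unsigned hex (NTSTATUS style)."""
    return f"0x{code & 0xFFFFFFFF:08X}"
```

For example, `exit_code_to_hex(-1073740791)` gives `0xC0000409`, which matches the STATUS_STACK_BUFFER_OVERRUN value reported above. An empty list from `classify` only means none of the listed strings were found, not that the task was healthy.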
©2024 CERN