Message boards : CMS Application : CMS Tasks Failing
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 42940 - Posted: 1 Jul 2020, 17:56:35 UTC - in response to Message 42938.  

ID: 42940 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,369,175
RAC: 102,021
Message 42941 - Posted: 1 Jul 2020, 18:04:10 UTC - in response to Message 42940.  

Ivan, many thanks, as usual, for taking care of the problem :-)
ID: 42941 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 42948 - Posted: 2 Jul 2020, 19:13:53 UTC - in response to Message 42940.  

ID: 42948 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42950 - Posted: 3 Jul 2020, 5:44:01 UTC - in response to Message 42948.  

I got this during startup:


and after 18 minutes run time: Exit status 207 (0x000000CF) EXIT_NO_SUB_TASKS
ID: 42950 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42954 - Posted: 3 Jul 2020, 13:08:31 UTC - in response to Message 42950.  

OK, with the next task, resolving gitlab.cern.ch worked and the VM connected. cmsRun started after 13 minutes of up-time.
ID: 42954 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 42963 - Posted: 4 Jul 2020, 11:54:37 UTC - in response to Message 42950.  

Crystal Pellet wrote:
I got this during startup:


and after 18 minutes run time: Exit status 207 (0x000000CF) EXIT_NO_SUB_TASKS

If I recall correctly, there was a message yesterday at CERN that there was a glitch at gitlab.
ID: 42963 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,151,503
RAC: 15,790
Message 42964 - Posted: 4 Jul 2020, 12:15:43 UTC

I just had one CMS task that was preempted by BOINC after about 4 hours and removed from memory. Another CMS task was started instead. The log shows:
04:01:57.503788 Changing the VM state from 'RUNNING' to 'SUSPENDING'
04:01:57.557656 PDMR3Suspend: 53 774 707 ns run time
04:01:57.557676 Changing the VM state from 'SUSPENDING' to 'SUSPENDED'
04:01:57.557691 Console: Machine state changed to 'Paused'
04:01:57.558492 Console: Machine state changed to 'Saving'
04:01:57.561391 Changing the VM state from 'SUSPENDED' to 'SAVING'
04:02:10.555025 SSM: Footer at 0x60cdf39a (1624109978), 31 directory entries.
04:02:10.562875 SSM: Successfully saved the VM state to 'C:\ProgramData\BOINC\slots\5\boinc_ef5713cf98d25628\Snapshots\2020-07-04T11-55-04-227832500Z.sav'
04:02:10.562893 Changing the VM state from 'SAVING' to 'SUSPENDED'
04:02:10.562927 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Saving, InUninit=0)
04:02:10.562944 Display::handleDisplayResize: uScreenId=0 pvVRAM=000000000a850000 w=800 h=600 bpp=32 cbLine=0xC80 flags=0x1
04:02:10.574484 VRDP: TCP server closed.
04:02:10.575348 Changing the VM state from 'SUSPENDED' to 'POWERING_OFF'

Then a bunch of register values etc.

The computer is using about 48 GB of memory but it is allowed to use 57 of 64 GB so that shouldn't be it.
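
For anyone comparing similar logs, the state transitions in an excerpt like the one above can be pulled out with a few lines of Python. This is only a rough sketch: the regular expression is written against the line format shown in this post, and the function name `vm_transitions` is made up for illustration.

```python
import re

# Matches lines of the form seen in the excerpt above, e.g.
# 04:01:57.503788 Changing the VM state from 'RUNNING' to 'SUSPENDING'
STATE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}\.\d+) Changing the VM state from '(\w+)' to '(\w+)'"
)

def vm_transitions(lines):
    """Return (timestamp, old_state, new_state) for every state-change line."""
    found = []
    for line in lines:
        match = STATE_RE.match(line.strip())
        if match:
            found.append(match.groups())
    return found

# A few lines copied from the log excerpt in this post.
sample = [
    "04:01:57.503788 Changing the VM state from 'RUNNING' to 'SUSPENDING'",
    "04:01:57.557656 PDMR3Suspend: 53 774 707 ns run time",
    "04:01:57.557676 Changing the VM state from 'SUSPENDING' to 'SUSPENDED'",
    "04:01:57.561391 Changing the VM state from 'SUSPENDED' to 'SAVING'",
    "04:02:10.562893 Changing the VM state from 'SAVING' to 'SUSPENDED'",
    "04:02:10.575348 Changing the VM state from 'SUSPENDED' to 'POWERING_OFF'",
]

for timestamp, old, new in vm_transitions(sample):
    print(timestamp, old, "->", new)
```

The sequence RUNNING, SUSPENDING, SUSPENDED, SAVING, SUSPENDED, POWERING_OFF is what the excerpt itself shows: the VM state was saved to disk and the VM was then powered off, which fits a task being removed from memory.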
ID: 42964 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42965 - Posted: 4 Jul 2020, 20:53:17 UTC

A normal run doing 10000 events finally ended, without clear reason, in a Computation error:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=279231284

This task has done only 1 job in > 12 hours of elapsed time, with some pauses in between.
In my opinion Condor creates an error exit without reason.
ID: 42965 · Report as offensive     Reply Quote
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,369,175
RAC: 102,021
Message 42966 - Posted: 5 Jul 2020, 5:00:06 UTC

The tasks which I ran since yesterday on two different machines were okay.
So either I was particularly lucky, or Crystal Pellet was particularly unlucky.
ID: 42966 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42967 - Posted: 5 Jul 2020, 7:06:06 UTC - in response to Message 42966.  

Erich56 wrote:
The tasks which I ran since yesterday on two different machines were okay.
So either I was particularly lucky, or Crystal Pellet was particularly unlucky.

I see some differences:
- On the 2 machines where you have CMS tasks, the tasks are running uninterrupted, so without any pausing.
- I suppose, because of the elapsed time (14-17 hrs), the VMs did at least 2 jobs during that time.
ID: 42967 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42975 - Posted: 7 Jul 2020, 6:35:29 UTC

To narrow down the possible cause of the errors I'm experiencing:

1 task doing 2 uninterrupted sequential jobs gives a successful result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279268115

1 task that only did 1 uninterrupted job ends BOINC-wise in an error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279314065
although the job inside the VM was successful.
ID: 42975 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,928,590
RAC: 137,661
Message 42976 - Posted: 7 Jul 2020, 7:09:50 UTC - in response to Message 42975.  

The failed task had a runtime of 12 h 40 min and a CPU time of 9 h 50 min.
Since CMS subtasks usually finish within 1-2 h on that type of computer (i7-2600) I wonder what it did all the (CPU-)time.

Are you sure the CPU throttle did not introduce unexpected issues?
2020-07-06 19:45:25 (10228): Setting CPU throttle for VM. (60%)
2020-07-06 19:52:47 (10228): Setting CPU throttle for VM. (80%)
2020-07-06 21:45:50 (10228): Setting CPU throttle for VM. (70%)
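
To put numbers on the figures quoted above, here is a small Python sketch that computes the ratio of CPU time to run time and collects the throttle percentages from the three log lines. The helper name `minutes` and the regular expression are illustrative assumptions, and the sketch does not claim that the throttle explains the ratio; it only lines the two up.

```python
import re

def minutes(hours, mins):
    """Convert an 'h min' duration to minutes."""
    return hours * 60 + mins

runtime_min = minutes(12, 40)   # run time quoted above: 12 h 40 min
cpu_min = minutes(9, 50)        # CPU time quoted above: 9 h 50 min
cpu_fraction = cpu_min / runtime_min

# Throttle lines copied from this post.
log_lines = [
    "2020-07-06 19:45:25 (10228): Setting CPU throttle for VM. (60%)",
    "2020-07-06 19:52:47 (10228): Setting CPU throttle for VM. (80%)",
    "2020-07-06 21:45:50 (10228): Setting CPU throttle for VM. (70%)",
]

THROTTLE_RE = re.compile(r"Setting CPU throttle for VM\. \((\d+)%\)")

throttles = []
for line in log_lines:
    match = THROTTLE_RE.search(line)
    if match:
        throttles.append(int(match.group(1)))

print(f"CPU time / run time = {cpu_fraction:.1%}")   # 77.6%
print("throttle settings seen:", throttles)
```

So the task used roughly three quarters of its wall-clock time as CPU time, which is in the same range as the 60-80% throttle settings in the log.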
ID: 42976 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42977 - Posted: 7 Jul 2020, 9:34:20 UTC - in response to Message 42976.  

computezrmle, from your post in this thread on the 1st of July:
"As far as I see in my statistics all goes fine if at least a 2nd subtask can be run.
This mess is older than the big bang and hidden deep in the interaction between htcondor, wmagent and CMS scripts and needs to be fixed by the developers."

That i7-2600 has 8 threads with hyperthreading on. One thread was running CMS and 7 were running WCG Open Pandemic.
With that load I'm not able to finish 2 CMS sub-jobs within the 18-hour limit, and the task will be killed.
In my success example I reduced the load to 4 busy threads to be able to finish 2 sub-jobs on time.
The VM CPU throttling was on purpose, to let the VM run overnight so I could watch how it would finish the next morning and be sure that the sub-job would pass the 12-hour mark.
I don't think the VM throttling has a negative effect, but if it does, that would be an avoidable error too.
Like you, I think there is something wrong when only 1 sub-job has finished.
In my enclosed picture you see that the sub-job finished OK (Ivan could probably see that in his DB), so there is something wrong with the job handling by Condor, WMAgent, or something else.
ID: 42977 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 42981 - Posted: 8 Jul 2020, 11:22:11 UTC

Yesterday I started 4 CMS tasks concurrently, and after about 1 hour they were the only BOINC tasks on the 8 threads.
This way they would each do at least 2 jobs inside the VM. 2 tasks did 2 jobs, the other 2 tasks did 3 jobs.
I had to restart BOINC once, because 1 CMS task got the "postponed waiting 86400 sec." status.

All 4 tasks finished OK, so I think that when a task has done only 1 job, it will fail BOINC-wise.
ID: 42981 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 43058 - Posted: 14 Jul 2020, 13:21:47 UTC

ID: 43058 · Report as offensive     Reply Quote
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 997
Credit: 6,264,307
RAC: 71
Message 43065 - Posted: 15 Jul 2020, 15:08:47 UTC - in response to Message 43058.  

We are running again.
ID: 43065 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,151,503
RAC: 15,790
Message 43087 - Posted: 19 Jul 2020, 18:14:00 UTC

All my CMS tasks for this afternoon have started to fail with error 206 (0x000000CE) EXIT_INIT_FAILURE. Here's one example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279343405
ID: 43087 · Report as offensive     Reply Quote
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 43089 - Posted: 19 Jul 2020, 19:29:33 UTC - in response to Message 43087.  

Harri Liljeroos wrote:
All my CMS tasks for this afternoon have started to fail with error 206 (0x000000CE) EXIT_INIT_FAILURE. Here's one example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279343405

Same for me, and most likely for others too. It is a proxy issue: the log states that it could not read the PEM file.

2020-07-19 19:27:47 (16908): Guest Log: ERROR: Couldn't read proxy from: /tmp/x509up_u0

2020-07-19 19:27:47 (16908): Guest Log:        globus_credential: Error reading proxy credential

2020-07-19 19:27:47 (16908): Guest Log: globus_credential: Error reading proxy credential: Couldn't read PEM from bio

2020-07-19 19:27:47 (16908): Guest Log: OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line
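
As background on the last log line: OpenSSL's "no start line" error generally means that no `-----BEGIN ...-----` header was found in the data it was asked to parse. The thread does not show what /tmp/x509up_u0 actually contained, so the Python sketch below only mimics that symptom on made-up strings; `has_pem_start_line` is a hypothetical helper and is not what Globus or OpenSSL runs internally.

```python
def has_pem_start_line(text):
    """True if any line looks like a PEM header such as '-----BEGIN CERTIFICATE-----'."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("-----BEGIN ") and stripped.endswith("-----"):
            return True
    return False

# Illustrative inputs only; neither is taken from a real proxy file.
empty_proxy = ""
placeholder_pem = (
    "-----BEGIN CERTIFICATE-----\n"
    "(base64 data would go here)\n"
    "-----END CERTIFICATE-----\n"
)

print(has_pem_start_line(empty_proxy))      # False
print(has_pem_start_line(placeholder_pem))  # True
```

An empty file, or a file holding something other than PEM data (for example an error page saved in place of the proxy), would fail this check in the same way.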
ID: 43089 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 43302 - Posted: 30 Aug 2020, 19:21:53 UTC - in response to Message 42981.  

Crystal Pellet wrote:
Yesterday I started 4 CMS tasks concurrently, and after about 1 hour they were the only BOINC tasks on the 8 threads.
This way they would each do at least 2 jobs inside the VM. 2 tasks did 2 jobs, the other 2 tasks did 3 jobs.
I had to restart BOINC once, because 1 CMS task got the "postponed waiting 86400 sec." status.

All 4 tasks finished OK, so I think that when a task has done only 1 job, it will fail BOINC-wise.

I tested this again. The BOINC task was able to run only 1 cmsRun and was ready, uploading the result, after 12 hours of elapsed time.
Again it failed BOINC-wise, with an unknown error code: https://lhcathome.cern.ch/lhcathome/result.php?resultid=282223010
ID: 43302 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 43323 - Posted: 9 Sep 2020, 9:26:33 UTC - in response to Message 43302.  

I have now tested with 4 tasks running without hyperthreading, so each was able to do 2 cmsRuns during the elapsed time.
Now all 4 were valid.

Conclusion: Condor exits with an error when only 1 job is done during the VM lifetime.

I cannot use all 8 threads, because CMS is not able to run 2 sub-tasks on that machine within the maximum of 18 hours elapsed.
CMS could reduce the number of events from 10,000 to 5,000, or solve the Condor problem.
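
The arithmetic behind that suggestion can be sketched as follows. The 18-hour limit and the 10,000 and 5,000 event counts come from this thread; the 10-hour job duration is a purely hypothetical figure for a fully loaded machine, and the sketch assumes that job time scales linearly with the number of events and ignores VM startup time, neither of which has been confirmed here.

```python
import math

def jobs_that_fit(limit_hours, job_hours):
    """Number of whole jobs of a given length that fit inside the time limit."""
    return math.floor(limit_hours / job_hours)

LIMIT_HOURS = 18                  # maximum elapsed time mentioned in this thread

# HYPOTHETICAL: duration of one 10,000-event job on a fully loaded machine.
job_hours_10k = 10.0

# ASSUMPTION: run time scales linearly with the number of events.
job_hours_5k = job_hours_10k * 5000 / 10000

print(jobs_that_fit(LIMIT_HOURS, job_hours_10k))  # 1
print(jobs_that_fit(LIMIT_HOURS, job_hours_5k))   # 3
```

Under those assumptions, any job that takes longer than 9 hours leaves room for only one job per task, which is exactly the case that fails BOINC-wise.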
ID: 43323 · Report as offensive     Reply Quote