CMS Tasks Failing
Author | Message |
---|---|
Joined: 18 Dec 15 Posts: 1838 Credit: 123,569,641 RAC: 143,613 |
Ivan, many thanks, as usual, for taking care of the problem :-) |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,637,374 RAC: 2,040 |
I got this during startup: (screenshot) and after 18 minutes of run time: Exit status 207 (0x000000CF) EXIT_NO_SUB_TASKS |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,637,374 RAC: 2,040 |
OK, with the next task, resolving gitlab.cern.ch worked and the VM connected. cmsRun started after 13 minutes of uptime. |
Joined: 28 Sep 04 Posts: 739 Credit: 51,190,295 RAC: 49,299 |
I just had one CMS task that was preempted by BOINC after about 4 hours and removed from memory. Another CMS task was started instead. The log shows: 04:01:57.503788 Changing the VM state from 'RUNNING' to 'SUSPENDING' 04:01:57.557656 PDMR3Suspend: 53 774 707 ns run time 04:01:57.557676 Changing the VM state from 'SUSPENDING' to 'SUSPENDED' 04:01:57.557691 Console: Machine state changed to 'Paused' 04:01:57.558492 Console: Machine state changed to 'Saving' 04:01:57.561391 Changing the VM state from 'SUSPENDED' to 'SAVING' 04:02:10.555025 SSM: Footer at 0x60cdf39a (1624109978), 31 directory entries. 04:02:10.562875 SSM: Successfully saved the VM state to 'C:\ProgramData\BOINC\slots\5\boinc_ef5713cf98d25628\Snapshots\2020-07-04T11-55-04-227832500Z.sav' 04:02:10.562893 Changing the VM state from 'SAVING' to 'SUSPENDED' 04:02:10.562927 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Saving, InUninit=0) 04:02:10.562944 Display::handleDisplayResize: uScreenId=0 pvVRAM=000000000a850000 w=800 h=600 bpp=32 cbLine=0xC80 flags=0x1 04:02:10.574484 VRDP: TCP server closed. 04:02:10.575348 Changing the VM state from 'SUSPENDED' to 'POWERING_OFF' After that comes a bunch of register values, etc. The computer is using about 48 GB of memory, but it is allowed to use 57 of 64 GB, so that shouldn't be the cause. |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,637,374 RAC: 2,040 |
A normal run processing 10000 events finally ended, without any clear reason, in a Computation error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279231284 This task did only 1 job in more than 12 hours of elapsed time, with some pauses in between. In my opinion, Condor creates an error exit without reason. |
Joined: 18 Dec 15 Posts: 1838 Credit: 123,569,641 RAC: 143,613 |
The tasks which I ran since yesterday on two different machines were okay. So either I was particularly lucky, or Crystal Pellet was particularly unlucky. |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,637,374 RAC: 2,040 |
"The tasks which I ran since yesterday on two different machines were okay." I see some differences: - On the 2 machines where you have CMS tasks, the tasks are running uninterrupted, so without any pausing. - I suppose that, given the elapsed time (14-17 hrs), the VMs did at least 2 jobs during that time. |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,637,374 RAC: 2,040 |
To narrow down the possible cause of the errors I'm experiencing: 1 task doing 2 uninterrupted sequential jobs gives a successful result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279268115 1 task that did only 1 uninterrupted job ends in an error BOINC-wise: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279314065 although the job inside the VM was successful (screenshot). |
Joined: 15 Jun 08 Posts: 2575 Credit: 259,414,855 RAC: 106,569 |
The failed task had a runtime of 12 h 40 min and a CPU time of 9 h 50 min. Since CMS subtasks usually finish within 1-2 h on that type of computer (i7-2600), I wonder what it did during all that (CPU) time. Are you sure the CPU throttle did not introduce unexpected issues? 2020-07-06 19:45:25 (10228): Setting CPU throttle for VM. (60%) 2020-07-06 19:52:47 (10228): Setting CPU throttle for VM. (80%) 2020-07-06 21:45:50 (10228): Setting CPU throttle for VM. (70%) |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,637,374 RAC: 2,040 |
computezrmle, from your post in this thread of the 1st of July: "As far as I see in my statistics all goes fine if at least a 2nd subtask can be run." That i7 2600 has 8 threads with hyperthreading on. One thread was running CMS and 7 were running WCG Open Pandemic. With that load I'm not able to finish 2 CMS sub-jobs within the 18-hour limit, and the task will be killed. In my success example I reduced the load to 4 busy threads to be able to finish 2 sub-jobs on time. The VM CPU throttling was on purpose, to let the VM run overnight so that I could watch how it would finish the next morning and be sure that the sub-job would pass the 12-hour mark. I don't think the VM throttling has a negative effect, but if it does, that would be an avoidable error too. Like you, I think there is something wrong when only 1 sub-job has finished. In my enclosed picture you see that the sub-job finished OK (Ivan could probably see that in his DB), so there is something wrong with the job handling by Condor, WMAgent, or whatever else. |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,637,374 RAC: 2,040 |
Yesterday I started 4 CMS tasks concurrently, and after about 1 hour they were the only BOINC tasks on the 8 threads. This way they would do at least 2 jobs inside the VM. 2 tasks did 2 jobs, the other 2 tasks did 3 jobs. I had to restart BOINC once, because 1 CMS task got the "postponed waiting 86400 sec." status. All 4 tasks finished OK, so I think that when a task has done only 1 job, it will fail BOINC-wise. |
Joined: 29 Aug 05 Posts: 1071 Credit: 8,283,586 RAC: 7,784 |
Interruption to CMS@Home tomorrow: see https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5478&postid=43057 |
Joined: 28 Sep 04 Posts: 739 Credit: 51,190,295 RAC: 49,299 |
All my CMS tasks for this afternoon have started to fail with error 206 (0x000000CE) EXIT_INIT_FAILURE. Here's one example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279343405 |
Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0 |
"All my CMS tasks for this afternoon have started to fail with error 206 (0x000000CE) EXIT_INIT_FAILURE. Here's one example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279343405" Same for me, and most likely for others too. It is a proxy issue: the log states that it could not read the PEM file. 2020-07-19 19:27:47 (16908): Guest Log: ERROR: Couldn't read proxy from: /tmp/x509up_u0 2020-07-19 19:27:47 (16908): Guest Log: globus_credential: Error reading proxy credential 2020-07-19 19:27:47 (16908): Guest Log: globus_credential: Error reading proxy credential: Couldn't read PEM from bio 2020-07-19 19:27:47 (16908): Guest Log: OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,637,374 RAC: 2,040 |
"Yesterday I started 4 CMS tasks concurrently, and after about 1 hour they were the only BOINC tasks on the 8 threads." I tested this again. The BOINC task was able to run only 1 cmsRun and was ready, uploading the result, after 12 hours of elapsed time. Again it failed BOINC-wise with "Unknown error code": https://lhcathome.cern.ch/lhcathome/result.php?resultid=282223010 |
Joined: 14 Jan 10 Posts: 1439 Credit: 9,637,374 RAC: 2,040 |
I have now tested with 4 tasks running without hyperthreading, which were therefore able to do 2 cmsRuns each during the elapsed time. Now all 4 were valid. Conclusion: Condor exits with an error when only 1 job is done during the VM lifetime. I cannot use all 8 threads, because CMS is not able to run 2 sub-tasks on that machine within the maximum of 18 hours elapsed. CMS could reduce the number of events from 10,000 to 5,000 or solve the Condor problem. |
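
For anyone who wants to sift their own result logs for the items quoted in this thread, here is a minimal Python sketch. It assumes only what appears in the posts above: the two exit statuses (206 and 207) with their names, and the format of the "Setting CPU throttle for VM." lines. The helper names `format_exit_status` and `parse_throttle_changes` are hypothetical and are not part of BOINC or vboxwrapper.

```python
import re

# Exit statuses quoted in this thread; other codes are deliberately not listed.
EXIT_NAMES = {
    206: "EXIT_INIT_FAILURE",
    207: "EXIT_NO_SUB_TASKS",
}

# Matches lines such as:
# 2020-07-06 19:45:25 (10228): Setting CPU throttle for VM. (60%)
THROTTLE_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"Setting CPU throttle for VM\. \((\d+)%\)"
)


def format_exit_status(code: int) -> str:
    """Render a code as '<decimal> (0x<8 hex digits>) <name>'."""
    name = EXIT_NAMES.get(code, "unknown")  # 'unknown' is this sketch's fallback
    return f"{code} (0x{code:08X}) {name}"


def parse_throttle_changes(log_text: str) -> list[tuple[str, int]]:
    """Return (timestamp, percent) for every throttle change in the log text."""
    return [(ts, int(pct)) for ts, pct in THROTTLE_RE.findall(log_text)]


if __name__ == "__main__":
    # Log lines copied from the post about the i7-2600 task.
    log = (
        "2020-07-06 19:45:25 (10228): Setting CPU throttle for VM. (60%)\n"
        "2020-07-06 19:52:47 (10228): Setting CPU throttle for VM. (80%)\n"
        "2020-07-06 21:45:50 (10228): Setting CPU throttle for VM. (70%)\n"
    )
    print(format_exit_status(207))
    for timestamp, percent in parse_throttle_changes(log):
        print(timestamp, percent)
```

Listing the throttle changes next to the exit status makes it easier to compare a failed single-job task with a successful two-job task, which is the comparison this thread keeps coming back to.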
©2025 CERN