CMS Tasks Failing

Author	Message
ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1090 Credit: 9,319,923 RAC: 1,403	Message 42940 - Posted: 1 Jul 2020, 17:56:35 UTC - in response to Message 42938. Please see https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5469&postid=42939 ID: 42940 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1875 Credit: 137,788,231 RAC: 65,624	Message 42941 - Posted: 1 Jul 2020, 18:04:10 UTC - in response to Message 42940. Ivan, many thanks, as usual, for taking care of the problem :-) ID: 42941 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1090 Credit: 9,319,923 RAC: 1,403	Message 42948 - Posted: 2 Jul 2020, 19:13:53 UTC - in response to Message 42940. O-kay! See https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5469&postid=42947 ID: 42948 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1449 Credit: 9,734,540 RAC: 602	Message 42950 - Posted: 3 Jul 2020, 5:44:01 UTC - in response to Message 42948. I got this during startup: [screenshot not preserved] and after 18 minutes of run time: Exit status 207 (0x000000CF) EXIT_NO_SUB_TASKS ID: 42950 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1449 Credit: 9,734,540 RAC: 602	Message 42954 - Posted: 3 Jul 2020, 13:08:31 UTC - in response to Message 42950. OK, with the next task, resolving gitlab.cern.ch worked and the connection was established. cmsRun started after 13 minutes of uptime. ID: 42954 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1090 Credit: 9,319,923 RAC: 1,403	Message 42963 - Posted: 4 Jul 2020, 11:54:37 UTC - in response to Message 42950.
Quoting Message 42950: "I got this during startup: and after 18 minutes run time: Exit status 207 (0x000000CF) EXIT_NO_SUB_TASKS"
If I recall correctly, there was a message yesterday at CERN that there was a glitch at gitlab. ID: 42963 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 764 Credit: 56,553,470 RAC: 27,082	Message 42964 - Posted: 4 Jul 2020, 12:15:43 UTC I just had one CMS task that was preempted by BOINC after about 4 hours and removed from memory. Another CMS task was started instead. The log shows:
04:01:57.503788 Changing the VM state from 'RUNNING' to 'SUSPENDING'
04:01:57.557656 PDMR3Suspend: 53 774 707 ns run time
04:01:57.557676 Changing the VM state from 'SUSPENDING' to 'SUSPENDED'
04:01:57.557691 Console: Machine state changed to 'Paused'
04:01:57.558492 Console: Machine state changed to 'Saving'
04:01:57.561391 Changing the VM state from 'SUSPENDED' to 'SAVING'
04:02:10.555025 SSM: Footer at 0x60cdf39a (1624109978), 31 directory entries.
04:02:10.562875 SSM: Successfully saved the VM state to 'C:\ProgramData\BOINC\slots\5\boinc_ef5713cf98d25628\Snapshots\2020-07-04T11-55-04-227832500Z.sav'
04:02:10.562893 Changing the VM state from 'SAVING' to 'SUSPENDED'
04:02:10.562927 Console::powerDown(): A request to power off the VM has been issued (mMachineState=Saving, InUninit=0)
04:02:10.562944 Display::handleDisplayResize: uScreenId=0 pvVRAM=000000000a850000 w=800 h=600 bpp=32 cbLine=0xC80 flags=0x1
04:02:10.574484 VRDP: TCP server closed.
04:02:10.575348 Changing the VM state from 'SUSPENDED' to 'POWERING_OFF'
Then a bunch of register values etc. The computer is using about 48 GB of memory, but it is allowed to use 57 of 64 GB, so that shouldn't be it. ID: 42964 · Reply Quote
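Editor's note: the state changes in a log like the one above can be pulled out mechanically, which makes the order of events easier to follow. The following is only a sketch; the function name `state_transitions` and the sample text are illustrative and assume the "Changing the VM state from 'X' to 'Y'" wording shown in the post.

```python
import re

# Matches lines such as:
# 04:01:57.503788 Changing the VM state from 'RUNNING' to 'SUSPENDING'
STATE_CHANGE = re.compile(
    r"(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) Changing the VM state "
    r"from '(?P<old>\w+)' to '(?P<new>\w+)'"
)

def state_transitions(log_text):
    """Return (timestamp, old_state, new_state) for every state change found."""
    return [(m["ts"], m["old"], m["new"]) for m in STATE_CHANGE.finditer(log_text)]

# Two lines copied from the post, used here only as sample input.
sample = (
    "04:01:57.503788 Changing the VM state from 'RUNNING' to 'SUSPENDING'\n"
    "04:01:57.557656 PDMR3Suspend: 53 774 707 ns run time\n"
    "04:02:10.575348 Changing the VM state from 'SUSPENDED' to 'POWERING_OFF'\n"
)

for ts, old, new in state_transitions(sample):
    print(f"{ts}  {old} -> {new}")
```

Applied to the full log in the post, this would list the sequence RUNNING, SUSPENDING, SUSPENDED, SAVING, SUSPENDED, POWERING_OFF, which is what a save-to-disk followed by a power-off looks like.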

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1449 Credit: 9,734,540 RAC: 602	Message 42965 - Posted: 4 Jul 2020, 20:53:17 UTC A normal run of 10000 events finally ended, without a clear reason, in a Computation error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279231284 This task has done only 1 job in more than 12 hours of elapsed time, with some pauses in between. In my opinion, Condor creates an error exit without reason. ID: 42965 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1875 Credit: 137,788,231 RAC: 65,624	Message 42966 - Posted: 5 Jul 2020, 5:00:06 UTC The tasks I have run since yesterday on two different machines were okay. So either I was particularly lucky, or Crystal Pellet was particularly unlucky. ID: 42966 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1449 Credit: 9,734,540 RAC: 602	Message 42967 - Posted: 5 Jul 2020, 7:06:06 UTC - in response to Message 42966.
Quoting Message 42966: "the tasks which I ran since yesterday on two different machines were okay. So either I was particularly lucky, or Crystal Pellet was particularly unlucky"
I see some differences:
- On the 2 machines where you run CMS tasks, the tasks run uninterrupted, without any pausing.
- Judging by the elapsed time (14-17 hrs), I suppose the VMs did at least 2 jobs during that time.
ID: 42967 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1449 Credit: 9,734,540 RAC: 602	Message 42975 - Posted: 7 Jul 2020, 6:35:29 UTC To narrow down the possible cause of the errors I'm experiencing:
- 1 task doing 2 uninterrupted sequential jobs gives a successful result: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279268115
- 1 task that did only 1 uninterrupted job ends in an error BOINC-wise, although the job inside the VM was successful: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279314065
ID: 42975 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2636 Credit: 274,768,907 RAC: 118,676	Message 42976 - Posted: 7 Jul 2020, 7:09:50 UTC - in response to Message 42975. The failed task had a runtime of 12 h 40 min and a CPU time of 9 h 50 min. Since CMS subtasks usually finish within 1-2 h on that type of computer (i7-2600), I wonder what it did during all that (CPU) time. Are you sure the CPU throttle did not introduce unexpected issues?
2020-07-06 19:45:25 (10228): Setting CPU throttle for VM. (60%)
2020-07-06 19:52:47 (10228): Setting CPU throttle for VM. (80%)
2020-07-06 21:45:50 (10228): Setting CPU throttle for VM. (70%)
ID: 42976 · Reply Quote
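Editor's note: the two durations quoted above can be turned into a CPU-to-runtime ratio with a few lines of Python. This is just the arithmetic on the figures from the post, not something taken from the task's own output.

```python
from datetime import timedelta

# Figures quoted in the post above.
runtime = timedelta(hours=12, minutes=40)
cpu_time = timedelta(hours=9, minutes=50)

# Fraction of the wall-clock time during which the task was on the CPU.
cpu_fraction = cpu_time / runtime
print(f"CPU time / runtime = {cpu_fraction:.1%}")
```

The ratio works out to roughly 78%, which lies inside the 60% to 80% throttle settings shown in the log. That is at least compatible with the throttle accounting for the gap between runtime and CPU time, though it does not by itself explain the task failure.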

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1449 Credit: 9,734,540 RAC: 602	Message 42977 - Posted: 7 Jul 2020, 9:34:20 UTC - in response to Message 42976. computezrmle, from your post in this thread of the 1st of July:
"As far as I see in my statistics all goes fine if at least a 2nd subtask can be run. This mess is older than the big bang and hidden deep in the interaction between htcondor, wmagent and CMS scripts and need to be fixed by the developers."
That i7-2600 has 8 threads with hyperthreading on. One thread was running CMS and 7 were running WCG Open Pandemic. With that load I'm not able to finish 2 CMS sub-jobs within the 18-hour limit, and the task will be killed. In my success example I reduced the load to 4 busy threads to be able to finish 2 sub-jobs in time.
The VM CPU throttling was on purpose, to let the VM run overnight so I could watch how it would finish the next morning, and to be sure that the sub-job would pass the 12-hour mark. I don't think the VM throttling has a negative effect, but if it does, that would be an avoidable error too.
Like you, I think there is something wrong when only 1 sub-job has finished. In my enclosed picture [not preserved] you can see that the sub-job finished OK (Ivan could probably see that in his DB), so there is something wrong with the job handling by Condor, WMAgent, or something else. ID: 42977 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1449 Credit: 9,734,540 RAC: 602	Message 42981 - Posted: 8 Jul 2020, 11:22:11 UTC Yesterday I started 4 CMS tasks concurrently, and after about 1 hour they were the only BOINC tasks on the 8 threads. This way each would do at least 2 jobs inside the VM. 2 tasks did 2 jobs, the other 2 tasks did 3 jobs. I had to restart BOINC once, because 1 CMS task got the "postponed waiting 86400 sec." status. All 4 tasks finished OK, so I think that when a task has done only 1 job, it will fail BOINC-wise. ID: 42981 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1090 Credit: 9,319,923 RAC: 1,403	Message 43058 - Posted: 14 Jul 2020, 13:21:47 UTC Interruption to CMS@Home tomorrow: see https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5478&postid=43057 ID: 43058 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1090 Credit: 9,319,923 RAC: 1,403	Message 43065 - Posted: 15 Jul 2020, 15:08:47 UTC - in response to Message 43058. We are running again. ID: 43065 · Reply Quote

Harri Liljeroos Send message Joined: 28 Sep 04 Posts: 764 Credit: 56,553,470 RAC: 27,082	Message 43087 - Posted: 19 Jul 2020, 18:14:00 UTC All my CMS tasks for this afternoon have started to fail with error 206 (0x000000CE) EXIT_INIT_FAILURE. Here's one example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279343405 ID: 43087 · Reply Quote
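Editor's note: the exit statuses in this thread are shown in both decimal and 8-digit hexadecimal. A small sketch of that formatting follows; the lookup table contains only the two names that appear in these posts, and the helper name `format_exit_status` is illustrative.

```python
# Exit statuses and names exactly as reported in this thread.
EXIT_CODES = {
    206: "EXIT_INIT_FAILURE",
    207: "EXIT_NO_SUB_TASKS",
}

def format_exit_status(code):
    """Render a status in the 'NNN (0xXXXXXXXX) NAME' form used in the posts."""
    name = EXIT_CODES.get(code, "unknown")
    return f"{code} (0x{code:08X}) {name}"

for code in EXIT_CODES:
    print(format_exit_status(code))
```

This confirms that 206 and 0x000000CE, and 207 and 0x000000CF, are the same values written in two bases.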

Greger Send message Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0	Message 43089 - Posted: 19 Jul 2020, 19:29:33 UTC - in response to Message 43087.
Quoting Message 43087: "All my CMS tasks for this afternoon have started to fail with error 206 (0x000000CE) EXIT_INIT_FAILURE. Here's one example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=279343405"
Same for me, and most likely for others too. It looks like a proxy issue: the log states that it could not read the PEM file.
2020-07-19 19:27:47 (16908): Guest Log: ERROR: Couldn't read proxy from: /tmp/x509up_u0
2020-07-19 19:27:47 (16908): Guest Log: globus_credential: Error reading proxy credential
2020-07-19 19:27:47 (16908): Guest Log: globus_credential: Error reading proxy credential: Couldn't read PEM from bio
2020-07-19 19:27:47 (16908): Guest Log: OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line
ID: 43089 · Reply Quote
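Editor's note: the OpenSSL message "no start line" generally means the parser found no `-----BEGIN ...-----` header in the data it was given, for example because the file is empty or holds something other than PEM. The sketch below shows one way such a file could be inspected. The helper names are illustrative, and since the path from the log lives inside the guest VM, a check like this would have to run there rather than on the host.

```python
from pathlib import Path

def has_pem_start_line(text):
    """True if any line looks like a PEM header, e.g. '-----BEGIN CERTIFICATE-----'."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("-----BEGIN ") and stripped.endswith("-----"):
            return True
    return False

def check_proxy_file(path):
    """Classify a proxy file as missing, empty, PEM-like, or lacking a start line."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    text = p.read_text(errors="replace")
    if not text.strip():
        return "empty"
    return "has PEM header" if has_pem_start_line(text) else "no PEM start line"

# Path taken from the guest log above; only meaningful inside the VM.
print(check_proxy_file("/tmp/x509up_u0"))
```

A result of "empty" or "no PEM start line" would match the error in the log and would suggest the proxy was never delivered correctly, rather than a problem on the volunteer's machine. This is an inference from the error text, not something the thread confirms.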

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1449 Credit: 9,734,540 RAC: 602	Message 43302 - Posted: 30 Aug 2020, 19:21:53 UTC - in response to Message 42981.
Quoting Message 42981: "I started yesterday 4 CMS-tasks concurrently and after about 1 hour they were the only BOINC-tasks on the 8 threads. This way they would do at least 2 VM inside jobs. 2 tasks did 2 jobs, the other 2 tasks did 3 jobs. I had to restart BOINC once, because 1 CMS-task got the "postponed waiting 86400 sec." status. All 4 tasks finished OK, so I think, when a task has done only 1 job, it will fail BOINCwise."
I tested this again. The BOINC task was able to run only 1 cmsRun and was ready, uploading the result, after 12 hours of elapsed time. Again it failed BOINC-wise, with Unknown error code: https://lhcathome.cern.ch/lhcathome/result.php?resultid=282223010 ID: 43302 · Reply Quote

Crystal Pellet Volunteer moderator Volunteer tester Send message Joined: 14 Jan 10 Posts: 1449 Credit: 9,734,540 RAC: 602	Message 43323 - Posted: 9 Sep 2020, 9:26:33 UTC - in response to Message 43302. I have now tested with 4 tasks running without hyperthreading, which were therefore able to do 2 cmsRun jobs during the elapsed time. This time all 4 were valid. Conclusion: Condor exits with an error when only 1 job is done during the VM lifetime. I cannot use all 8 threads, because CMS is not able to run 2 sub-tasks on that machine within the maximum of 18 hours elapsed. CMS could reduce the number of events from 10,000 to 5,000 or solve the Condor problem. ID: 43323 · Reply Quote
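Editor's note: the suggestion to halve the number of events can be illustrated with rough arithmetic. The sketch below rests on assumptions that are not from the thread: the 10 hours per 10,000-event job is a made-up figure for a heavily loaded machine, job duration is assumed to scale linearly with the event count, and VM startup overhead is ignored. Only the 18-hour limit and the event counts come from the posts.

```python
MAX_ELAPSED_HOURS = 18.0  # task limit mentioned in the thread

def jobs_that_fit(hours_per_job, limit_hours=MAX_ELAPSED_HOURS):
    """Number of complete jobs that fit inside the elapsed-time limit."""
    return int(limit_hours // hours_per_job)

# Hypothetical duration of one 10,000-event job (an assumption, not measured).
hours_10k = 10.0
# Assumed linear scaling with the number of events.
hours_5k = hours_10k * 5_000 / 10_000

print("10,000 events per job:", jobs_that_fit(hours_10k), "job(s) fit")
print(" 5,000 events per job:", jobs_that_fit(hours_5k), "job(s) fit")
```

Under these assumptions only one 10,000-event job fits in 18 hours, which is the case that fails in the posts above, while three 5,000-event jobs would fit.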

LHC@home