Thread '1 (0x00000001) Unknown error code'

Author	Message
computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2725 Credit: 300,276,519 RAC: 46,481	Message 39745 - Posted: 29 Aug 2019, 13:50:13 UTC Most of my CMS tasks fail with "1 (0x00000001) Unknown error code". Grafana shows a failure rate of more than 90% since this afternoon. Should be investigated. https://monit-grafana.cern.ch/d/000000628/cms-job-monitoring?orgId=11&from=now-6h&to=now-12m&refresh=15m&var-group_by=CMS_JobType&var-Tier=All&var-Site=T3_CH_Volunteer&var-Type=All&var-CMS_JobType=All&var-CMSPrimaryDataTier=All&var-binning=12m&var-measurement=condor_12m&var-retention_policy=sample https://monit-grafana.cern.ch/d/000000628/cms-job-monitoring?orgId=11&from=now-6h&to=now-12m&refresh=15m&var-group_by=CMS_JobType&var-Tier=All&var-Site=T3_CH_Volunteer&var-Type=All&var-CMS_JobType=All&var-CMSPrimaryDataTier=All&var-binning=12m&var-measurement=condor_12m&var-retention_policy=sample&fullscreen&panelId=81 ID: 39745 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 39746 - Posted: 29 Aug 2019, 14:19:41 UTC - in response to Message 39745. Yes, I just saw that. Unfortunately I'm at a meeting several hours north of London this week with only a very old and slow netbook to access our web pages. I didn't get around to checking the queues this morning and they drained sooner than I expected. I've submitted a new batch of jobs which should last until I'm home again. As well, a WMAgent module has failed this morning -- something to do with DB polling so I don't think that affects the job queues per se (although, the batch I just sent hasn't shown up in the monitor yet). I've sent an e-mail to the CERN crew. ID: 39746 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 39747 - Posted: 29 Aug 2019, 15:04:27 UTC - in response to Message 39746. That batch is now being seen -- about 70 are already running. ID: 39747 · Reply Quote

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Send message Joined: 15 Jun 08 Posts: 2725 Credit: 300,276,519 RAC: 46,481	Message 39836 - Posted: 6 Sep 2019, 8:29:54 UTC Since 6:48 UTC there's an increasing number of failed tasks. Looks like we are out of jobs. ID: 39836 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 39841 - Posted: 6 Sep 2019, 10:32:53 UTC - in response to Message 39836. Last modified: 6 Sep 2019, 10:34:43 UTC computezrmle wrote: "Since 6:48 UTC there's an increasing number of failed tasks. Looks like we are out of jobs." No. What happened is that yesterday we identified why the "new" CentOS7 jobs were failing. This morning Laurence made a change to our condor config to let them run. I had a batch of 1000 jobs with ~750 still queued; the change let them run but most of them immediately failed because they had been queued too long. However, seven did run! It looks like they created ~6 MB of output in 15 minutes of wall-clock time. Both Federica and I had already submitted new workflows before I tracked down the relevant log files. I've now submitted another batch requesting four times the number of events. So, you'll probably see a number of short jobs later in the weekend once these new jobs make it into the queue, followed by some longer jobs (there are still nearly 5,000 "old" jobs pending). I'll monitor as best I can over the weekend and make adjustments as necessary. Not ideal timing, but we have to go with what fate deals us. ID: 39841 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 39865 - Posted: 8 Sep 2019, 15:30:31 UTC - in response to Message 39841. The "old" jobs are taking a bit longer than I expected, so we should be starting the newer batches some time tomorrow. So, I can go home, listen to the last of the cricket, and watch "Antiques Roadshow"! (I bought a little TV last week; amazing the technology you can get for £119 these days -- 120 TV channels and 80 radio stations too! Still no internet, tho'but...) ID: 39865 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1972 Credit: 159,579,835 RAC: 48,280	Message 39866 - Posted: 8 Sep 2019, 16:51:10 UTC - in response to Message 39865. Ivan, have a nice and enjoyable evening :-) ID: 39866 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1972 Credit: 159,579,835 RAC: 48,280	Message 39891 - Posted: 10 Sep 2019, 12:43:47 UTC Can anyone tell me why this task https://lhcathome.cern.ch/lhcathome/result.php?resultid=245102192 failed after about 16 hours? Really too bad :-( ... Guest Log: [ERROR] Condor ended after 57868 seconds ... ID: 39891 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 39893 - Posted: 10 Sep 2019, 15:59:32 UTC - in response to Message 39891. That appears to be a consequence of the queues draining more quickly than I anticipated last night. The task appears to have run several jobs, but then requested a new one when the queue was (nearly?) empty, and wasn't (properly?) allocated a new job. Eventually it timed out due to no reply. ID: 39893 · Reply Quote

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Send message Joined: 29 Aug 05 Posts: 1152 Credit: 11,734,920 RAC: 442	Message 39895 - Posted: 10 Sep 2019, 16:08:41 UTC Sorry, an earlier message I meant to send to this thread appears to have gone missing. Perhaps I forgot to hit "Post". Longer jobs are on their way; we are tuning the parameters to try to get an acceptable time vs file-size ratio. Let me know of any problems. ID: 39895 · Reply Quote

Erich56 Send message Joined: 18 Dec 15 Posts: 1972 Credit: 159,579,835 RAC: 48,280	Message 41067 - Posted: 25 Dec 2019, 6:48:47 UTC One of my tasks failed after some 6 hours with 1 (0x00000001) Unknown error code ... 2019-12-24 19:57:17 (17992): Guest Log: [ERROR] Condor ended after 20653 seconds https://lhcathome.cern.ch/lhcathome/result.php?resultid=256603067 Any idea what could be the reason? ID: 41067 · Reply Quote
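
For anyone wanting to check how long a task ran before Condor gave up, here is a minimal sketch that pulls the seconds out of the "Condor ended after N seconds" lines quoted in this thread and converts them to hours. The function name and the sample text are only illustrative assumptions; the real stderr output of a task contains many other lines, and this does not diagnose the cause of the failure.

```python
import re

# Matches the Guest Log error line as it is quoted in this thread.
CONDOR_END = re.compile(r"\[ERROR\] Condor ended after (\d+) seconds")


def condor_runtime_hours(log_text):
    """Return the runtime in hours for each 'Condor ended' line found."""
    return [int(m.group(1)) / 3600 for m in CONDOR_END.finditer(log_text)]


# Sample text assembled from the two log lines posted above (hypothetical file content).
sample = (
    "Guest Log: [ERROR] Condor ended after 57868 seconds\n"
    "2019-12-24 19:57:17 (17992): Guest Log: [ERROR] Condor ended after 20653 seconds\n"
)

for hours in condor_runtime_hours(sample):
    print(f"{hours:.2f} h")
```

With the two values posted here, that works out to roughly 16 hours and a bit under 6 hours respectively.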

Aaron Send message Joined: 5 May 10 Posts: 10 Credit: 8,948,775 RAC: 8	Message 41423 - Posted: 29 Jan 2020, 20:17:06 UTC I recently started contributing to this project again and am having the same issue with the unknown error code. The CMS app is the only app running on my CPU right now and the error seems to happen frequently. ID: 41423 · Reply Quote

Magic Quantum Mechanic Send message Joined: 24 Oct 04 Posts: 1294 Credit: 95,326,344 RAC: 25,250	Message 41587 - Posted: 14 Feb 2020, 22:23:04 UTC https://lhcathome.cern.ch/lhcathome/result.php?resultid=263294061 Well, I got about 10 of these time wasters so far today, with 2 still running on this PC and maybe 5 total still running on the 24 cores and 56 GB RAM (and both I had running over at -dev did the same thing). I guess I won't get up early to do that again in the morning. ID: 41587 · Reply Quote

[AF>Le_Pommier] Jerome_C2005 Send message Joined: 12 Jul 11 Posts: 120 Credit: 1,451,119 RAC: 0	Message 41698 - Posted: 23 Feb 2020, 11:31:28 UTC Last modified: 23 Feb 2020, 11:31:43 UTC Hi, since CMS tasks are back I tried to give them a go on my iMac: all the tasks are ending in error. It is the "1 (0x00000001) Unknown error code". ID: 41698 · Reply Quote

NOGOOD Send message Joined: 18 Nov 17 Posts: 134 Credit: 59,120,336 RAC: 4,947	Message 41700 - Posted: 23 Feb 2020, 11:51:14 UTC - in response to Message 41698. Thank you for the information. So I'll keep running ATLAS and wait for stable CMS to come back. ID: 41700 · Reply Quote