Message boards : CMS Application : EXIT_NO_SUB_TASKS
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,137
RAC: 3,956
Message 40606 - Posted: 23 Nov 2019, 0:47:41 UTC - in response to Message 40598.  

So far I only see one Valid, with a run time of 7 hours 32 min 42 sec.

I am starting up a new one just to see if it even gets to the HTCondor ping in under 13 minutes.

But if you still have those running, that pretty much means we may be back to running Valids again (I hope).
ID: 40606
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 40608 - Posted: 23 Nov 2019, 11:53:10 UTC

Oops! I'm getting an error trying to submit new jobs to the queues. Jobs will run out late tonight if we can't get it sorted. I'm setting my machines to No New Tasks to try to lessen the load.
ID: 40608
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 40609 - Posted: 23 Nov 2019, 13:20:25 UTC - in response to Message 40608.  

We're working on it -- it's a prolongation of Thursday night's problem. First attempt at using a different server failed -- by the looks of it because my public key is not registered on the production server. I've sent CERN the corresponding key.
ID: 40609
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 40610 - Posted: 23 Nov 2019, 13:41:21 UTC - in response to Message 40608.  

Ah, I had an over-ride switch in my submission script that still pointed to the testbed server. Changed that and the submission went through. Now I'm waiting to see if the workflow actually shows up on the production server.
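
For readers wondering how a leftover override can quietly redirect a submission, here is a minimal sketch. It is not the actual submission script: the server URLs, the `SUBMIT_SERVER_OVERRIDE` variable and the `resolve_server` helper are all hypothetical names made up for illustration. The idea is simply to report loudly whenever an override is in effect.

```python
import os

# Hypothetical endpoints; the real server names are not given in this thread.
PRODUCTION = "https://production.example.invalid"
TESTBED = "https://testbed.example.invalid"


def resolve_server(env=None):
    """Return (server, overridden) for a submission run.

    The override variable name is an assumption for this sketch.
    """
    env = os.environ if env is None else env
    override = env.get("SUBMIT_SERVER_OVERRIDE")
    if override:
        return override, True
    return PRODUCTION, False


# Simulate a stale override that still points at the testbed.
server, overridden = resolve_server({"SUBMIT_SERVER_OVERRIDE": TESTBED})
if overridden:
    print(f"WARNING: override active, submitting to {server}")
```

Printing the resolved target on every run makes a forgotten switch visible before the job is sent.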
ID: 40610
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 40611 - Posted: 23 Nov 2019, 14:22:46 UTC - in response to Message 40610.  

Ah, I had an over-ride switch in my submission script that still pointed to the testbed server. Changed that and the submission went through. Now I'm waiting to see if the workflow actually shows up on the production server.

OK, I see the workflow on the production server now but I may have to go home before it starts farming out any jobs. Digits cruciate...
ID: 40611
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,137
RAC: 3,956
Message 40626 - Posted: 24 Nov 2019, 4:59:43 UTC

Suspend Time Again

Failing here and over at -dev
ID: 40626
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,137
RAC: 3,956
Message 40627 - Posted: 24 Nov 2019, 11:34:08 UTC

No guarantee yet, but I started up a new batch and so far so good. I will check back in a few hours and see if they stay running this time (3:33am right now).
ID: 40627
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 40628 - Posted: 24 Nov 2019, 13:20:53 UTC

The new batch on the production server doesn't seem to have created any jobs, so obviously it's not sending any out. We are running down the existing queues on the testbed server at reduced efficiency (i.e. not every job request is met within ten minutes). I'll send out more messages, but it is Sunday...
ID: 40628
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,408,403
RAC: 102,477
Message 40649 - Posted: 25 Nov 2019, 11:22:33 UTC - in response to Message 40628.  

... I'll send out more messages, but it is Sunday...
Ivan, any news?
ID: 40649
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,408,403
RAC: 102,477
Message 40656 - Posted: 25 Nov 2019, 16:19:48 UTC

A few hours ago I took the risk and downloaded several CMS tasks, and then I left for a while.

After coming back about 3 hours later, I noticed that 3 of them had failed with 207 (0x000000CF) EXIT_NO_SUB_TASKS, and 2 of them had failed with 1 (0x00000001) Unknown error code after almost 3 hours: https://lhcathome.cern.ch/lhcathome/result.php?resultid=252857843
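
As a side note on reading these codes: the number in parentheses is the same exit code written as a zero-padded 32-bit hexadecimal value, so 207 and 0x000000CF are one and the same. A small sketch (the `format_exit_code` helper is a made-up name, not part of BOINC) that reproduces the notation:

```python
def format_exit_code(code: int) -> str:
    """Render an exit code as decimal plus 8-digit uppercase hex."""
    # Mask to 32 bits so negative codes also render as 8 hex digits.
    return f"{code} (0x{code & 0xFFFFFFFF:08X})"


for code in (207, 1):
    print(format_exit_code(code))
```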

Unfortunately, everything was a waste of CPU time :-(

What's going wrong with CMS?
ID: 40656
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,408,403
RAC: 102,477
Message 40661 - Posted: 25 Nov 2019, 19:02:59 UTC
Last modified: 25 Nov 2019, 19:03:41 UTC

Here are the next ones, which failed almost exactly 2:40 hrs after start with 1 (0x00000001) Unknown error code:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=252855754
https://lhcathome.cern.ch/lhcathome/result.php?resultid=252858062

Why has CMS become so unstable lately?
ID: 40661
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,137
RAC: 3,956
Message 40672 - Posted: 26 Nov 2019, 0:11:53 UTC - in response to Message 40661.  

Yeah, don't waste time running these, Erich, until we get the server end of this taken care of.
I tested one and it ran 5 hours and then crashed.
These CMS VB tasks are famous for tricking us into thinking they will run Valids and then this happens.

Just wait for the *Ivan Report*; I see the CERN Service Portal has not been updated either.

(I did expect them to be up and running here today)
ID: 40672
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 40674 - Posted: 26 Nov 2019, 9:30:22 UTC - in response to Message 40672.  

Apparently a database copy went wrong, complicated by the network problems Thursday night, people working in different time-zones, and the weekend -- plus some lack of communication which led to effort being wasted over the weekend. I'm still waiting on further information from the service ticket (unfortunately this one is CERN internal so most of you won't be able to read it).
ID: 40674
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,408,403
RAC: 102,477
Message 40677 - Posted: 26 Nov 2019, 12:57:18 UTC - in response to Message 40674.  

Thanks, Ivan, for the information.
So I/we will wait until further word from you.
ID: 40677
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 40687 - Posted: 26 Nov 2019, 17:05:19 UTC

Probably need to wait a bit longer. Some difficulties were ironed out today, but not all. I have to go home soon; the North American contingent may sort it out overnight.
ID: 40687
Jim1348
Joined: 15 Nov 14
Posts: 602
Credit: 24,371,321
RAC: 0
Message 40688 - Posted: 26 Nov 2019, 17:09:01 UTC - in response to Message 40687.  

...the North American contingent may sort it out overnight.

Our Thanksgiving is coming up Thursday. Everyone will be off Friday. Good luck.
ID: 40688
Erich56
Joined: 18 Dec 15
Posts: 1686
Credit: 100,408,403
RAC: 102,477
Message 40689 - Posted: 26 Nov 2019, 17:58:19 UTC - in response to Message 40687.  

the North American contingent may sort it out overnight.
so let's keep our fingers crossed :-)
ID: 40689
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 998
Credit: 6,264,307
RAC: 71
Message 40699 - Posted: 27 Nov 2019, 8:15:09 UTC

Big problem, I'm afraid -- the database tables appear to be empty!
The problem cannot be fixed quickly. If the tables are empty now, the only option is to repeat the import. This will take a few days, as the amount of data to be copied is huge and the tables do not have partitions, so it is impossible to parallelise the work.
Also the IO subsystem for the integration databases is not as fast as the production databases.....
An alternative to all this import/export would be needed in the near future....

...and Thursday and Friday are US holidays...
ID: 40699
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,964,869
RAC: 136,676
Message 40701 - Posted: 27 Nov 2019, 8:26:56 UTC - in response to Message 40699.  
Last modified: 27 Nov 2019, 8:29:38 UTC

Oops!
Calm down and don't forget to breathe.
... and Happy Thanksgiving ...

<edit>
Forgot to ask:
Can the CMS tasks be stopped at the BOINC server?
</edit>
ID: 40701
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,503,137
RAC: 3,956
Message 40702 - Posted: 27 Nov 2019, 8:30:35 UTC

Thanks Ivan, now I can unplug my satellite modem for the night so I don't lose the only high-speed I have left for the month.

And yes Happy Thanksgiving (eating turkey and watching 9+ hours of NFL football)
ID: 40702

©2024 CERN