Message boards : CMS Application : EXIT_INIT_FAILURE after several jobs already finished

Harri Liljeroos
Joined: 28 Sep 04
Posts: 435
Credit: 23,074,733
RAC: 14,562
Message 29203 - Posted: 12 Mar 2017, 10:02:02 UTC

See this task: https://lhcathome.cern.ch/lhcathome/result.php?resultid=124693777

It had already finished several jobs while running (about 8 h of run time), but then it suddenly ended with this error. How are the different jobs related to each other? Do they form a single entity that has to be completed for the task to succeed and earn credit? Or should the credit be based on the number of finished jobs rather than just the run time or a single error?

If the finished jobs are useful, then the credit should be counted based on them.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1443
Credit: 76,581,814
RAC: 100,004
Message 29204 - Posted: 12 Mar 2017, 10:21:23 UTC - in response to Message 29203.  

Your WUs need at least 1 successfully finished job to get any credit.
The following message shows that your WU did not finish a job:
VM Completion Message: Condor exited after 45842s without running a job.

The runtime (45842s) is unusually long and should be investigated by the developers.
Maybe it is related to the stageout problems I reported here.
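
To put the runtime in perspective, here is a minimal sketch that pulls the number of seconds out of the completion message and converts it to hours. The pattern is modelled only on the one message quoted above; the function name `idle_runtime_hours` is made up for illustration, and other wordings of the message are not handled.

```python
import re

# Pattern based solely on the single completion message quoted in this thread.
# Assumption: other VM completion messages may be worded differently.
COMPLETION_RE = re.compile(r"Condor exited after (\d+)s without running a job")


def idle_runtime_hours(message):
    """Return the reported runtime in hours, or None if the message does not match."""
    match = COMPLETION_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)) / 3600


msg = "VM Completion Message: Condor exited after 45842s without running a job."
hours = idle_runtime_hours(msg)
print(f"{hours:.1f} h")  # prints "12.7 h"
```

So the 45842 s in the message correspond to roughly 12.7 hours.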
ID: 29204 · Report as offensive     Reply Quote
Harri Liljeroos
Joined: 28 Sep 04
Posts: 435
Credit: 23,074,733
RAC: 14,562
Message 29207 - Posted: 12 Mar 2017, 12:40:06 UTC - in response to Message 29204.  
Last modified: 12 Mar 2017, 12:40:35 UTC

Yes, I see it now after looking more closely at the stderr. The task is paused many times because Einstein GPU tasks require more CPU than the Seti tasks my GPUs are mostly running. After the CMS task restarts, the "job finished" line never appears; only a "New job starting" line shows up some time after the restart.

On some successful tasks I can sometimes see the "job finished" line, but not after every pause and restart.
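
A rough way to check this without reading the whole stderr by eye is to count the two kinds of lines. The sketch below is only an illustration: the marker strings are the phrases quoted above, not the exact log wording (which may differ), and the sample text is invented to show the idea, not copied from a real task.

```python
def count_job_markers(stderr_text,
                      start_marker="New job starting",
                      finish_marker="job finished"):
    """Count lines containing the start and finish markers (case-insensitive).

    Assumption: the markers are the phrases quoted in this thread;
    adjust them to whatever the real stderr actually prints.
    """
    starts = 0
    finishes = 0
    for line in stderr_text.splitlines():
        lowered = line.lower()
        if start_marker.lower() in lowered:
            starts += 1
        elif finish_marker.lower() in lowered:
            finishes += 1
    return starts, finishes


# Hypothetical sample, NOT real task output.
sample = """New job starting
job finished
task paused
task resumed
New job starting
"""

starts, finishes = count_job_markers(sample)
print(f"started: {starts}, finished: {finishes}")  # prints "started: 2, finished: 1"
```

If the number of started jobs is clearly higher than the number of finished ones, that matches what I see after the pauses.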

I have now reduced the amount of CPU reserved for Einstein GPU tasks and also limited CMS to 1 concurrently running task. I hope the developers find a solution to this behaviour of jobs not finishing after a task pause.