Message boards : Theory Application : jobs is empty

Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 36802 - Posted: 21 Sep 2018, 16:03:18 UTC
Last modified: 21 Sep 2018, 16:05:58 UTC

Looking at MCPlots, it has hit 0, contributed CPU time is also 0, and MCPlots is spending no time generating new jobs. I have seen a post about a server issue and a backup server replacement, but I wonder if a lack of new jobs is the cause of this?
Hosts are running and getting some jobs, but they may not run full time and sit idle. I have checked a few tasks and the sum of jobs done is different.

Could a project admin announce whether new jobs will be released to fill the tasks that are sent out?

Should we suspend the tasks until the batch system is back to normal? Any info and guidelines for users would be appreciated.
ID: 36802
Greger

Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 36820 - Posted: 22 Sep 2018, 19:50:51 UTC

Looks like we are dry now

Exit status 207 (0x000000CF) EXIT_NO_SUB_TASKS
ID: 36820
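
The decimal and hexadecimal forms in the error line above are the same number, which helps when a log shows only one of them. A minimal Python sketch of the conversion follows. The lookup table is hypothetical and holds only the one code quoted in this thread; it is not a full list of BOINC or wrapper exit codes.

```python
# Hypothetical lookup table: only the single code quoted in this thread.
KNOWN_EXIT_CODES = {207: "EXIT_NO_SUB_TASKS"}


def describe_exit_status(status: int) -> str:
    """Format an exit status in the style of the task log line above."""
    name = KNOWN_EXIT_CODES.get(status, "unknown")
    # Zero-padded, upper-case, 32-bit hex, e.g. 207 -> 0x000000CF
    return f"{status} (0x{status & 0xFFFFFFFF:08X}) {name}"


if __name__ == "__main__":
    print("Exit status", describe_exit_status(207))
```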
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,341,521
RAC: 101,719
Message 36823 - Posted: 22 Sep 2018, 20:32:04 UTC - in response to Message 36820.  

Looks like we are dry now
Indeed, the Project Status Page shows "0" for unsent tasks.
ID: 36823
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 1
Message 36824 - Posted: 22 Sep 2018, 21:14:39 UTC
Last modified: 22 Sep 2018, 21:35:23 UTC

New tasks are ending after 20 mins or so with the No_subtasks error, yet strangely, older tasks that made their connection before this problem started ARE able to pick up new work.

No_subtasks over at -dev too.

MCPlots still broken.
ID: 36824
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36826 - Posted: 22 Sep 2018, 21:57:40 UTC - in response to Message 36824.  

Same error here too... No subtasks.
Is it a coincidence that CMS tasks are getting the same error?
I have LHCb tasks running that are getting subtasks, but I think they were downloaded before the troubles started.
ID: 36826
Harri Liljeroos
Joined: 28 Sep 04
Posts: 674
Credit: 43,150,010
RAC: 16,024
Message 36827 - Posted: 22 Sep 2018, 22:10:48 UTC - in response to Message 36826.  

I always thought that the subtasks (jobs) are something that is downloaded inside VirtualBox while the task is running. This is why you need a constant connection to the different servers at CERN. I think this applies to all VM tasks. It would be much more fault tolerant and convenient (to us, the crunchers) if everything necessary was downloaded when BOINC downloads the task and results were uploaded back when everything was finished.
ID: 36827
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1114
Credit: 49,501,728
RAC: 4,157
Message 36828 - Posted: 23 Sep 2018, 0:39:46 UTC - in response to Message 36802.  

You should just suspend all your Theory tasks if you have any left, since they will just run for 25 mins and then end with the usual computer error (aka server error).

And you could run LHCb tasks if you want to download the 940.47 MB vdi.
ID: 36828
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36829 - Posted: 23 Sep 2018, 1:45:05 UTC - in response to Message 36827.  

I always thought that the subtasks (jobs) are something that is downloaded inside VirtualBox while the task is running. This is why you need a constant connection to the different servers at CERN. I think this applies to all VM tasks.

LHCb tasks are still getting jobs, but if I understand correctly they run under pilot, not Condor.

It would be much more fault tolerant and convenient (to us, the crunchers) if everything necessary was downloaded when BOINC downloads the task and results were uploaded back when everything was finished.

That would require putting result files in the ../slots/#/shared folder, where they could be tampered with. Then they would maybe need to run 2 iterations of every task to verify results. Hiding results in the VM makes it more secure.
Also, there may be thousands of non-BOINC hosts crunching Theory, LHCb and CMS native apps under Condor. Keeping everything in a VM means users don't have to compile and install CVMFS, Singularity, etc., because that's all built into the VM. And when the result files are ready, they conveniently go to the same destination (Condor or pilot) that non-BOINC hosts submit their results to.
ID: 36829
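
The "run 2 iterations of every task" idea in the post above can be sketched as a pairwise check: a result counts only if two independent runs agree. This is a minimal illustration in Python, not the project's validator. The function names are hypothetical, and it assumes both runs get the same inputs and random seed, so their output is byte-identical.

```python
import hashlib


def digest(result_bytes: bytes) -> str:
    """SHA-256 fingerprint of one result file's contents."""
    return hashlib.sha256(result_bytes).hexdigest()


def validate_pair(result_a: bytes, result_b: bytes) -> bool:
    """Accept a task only if two independent runs produced identical output.

    Assumption: output is deterministic for a given input and seed.
    """
    return digest(result_a) == digest(result_b)


if __name__ == "__main__":
    honest = b"events=1000 xsec=1.234"
    tampered = b"events=1000 xsec=9.999"
    print(validate_pair(honest, honest))    # both hosts agree
    print(validate_pair(honest, tampered))  # one file was altered
```

The cost is the one the post points out: every task is computed twice.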
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,084,917
RAC: 104,527
Message 36831 - Posted: 23 Sep 2018, 5:23:06 UTC

This task finished successfully last night:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=207036080
MCProd says 100% lost ratio.
ID: 36831
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,341,521
RAC: 101,719
Message 36833 - Posted: 23 Sep 2018, 8:16:54 UTC

What I do not understand is why new tasks are made available for download (as seen from the Project Status Page) as long as there are no jobs available in the background.

Some automatic mechanism should be established to stop creation of new tasks whenever job creation fails.
ID: 36833
bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36835 - Posted: 23 Sep 2018, 13:17:20 UTC - in response to Message 36833.  

What I do not understand is why new tasks are made available for download (as seen from the Project Status Page) as long as there are no jobs available in the background.

Some automatic mechanism should be established to stop creation of new tasks whenever job creation fails.

Maybe such a mechanism already exists. Maybe it broke under circumstances they never anticipated. Remember... they're just physicists and IT pros, not rocket scientists.
ID: 36835
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,341,521
RAC: 101,719
Message 36851 - Posted: 24 Sep 2018, 10:25:27 UTC

Could anyone from LHC give us information as to when jobs will be available again for the tasks that can still be downloaded?
ID: 36851
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 36852 - Posted: 24 Sep 2018, 12:02:48 UTC - in response to Message 36851.  

I am investigating.
ID: 36852
Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 36853 - Posted: 24 Sep 2018, 12:50:22 UTC - in response to Message 36852.  

Due to a blockage with results being moved back to the mcplots server, the server only gave out a trickle of new jobs. This resulted in the queue emptying quickly, but not staying empty for long enough that tasks would stop being created. It has now been unblocked and jobs should start flowing again shortly.
ID: 36853
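
The explanation above suggests task creation only stops once the job queue has stayed empty for some time. A minimal Python sketch of why a trickle of jobs defeats such a rule follows. The rule itself, the function name, the `grace` parameter and the tick values are all assumptions for illustration, not the project's actual logic.

```python
def creation_paused(queue_lengths, grace):
    """Return, per tick, whether task creation would be paused.

    Hypothetical rule: pause only after the job queue has been empty
    for `grace` consecutive ticks.
    """
    paused = []
    empty_streak = 0
    for length in queue_lengths:
        # Any non-empty tick resets the streak of empty observations.
        empty_streak = empty_streak + 1 if length == 0 else 0
        paused.append(empty_streak >= grace)
    return paused


if __name__ == "__main__":
    # Trickle: one job shows up every third tick, so the queue is
    # never empty for three ticks in a row.
    trickle = [0, 0, 1, 0, 0, 1, 0, 0, 1]
    # Full outage: no jobs at all.
    outage = [0] * 9
    print(any(creation_paused(trickle, grace=3)))
    print(any(creation_paused(outage, grace=3)))
```

Under this assumed rule the trickle never trips the pause, so tasks keep being created even though almost none of them find a job, while a complete outage trips it on the third tick.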
Erich56

Send message
Joined: 18 Dec 15
Posts: 1686
Credit: 100,341,521
RAC: 101,719
Message 36854 - Posted: 24 Sep 2018, 15:35:21 UTC

Many thanks, Laurence, for investigating and re-activating the jobs delivery.
From what I can see so far, it seems to work now.

However, the Project Status Page is showing a continuous drop in the number of "unsent" Theory tasks (right now, it's at 92). Which may mean that now, or soon, we have jobs, but no tasks :-)
ID: 36854
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,341,521
RAC: 101,719
Message 36862 - Posted: 24 Sep 2018, 18:44:13 UTC - in response to Message 36854.  

However, the Project Status Page is showing a continuous drop in the number of "unsent" Theory tasks (right now, it's at 92). Which may mean that now, or soon, we have jobs, but no tasks :-)
Right now the number of "unsent" tasks is rising :-) so all looks good!
ID: 36862

©2024 CERN