Message boards : LHCb Application : every task failing after about 12 minutes
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 30113 - Posted: 29 Apr 2017, 18:02:49 UTC

I have now tried LHCb for the first time, on the same system (Windows 10 64-bit) on which I have been successfully crunching CMS tasks for many weeks. Every task fails after about 12 minutes.

Excerpt from the stderr output:

2017-04-29 19:39:35 (9632): Guest Log: [INFO] LHCb application starting. Check log files.
2017-04-29 19:39:35 (9632): Guest Log: [DEBUG] HTCondor ping
2017-04-29 19:39:35 (9632): Guest Log: [DEBUG] 0
2017-04-29 19:50:16 (9632): Guest Log: [ERROR] Condor exited after 638s without running a job.
2017-04-29 19:50:16 (9632): Guest Log: [INFO] Shutting Down.

The complete stderr output can be seen here:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=136912805

Can anyone tell me what's going wrong?
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,873,267
RAC: 38,830
Message 30114 - Posted: 29 Apr 2017, 19:58:35 UTC - in response to Message 30113.  

This is typical of an empty task queue (as it is for all Condor-fed subprojects).
It is better to uncheck this subproject until new tasks are available.
See also:
https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php
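
For anyone who wants to check their own results for this signature, here is a minimal sketch in Python. It assumes the stderr lines look exactly like the excerpt quoted above; the function name `find_idle_condor_exits` and the `SAMPLE` text are made up for illustration and are not part of BOINC or the LHCb application.

```python
import re

# Matches lines such as:
# 2017-04-29 19:50:16 (9632): Guest Log: [ERROR] Condor exited after 638s without running a job.
# The line format is an assumption taken from the excerpt in this thread.
IDLE_EXIT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): Guest Log: "
    r"\[ERROR\] Condor exited after (?P<secs>\d+)s without running a job\."
)


def find_idle_condor_exits(stderr_text):
    """Return (timestamp, seconds) for each 'exited without running a job' line."""
    hits = []
    for line in stderr_text.splitlines():
        match = IDLE_EXIT.match(line.strip())
        if match:
            hits.append((match.group("ts"), int(match.group("secs"))))
    return hits


# Sample input copied from the stderr excerpt posted above.
SAMPLE = """\
2017-04-29 19:39:35 (9632): Guest Log: [INFO] LHCb application starting. Check log files.
2017-04-29 19:39:35 (9632): Guest Log: [DEBUG] HTCondor ping
2017-04-29 19:39:35 (9632): Guest Log: [DEBUG] 0
2017-04-29 19:50:16 (9632): Guest Log: [ERROR] Condor exited after 638s without running a job.
2017-04-29 19:50:16 (9632): Guest Log: [INFO] Shutting Down.
"""

if __name__ == "__main__":
    for ts, secs in find_idle_condor_exits(SAMPLE):
        print(f"{ts}: VM waited {secs}s and received no job")
```

If the function returns a hit, the VM started but never received a job, which fits the empty-queue explanation rather than a problem on the volunteer's machine.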
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 30121 - Posted: 30 Apr 2017, 4:59:56 UTC - in response to Message 30114.  

computezrmle wrote: "This is typical for an empty task queue (as it is for all condor feeded subprojects)."

Hm, I do not quite understand this. According to the Server Status page, there were plenty of tasks available; otherwise I would not have been able to download any, right?
Also, once a downloaded task starts crunching, what difference does it make whether there are many more or none at all in the queue on the server?

I guess there must be some other reason for the failures. This reminded me of similar situations that occur once in a while with ATLAS and CMS jobs, which errored out a short time after they started, although enough were available in the download queue.
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 847
Credit: 691,750,093
RAC: 115,162
Message 30123 - Posted: 30 Apr 2017, 6:39:11 UTC

The LHCb and BOINC task queues are not the same.

The queue at LHCb is seen by the VM, and the queue here is seen by BOINC.

I think the ones at BOINC are generated automatically, so there are 100 all the time.

The work at LHCb is submitted in batches by the scientists.

I don't think the scientists have a way to manage the BOINC queues.
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 30124 - Posted: 30 Apr 2017, 7:01:21 UTC

Toby, just for my understanding (obviously I am missing something):

Right now (like yesterday, when I downloaded some tasks which then errored out), the Server Status page shows 99 LHCb tasks:
https://lhcathome.cern.ch/lhcathome/server_status.php

Does this mean that, in reality, there are NO LHCb tasks available for crunching?
Toby Broom
Volunteer moderator

Joined: 27 Sep 08
Posts: 847
Credit: 691,750,093
RAC: 115,162
Message 30125 - Posted: 30 Apr 2017, 7:32:32 UTC

Yes, the queue of 100 on the Server Status page doesn't guarantee work. The link that computezrmle posted shows the work actually performed by people, so if that is low, there is likely a problem for everyone.
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 30126 - Posted: 30 Apr 2017, 7:35:25 UTC - in response to Message 30125.  

Okay, now I understand. But this means that the availability figures on the Server Status page are more or less irrelevant :-(
PHILIPPE

Joined: 24 Jul 16
Posts: 88
Credit: 239,917
RAC: 0
Message 30128 - Posted: 30 Apr 2017, 9:32:29 UTC - in response to Message 30126.  
Last modified: 30 Apr 2017, 9:39:10 UTC

There may be solutions to this problem:

1°) It would be an improvement if a logical check on the BOINC queue were possible:

"the number of tasks in the BOINC queue is always less than or equal to the number of tasks remaining in the LHCb scientific batch".

Then no further LHCb tasks would be sent to volunteers as long as no scientific batch has actually been submitted. It would also be easier to warn the people who manage the scientific task batches when the number drops below 75, for instance.

2°) Is it possible to use the same mechanism as the ATLAS queue?
This system is not perfect, because in the past the batches of WUs were sometimes not large enough for multicore hosts on the former ATLAS site, but it is easier to understand.

...These are only suggestions; I don't know how difficult they would be to implement...
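
To make suggestion 1 concrete, here is a rough sketch in Python of the check. It is only an illustration under assumptions: the function names, the idea that both counts can be read by one script, and the target of 100 queued tasks (the figure mentioned earlier in the thread) are all hypothetical, and nothing here reflects how the LHC@home servers actually work.

```python
def tasks_to_release(boinc_queued, lhcb_jobs_remaining, target_queue=100):
    """Number of new BOINC tasks that may be created right now.

    Enforces: tasks in the BOINC queue <= jobs remaining in the LHCb batch,
    while never filling the BOINC queue beyond target_queue.
    """
    allowed_total = min(target_queue, lhcb_jobs_remaining)
    return max(0, allowed_total - boinc_queued)


def should_warn(lhcb_jobs_remaining, warn_below=75):
    """True when the batch managers should be told that work is running low."""
    return lhcb_jobs_remaining < warn_below


if __name__ == "__main__":
    # Hypothetical situation: 99 BOINC tasks queued, no LHCb jobs reachable.
    print(tasks_to_release(99, 0))   # no new BOINC tasks should be created
    print(should_warn(0))            # the batch managers should be warned
```

With these rules, a BOINC queue of 99 tasks and an LHCb batch reporting 0 remaining jobs would release no further tasks, and the warning would fire well before the batch ran dry.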
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,873,267
RAC: 38,830
Message 30131 - Posted: 30 Apr 2017, 10:56:48 UTC - in response to Message 30128.  

+1

Nearly the same idea as here.
I put mine in a more general part of the board, as it affects all VM-based subprojects.
Erich56

Joined: 18 Dec 15
Posts: 1814
Credit: 118,498,107
RAC: 30,817
Message 30135 - Posted: 30 Apr 2017, 13:25:39 UTC

A similar situation occurs with ATLAS tasks right now:
the Project Status page shows some 330 available, but when trying to "fetch work", BOINC says "no tasks available for ATLAS simulation".

So, in contrast to what happened with LHCb, BOINC does not download any ATLAS tasks now (which would not work anyway).

So there seems to be some inconsistency within LHC concerning how BOINC should handle nonexistent tasks.
Luca Tomassetti

Joined: 26 Apr 17
Posts: 7
Credit: 22,463
RAC: 0
Message 30204 - Posted: 4 May 2017, 12:32:48 UTC - in response to Message 30135.  

Hi,

Yes, ATLAS and LHCb (and CMS) behave slightly differently.

As far as I know, ATLAS has a specific subset of jobs to be processed by the community.
LHCb submits to the VMs from the same 'queue' as for all other processing sites.

Anyway, the issue with the 206 error should be fixed now.
Jobs have always been available, but the batch system behind the BOINC server was preventing submission to the VMs in some cases.

Please try to run LHCb jobs now!
