Message boards : LHCb Application : every task failing after about 12 minutes
Joined: 18 Dec 15 Posts: 1814 Credit: 118,498,107 RAC: 30,817
For the first time, I have now tried LHCb on the same system (Windows 10 64-bit) on which I have been successfully crunching CMS tasks for many weeks. Excerpt from the stderr output:

2017-04-29 19:39:35 (9632): Guest Log: [INFO] LHCb application starting. Check log files.
2017-04-29 19:39:35 (9632): Guest Log: [DEBUG] HTCondor ping
2017-04-29 19:39:35 (9632): Guest Log: [DEBUG] 0
2017-04-29 19:50:16 (9632): Guest Log: [ERROR] Condor exited after 638s without running a job.
2017-04-29 19:50:16 (9632): Guest Log: [INFO] Shutting Down.

The complete stderr information can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=136912805
Can anyone tell me what's going wrong?
Joined: 15 Jun 08 Posts: 2534 Credit: 253,873,267 RAC: 38,830
This is typical for an empty task queue (as it is for all Condor-fed subprojects). It is better to uncheck this subproject until new tasks are available. See also: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php
Joined: 18 Dec 15 Posts: 1814 Credit: 118,498,107 RAC: 30,817
"This is typical for an empty task queue (as it is for all Condor-fed subprojects)."
Hm, this I do not quite understand. According to the Server Status page, there were plenty of tasks available, otherwise I would not have been able to download any, right? Also, once a downloaded task starts crunching, what difference does it make whether there are many more or none at all left in the queue on the server? I guess there must be some other reason for the failures. The way this looked reminded me of similar situations that occur once in a while with ATLAS and CMS jobs, which error out a short time after they are started, even though there are enough tasks available in the download queue.
Joined: 27 Sep 08 Posts: 847 Credit: 691,750,093 RAC: 115,162
The LHCb and BOINC task queues are not the same. The queue at LHCb is the one seen by the VM, and the queue here is the one seen by BOINC. I think the one(s) at BOINC are generated automatically, so there are 100 all the time. The work at LHCb is submitted in batches by the scientists, and I don't think the scientists have a way to manage the BOINC queues.
Joined: 18 Dec 15 Posts: 1814 Credit: 118,498,107 RAC: 30,817
Toby, just for my understanding (obviously I am missing something): right now (like yesterday, when I downloaded some tasks which then errored out), the Server Status page shows 99 LHCb tasks: https://lhcathome.cern.ch/lhcathome/server_status.php
However, in reality this means that there are NO LHCb tasks available for crunching?
Joined: 27 Sep 08 Posts: 847 Credit: 691,750,093 RAC: 115,162
Yes, the queue of 100 on the Server Status page doesn't guarantee work. The link that computermizzel posted shows the jobs actually being run by volunteers, so if that number is low, there is likely a problem for everyone.
Joined: 18 Dec 15 Posts: 1814 Credit: 118,498,107 RAC: 30,817
Okay, I understand now. But this means that the availability figures on the Server Status page are more or less irrelevant :-(
Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0
Maybe there are solutions to this problem:

1) It would be an improvement if a logical check on the BOINC queue were possible: the number of tasks in the BOINC queue should always be less than or equal to the number of jobs remaining in the LHCb scientific batch. Then no further LHCb tasks would be sent to volunteers while no scientific batch is actually available, and the people who manage the scientific batches could be warned when the number drops below, say, 75. (See the sketch after this post.)

2) Would it be possible to use the same mechanism as the ATLAS queue? That system is not perfect, because in the past the batches of WUs were sometimes not large enough for multi-core hosts on the former ATLAS site, but it is easier to understand.

...These are only suggestions; I don't know how difficult they would be to implement...
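Purely to illustrate the check proposed in point 1, here is a minimal, hypothetical sketch. None of the function names or numbers below come from the real BOINC or LHCb infrastructure; they are placeholders assumed for the example.

```python
# Hypothetical sketch of the queue check suggested above: only keep as many
# BOINC placeholder tasks as there are real jobs left in the LHCb batch,
# and warn the batch managers when the batch runs low.
# All helpers and numbers are illustrative, not the real project code.

WARN_THRESHOLD = 75  # warn when fewer jobs than this remain in the batch


def count_boinc_queue() -> int:
    """Unsent LHCb tasks in the BOINC queue (placeholder value)."""
    return 99


def count_batch_jobs() -> int:
    """Jobs remaining in the LHCb scientific batch (placeholder value)."""
    return 0


def tasks_to_generate() -> int:
    """How many new BOINC tasks a feeder-like check would allow right now."""
    boinc_tasks = count_boinc_queue()
    batch_jobs = count_batch_jobs()

    if batch_jobs < WARN_THRESHOLD:
        print(f"Warning: only {batch_jobs} jobs left in the LHCb batch "
              "- notify the batch managers")

    # Invariant from point 1: BOINC queue size <= remaining batch jobs,
    # so volunteers never receive tasks that have no real job behind them.
    return max(0, batch_jobs - boinc_tasks)


if __name__ == "__main__":
    print(f"Tasks that may still be created: {tasks_to_generate()}")
```

With the placeholder values above (99 queued BOINC tasks, 0 batch jobs), the sketch would create no new tasks and would print the warning, which matches the situation described at the start of this thread.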
Joined: 15 Jun 08 Posts: 2534 Credit: 253,873,267 RAC: 38,830
+1 Nearly the same idea as here. I put mine in a more general part of the board, as it affects all VM-based subprojects.
Joined: 18 Dec 15 Posts: 1814 Credit: 118,498,107 RAC: 30,817
A similar situation occurs with ATLAS tasks right now: the Project Status page shows some 330 available, but when trying to fetch work, BOINC says "no tasks available for ATLAS simulation". So, in contrast to what happened with LHCb, BOINC does not download any ATLAS tasks now (which would not work anyway). There seems to be some inconsistency within LHC@home concerning how BOINC should handle non-existent tasks.
Joined: 26 Apr 17 Posts: 7 Credit: 22,463 RAC: 0
Hi, yes, ATLAS and LHCb (and CMS) behave in slightly different ways. As far as I know, ATLAS has a specific subset of jobs to be processed by the community, while LHCb submits to the VMs from the same 'queue' as for all other processing sites. Anyway, the issue with the 206 error should be fixed now. Jobs have always been available, but the batch system behind the BOINC server was preventing submission to the VMs in some cases. Please try to run LHCb jobs now!