Thread 'every task failing after about 12 minutes'

Author	Message
Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,657,123 RAC: 75,043	Message 30113 - Posted: 29 Apr 2017, 18:02:49 UTC
For the first time, I have now tried LHCb on the same system (Windows 10 64-bit) on which I have been successfully crunching CMS tasks for many weeks.
Excerpt from the stderr output:
2017-04-29 19:39:35 (9632): Guest Log: [INFO] LHCb application starting. Check log files.
2017-04-29 19:39:35 (9632): Guest Log: [DEBUG] HTCondor ping
2017-04-29 19:39:35 (9632): Guest Log: [DEBUG] 0
2017-04-29 19:50:16 (9632): Guest Log: [ERROR] Condor exited after 638s without running a job.
2017-04-29 19:50:16 (9632): Guest Log: [INFO] Shutting Down.
The complete stderr information can be seen here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=136912805
Can anyone tell me what's going wrong?
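A minimal sketch for spotting this failure signature in a saved stderr text, assuming only the log line format shown in the excerpt above. The function name `find_idle_condor_exit` is hypothetical and not part of BOINC or the LHCb application.

```python
import re

# Pattern copied from the [ERROR] line in the stderr excerpt above.
CONDOR_IDLE_EXIT = re.compile(
    r"\[ERROR\] Condor exited after (\d+)s without running a job"
)


def find_idle_condor_exit(stderr_text):
    """Return the seconds Condor waited before exiting without a job, or None.

    Hypothetical helper: it only matches the exact wording seen in the excerpt.
    """
    match = CONDOR_IDLE_EXIT.search(stderr_text)
    return int(match.group(1)) if match else None


# The last two lines of the excerpt, used as sample input.
SAMPLE = (
    "2017-04-29 19:50:16 (9632): Guest Log: [ERROR] Condor exited after 638s "
    "without running a job.\n"
    "2017-04-29 19:50:16 (9632): Guest Log: [INFO] Shutting Down.\n"
)

if __name__ == "__main__":
    print(find_idle_condor_exit(SAMPLE))
```

A result other than None means the VM started but never received a job from Condor, which is the situation discussed in the replies below.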

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,638,221 RAC: 106,310	Message 30114 - Posted: 29 Apr 2017, 19:58:35 UTC - in response to Message 30113.
This is typical for an empty task queue (as it is for all Condor-fed subprojects). It is better to uncheck this subproject until new tasks are available.
See also: https://lhcathomedev.cern.ch/lhcathome-dev/lhcb_job.php

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,657,123 RAC: 75,043	Message 30121 - Posted: 30 Apr 2017, 4:59:56 UTC - in response to Message 30114.
Quote: "This is typical for an empty task queue (as it is for all Condor-fed subprojects)."
Hm, I do not quite understand this. According to the Server Status page, there were plenty of tasks available; otherwise I would not have been able to download any, right? Also, once a downloaded task starts crunching, what difference does it make whether there are many more or none at all in the queue on the server?
I guess there must be some other reason for the failures. The way this looked reminded me of similar situations once in a while with ATLAS and also CMS jobs, which errored out shortly after they started, although there were enough available in the download queue.

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 946 Credit: 783,601,032 RAC: 161,244	Message 30123 - Posted: 30 Apr 2017, 6:39:11 UTC
The LHCb and BOINC task queues are not the same. The queue at LHCb is seen by the VM, and the queue here is seen by BOINC. I think the tasks in the BOINC queue are generated automatically, so there are always 100. The work at LHCb is submitted in batches by scientists; I don't think the scientists have a way to manage the BOINC queues.

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,657,123 RAC: 75,043	Message 30124 - Posted: 30 Apr 2017, 7:01:21 UTC
Toby, just for my understanding (obviously I am missing something): right now (like yesterday, when I downloaded some tasks which then errored out), the Server Status page shows 99 LHCb tasks: https://lhcathome.cern.ch/lhcathome/server_status.php
However, in reality this means that there are NO LHCb tasks available for crunching?

Toby Broom Volunteer moderator Joined: 27 Sep 08 Posts: 946 Credit: 783,601,032 RAC: 161,244	Message 30125 - Posted: 30 Apr 2017, 7:32:32 UTC
Yes, the queue of 100 on the Server Status page doesn't guarantee work. The link that computezrmle posted shows the work performed by people, so if this is low there is likely a problem for everyone.

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,657,123 RAC: 75,043	Message 30126 - Posted: 30 Apr 2017, 7:35:25 UTC - in response to Message 30125.
Okay, now I understand. But this means that the availability figures on the Server Status page are more or less irrelevant :-(

PHILIPPE Joined: 24 Jul 16 Posts: 88 Credit: 239,917 RAC: 0	Message 30128 - Posted: 30 Apr 2017, 9:32:29 UTC - in response to Message 30126. Last modified: 30 Apr 2017, 9:39:10 UTC
There may be solutions to this problem:
1°) It would be an improvement if a logical test on the BOINC queue were possible: "number of tasks in the BOINC queue always less than or equal to the number of tasks remaining in the scientific LHCb batch". Then no further LHCb tasks would be sent to volunteers while no scientific batch has actually been provided. It would also be easier for the people who manage the scientific task batch to be warned when the number drops below 75, for instance.
2°) Is it possible to use the same mechanism as the ATLAS queue? That system is not perfect, because in the past the batches of WUs were sometimes not enough for multicores on the former ATLAS site, but it is easier to understand.
...These are only suggestions; I don't know how difficult they would be to implement...
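The logical test in suggestion 1 can be sketched as follows. This is only an illustration of the proposed rule, not how the LHC@home server actually works: both function names are hypothetical, the target of 100 is taken from the queue size mentioned earlier in the thread, and the threshold of 75 is the example value from the suggestion.

```python
def boinc_tasks_to_create(boinc_queued, backend_jobs_remaining, target_queue=100):
    """Return how many new BOINC tasks may be created.

    Enforces the proposed rule: the BOINC queue never holds more tasks
    than there are jobs remaining in the scientific (backend) batch.
    All names and defaults here are assumptions for illustration.
    """
    allowed = min(target_queue, backend_jobs_remaining)
    return max(0, allowed - boinc_queued)


def backend_needs_refill(backend_jobs_remaining, warn_below=75):
    """Return True when the batch managers should be warned (assumed threshold)."""
    return backend_jobs_remaining < warn_below


if __name__ == "__main__":
    # Empty backend batch: no new BOINC tasks are created, and a warning is due.
    print(boinc_tasks_to_create(boinc_queued=99, backend_jobs_remaining=0))
    print(backend_needs_refill(0))
```

With an empty backend batch the sketch creates no new BOINC tasks, so volunteers would not download work that can only fail.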

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,638,221 RAC: 106,310	Message 30131 - Posted: 30 Apr 2017, 10:56:48 UTC - in response to Message 30128.
+1
Nearly the same idea as here. I put mine in a more general part of the board as it affects all VM-based subprojects.

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 161,657,123 RAC: 75,043	Message 30135 - Posted: 30 Apr 2017, 13:25:39 UTC
A similar situation is occurring with ATLAS tasks right now: the Project Status page shows some 330 available, but when trying to "fetch work", BOINC says "no tasks available for ATLAS simulation". So, in contrast to what happened with LHCb, with ATLAS BOINC now does not download any tasks (which would not work anyway). So there seems to be some inconsistency within LHC concerning how BOINC should handle non-existing tasks.

Luca Tomassetti Joined: 26 Apr 17 Posts: 7 Credit: 22,463 RAC: 0	Message 30204 - Posted: 4 May 2017, 12:32:48 UTC - in response to Message 30135.
Hi, yes, ATLAS and LHCb (and CMS) behave slightly differently. As far as I know, ATLAS has a specific subset of jobs to be processed by the community. LHCb submits to the VMs from the same 'queue' as for all other processing sites.
Anyway, the issue with the 206 error should be fixed now. Jobs have always been available, but the batch system behind the BOINC server was preventing submission to the VMs in some cases.
Please try running LHCb jobs now!