Message boards : News : CMS@Home up again

ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 724
Credit: 5,688,626
RAC: 716
Message 41545 - Posted: 12 Feb 2020, 11:14:38 UTC

OK, jobs are available again. Sorry for the long delay. Remember, I'm only the front-man for a larger crew, so any downstream delays percolate up to my response. Hopefully this will remain good for some time, but I still don't understand why the condor server occasionally refuses to send out jobs in a timely manner.
ID: 41545
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 972
Credit: 41,078,544
RAC: 10,959
Message 41601 - Posted: 16 Feb 2020, 7:18:35 UTC - in response to Message 41545.  

Do you know what version of HTCondor is being used?
https://research.cs.wisc.edu/htcondor/
ID: 41601
Jim1348
Joined: 15 Nov 14
Posts: 470
Credit: 13,390,287
RAC: 13,481
Message 41603 - Posted: 16 Feb 2020, 15:16:28 UTC

All of my CMS just started erroring out. And it is not just the short ones with no work, but some have been running for 2 1/2 hours.
Something needs an upgrade again.
ID: 41603
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 724
Credit: 5,688,626
RAC: 716
Message 41634 - Posted: 19 Feb 2020, 13:57:19 UTC - in response to Message 41601.  

Do you know what version of HTCondor is being used?
https://research.cs.wisc.edu/htcondor/

If I do condor_q -v in the VM I get:
$CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
$CondorPlatform: x86_64_RedHat6 $

but that's not necessarily what's running on vocms0267.cern.ch. I've asked Federica to check for me.
ID: 41634
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 724
Credit: 5,688,626
RAC: 716
Message 41636 - Posted: 19 Feb 2020, 14:15:28 UTC - in response to Message 41634.  

Do you know what version of HTCondor is being used?
https://research.cs.wisc.edu/htcondor/

If I do condor_q -v in the VM I get:
$CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
$CondorPlatform: x86_64_RedHat6 $

but that's not necessarily what's running on vocms0267.cern.ch. I've asked Federica to check for me.

And the answer:
[drumroll]
cmst1@vocms0267:/data/srv/wmagent/current $ condor_version
$CondorVersion: 8.6.8 Nov 13 2017 BuildID: 424045 $
$CondorPlatform: x86_64_RedHat7 $
[/drumroll]
I wonder if that's optimal?...
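
For anyone who wants to compare the two banners by script rather than by eye, here is a minimal sketch in Python. The function name is made up for illustration; only the two banner strings are taken from the output above. It assumes the banner always has the form "$CondorVersion: X.Y.Z ...".

```python
import re
from typing import Tuple

# Matches the "X.Y.Z" part of a "$CondorVersion: ..." banner line.
BANNER = re.compile(r"\$CondorVersion:\s+(\d+)\.(\d+)\.(\d+)\b")


def parse_condor_version(banner: str) -> Tuple[int, ...]:
    """Return the version in a banner line as a tuple of integers."""
    match = BANNER.search(banner)
    if match is None:
        raise ValueError("not a $CondorVersion banner: %r" % banner)
    return tuple(int(part) for part in match.groups())


vm_banner = "$CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $"
server_banner = "$CondorVersion: 8.6.8 Nov 13 2017 BuildID: 424045 $"

vm = parse_condor_version(vm_banner)
server = parse_condor_version(server_banner)

# Integer tuples compare numerically, so 8.6.8 sorts before 8.6.10.
# Comparing the raw strings would wrongly rank "8.6.8" as the newer one.
print("server is older than the VM client:", server < vm)
```

Comparing as integer tuples is the point of the sketch: the server's 8.6.8 is two patch releases behind the 8.6.10 in the VM, which a plain string comparison would get backwards.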
ID: 41636
MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 972
Credit: 41,078,544
RAC: 10,959
Message 41651 - Posted: 19 Feb 2020, 20:39:29 UTC - in response to Message 41636.  
Last modified: 19 Feb 2020, 20:41:02 UTC

Do you know what version of HTCondor is being used?
https://research.cs.wisc.edu/htcondor/

If I do condor_q -v in the VM I get:
$CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
$CondorPlatform: x86_64_RedHat6 $

but that's not necessarily what's running on vocms0267.cern.ch. I've asked Federica to check for me.

And the answer:
[drumroll]
cmst1@vocms0267:/data/srv/wmagent/current $ condor_version
$CondorVersion: 8.6.8 Nov 13 2017 BuildID: 424045 $
$CondorPlatform: x86_64_RedHat7 $
[/drumroll]
I wonder if that's optimal?...


Well, thanks for checking that, Ivan. It sure is older than I expected; I thought they would keep that up to date on the server.
https://research.cs.wisc.edu/htcondor/downloads/
ID: 41651
Peter Hucker
Joined: 12 Aug 06
Posts: 216
Credit: 1,183,520
RAC: 3,318
Message 41686 - Posted: 22 Feb 2020, 16:07:44 UTC

Most of my CMS tasks are causing errors again. I'm assuming this is a CERN fault and not my doing. Please let me know if I can adjust anything at this end. I'm running the latest BOINC and VirtualBox under Windows 10.

And I'm still not getting ATLAS or Theory tasks, despite there being more of those showing as available on the server status page. I'm only being given CMS.
ID: 41686
Jim1348
Joined: 15 Nov 14
Posts: 470
Credit: 13,390,287
RAC: 13,481
Message 41687 - Posted: 22 Feb 2020, 16:18:49 UTC - in response to Message 41686.  

And I'm still not getting ATLAS or Theory tasks, despite there being more of those showing as available on the server status page. I'm only being given CMS.

Ivan explained this. When CMS goes out, it takes the other ones with it too.
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5298&postid=41543#41543

I think if the LHC staff were paid by the number of BOINC units run, they would think of another way of doing it.
ID: 41687
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1553
Credit: 89,531,815
RAC: 123,576
Message 41688 - Posted: 22 Feb 2020, 17:22:31 UTC - in response to Message 41687.  

Ivan wrote about quotas.
These are set for each app version, independently of the other app versions.
Just check your computer details page and follow the link to "Application details: Show".

This means if you have CMS and ATLAS enabled and CMS fails until your computer's quota is down to 0 then it can still get ATLAS (if available).


At the moment, CMS has stopped generating new subtasks so that the cause of the errors in the job submission chain can be found.
See:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5309&postid=41635
ID: 41688
Jim1348
Joined: 15 Nov 14
Posts: 470
Credit: 13,390,287
RAC: 13,481
Message 41689 - Posted: 22 Feb 2020, 17:58:11 UTC - in response to Message 41688.  
Last modified: 22 Feb 2020, 18:13:54 UTC

This means if you have CMS and ATLAS enabled and CMS fails until your computer's quota is down to 0 then it can still get ATLAS (if available).

Maybe I am just unlucky. But for some time (including the present), whenever CMS fails then I can't get more of anything else.
I am out of native ATLAS at the moment, even though they have all completed successfully.

EDIT: Running only native ATLAS (without CMS) usually works for a while. I detached a few hours ago, but reattached and will try again.
ID: 41689
Peter Hucker
Joined: 12 Aug 06
Posts: 216
Credit: 1,183,520
RAC: 3,318
Message 41691 - Posted: 22 Feb 2020, 20:08:50 UTC - in response to Message 41689.  

I'd like to know how LHC servers (and other projects) decide what subproject to give you. If I have them all selected, I could understand getting the one with the biggest queue, or maybe first in first out, but with LHC at the moment, I got given loads of CMS and no ATLAS or Theory, despite CMS having the fewest jobs available of the three. Maybe they prioritize one over the other? Maybe if very few people have CMS enabled, those that do just get CMS?
ID: 41691
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1553
Credit: 89,531,815
RAC: 123,576
Message 41692 - Posted: 22 Feb 2020, 20:58:59 UTC - in response to Message 41691.  

The search function might be your friend since this has been explained a couple of times in this message board.

The server fills its ready-to-send queue from a couple of upstream processes, each representing one of the subprojects.
Now the server's shared memory holds a list of "results" in random order.
In addition, large projects like LHC@home spread the load over a couple of servers which are contacted in random order (DNS-based load balancing).

Your client generates a request to get x seconds of work, and the server that answers your request will send you the first n "results" from its shared memory list.
- n is calculated based on the sum of the estimated runtimes.
- server side quotas will be respected.
- results from deselected subprojects will be skipped*).

Under certain circumstances this leads to a situation where one of the servers has no tasks (= results) from your active subprojects in its queue, and you will get a "no tasks available" message although the server status page shows lots of available tasks.


*) This might lead to the situation that the next client who has this subproject checked will get the skipped "results".
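
The selection described above can be sketched roughly as follows. This is only an illustration in Python, not the real BOINC scheduler code: the names Result, pick_results, enabled_apps and quota_left are made up, the quota is simplified to a per-subproject counter, and the queue order is taken as given instead of being shuffled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    app: str            # subproject, e.g. "CMS"
    est_runtime: float  # estimated runtime in seconds


def pick_results(queue, requested_seconds, enabled_apps, quota_left):
    """Walk the queue in order and return (sent, remaining).

    Results are sent until their estimated runtimes add up to the
    requested amount of work. Results from deselected subprojects, or
    from subprojects whose quota is used up, are skipped and stay queued.
    """
    sent, remaining = [], []
    total = 0.0
    quota = dict(quota_left)  # work on a copy, leave the caller's dict alone
    for result in queue:
        wanted = (
            total < requested_seconds
            and result.app in enabled_apps
            and quota.get(result.app, 0) > 0
        )
        if wanted:
            sent.append(result)
            total += result.est_runtime
            quota[result.app] -= 1
        else:
            remaining.append(result)
    return sent, remaining


# A server whose list happens to start with CMS results.
queue = [
    Result("CMS", 3600),
    Result("CMS", 3600),
    Result("Theory", 1800),
    Result("ATLAS", 7200),
]
all_apps = {"CMS", "Theory", "ATLAS"}
quota = {"CMS": 10, "Theory": 10, "ATLAS": 10}

# Client with everything enabled: the request is filled before the
# Theory and ATLAS results further down the list are reached.
sent, remaining = pick_results(queue, 5000, all_apps, quota)
print([r.app for r in sent])

# Client with only ATLAS enabled, talking to a server that holds only CMS.
cms_only = [Result("CMS", 3600), Result("CMS", 3600)]
sent_atlas, _ = pick_results(cms_only, 5000, {"ATLAS"}, quota)
print(sent_atlas)
```

In the first request the client gets only CMS although other subprojects are queued. In the second the client gets nothing, which corresponds to the "no tasks available" case even though tasks exist elsewhere.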
ID: 41692
Peter Hucker
Joined: 12 Aug 06
Posts: 216
Credit: 1,183,520
RAC: 3,318
Message 41693 - Posted: 22 Feb 2020, 21:07:11 UTC - in response to Message 41692.  

The search function might be your friend since this has been explained a couple of times in this message board.

The server fills its ready-to-send queue from a couple of upstream processes, each representing one of the subprojects.
Now the server's shared memory holds a list of "results" in random order.
In addition, large projects like LHC@home spread the load over a couple of servers which are contacted in random order (DNS-based load balancing).

Your client generates a request to get x seconds of work, and the server that answers your request will send you the first n "results" from its shared memory list.
- n is calculated based on the sum of the estimated runtimes.
- server side quotas will be respected.
- results from deselected subprojects will be skipped*).

Under certain circumstances this leads to a situation where one of the servers has no tasks (= results) from your active subprojects in its queue, and you will get a "no tasks available" message although the server status page shows lots of available tasks.


*) This might lead to the situation that the next client who has this subproject checked will get the skipped "results".


I suspect that last point is why I only get CMS. A lot of folk have probably turned CMS off due to the problems, and I can see there are fewer "users in last 24 hours" on the server status page. Hence CMS is probably the first 50 tasks in the queues. I ain't turning my CMS off. I don't care if I get no credit; if it helps them sort out the problems, my computer will try to do them. Some of them work.
ID: 41693
Harri Liljeroos
Joined: 28 Sep 04
Posts: 476
Credit: 25,246,301
RAC: 14,740
Message 41694 - Posted: 22 Feb 2020, 22:33:06 UTC - in response to Message 41693.  

Today I have been getting only Theory on my main cruncher. It has now been accepting work from all subprojects except SixTrack. Yesterday it had all subprojects selected and it got only SixTrack tasks. So no CMS for me, although it has also been selected for a couple of days now.
ID: 41694
Peter Hucker
Joined: 12 Aug 06
Posts: 216
Credit: 1,183,520
RAC: 3,318
Message 41695 - Posted: 22 Feb 2020, 22:43:29 UTC - in response to Message 41694.  

Today I have been getting only Theory on my main cruncher. It has now been accepting work from all subprojects except SixTrack. Yesterday it had all subprojects selected and it got only SixTrack tasks. So no CMS for me, although it has also been selected for a couple of days now.


Seems like the server is picking favourites :-)

SixTrack is very short of tasks; server status usually shows 0 available. One of my computers managed to grab 15 of them last night, but that's all. SixTrack is the only one that will work without a virtual machine, so anyone can do it, including mobile phones (which I have two of), and my three antique computers with very old processors and small RAM.
ID: 41695


©2020 CERN