Thread 'CMS@Home difficulties in attempts to prepare for multi-core jobs'

Author	Message
maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 17,509	Message 50061 - Posted: 28 Apr 2024, 5:52:23 UTC - in response to Message 50057.
Run time 5 hours 58 min 38 sec, CPU time 14 hours 40 min 18 sec, Validate state Valid, Credit 3.56
Run time 3 hours 9 min 48 sec, CPU time 5 hours 46 min 57 sec, Validate state Valid, Credit 1.88
Run time 5 hours 50 min 56 sec, CPU time 14 hours 38 min 39 sec, Validate state Valid, Credit 3.63
Excuse me, but what??
Seeing the same. The CMS tasks themselves are more important. BOINC has always had this system error (is it an error?).

Dark Angel Joined: 7 Aug 11 Posts: 122 Credit: 34,229,089 RAC: 17,142	Message 50063 - Posted: 28 Apr 2024, 6:24:55 UTC Run time 13 hours 22 min 46 sec, CPU time 1 day 22 hours 56 min 57 sec, Validate state Valid, Credit 7.75. Excuse me CMS admin? A moment of your time? What the hell??
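To put the figures quoted so far on a common scale, here is a minimal Python sketch. It is not part of BOINC or LHC@home, and the helper names `to_seconds` and `summarize` are made up for illustration. It only divides the numbers posted above: CPU time by run time (average number of busy cores) and credit by CPU hours.

```python
# Figures copied from the results quoted in this thread.
# Each entry: (run time, CPU time, credit); times are (days, hours, minutes, seconds).
RESULTS = [
    ((0, 5, 58, 38), (0, 14, 40, 18), 3.56),
    ((0, 3, 9, 48), (0, 5, 46, 57), 1.88),
    ((0, 5, 50, 56), (0, 14, 38, 39), 3.63),
    ((0, 13, 22, 46), (1, 22, 56, 57), 7.75),
]


def to_seconds(days, hours, minutes, seconds):
    """Convert a days/hours/minutes/seconds tuple to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def summarize(run, cpu, credit):
    """Return (average busy cores, credit per CPU hour) for one result."""
    run_s = to_seconds(*run)
    cpu_s = to_seconds(*cpu)
    busy_cores = cpu_s / run_s                 # CPU seconds per wall-clock second
    credit_per_cpu_hour = credit / (cpu_s / 3600)
    return busy_cores, credit_per_cpu_hour


if __name__ == "__main__":
    for run, cpu, credit in RESULTS:
        busy, per_hour = summarize(run, cpu, credit)
        print(f"credit {credit:5.2f}: {busy:.2f} busy cores on average, "
              f"{per_hour:.3f} credit per CPU hour")
```

All four results come out at less than one credit per CPU hour, which is the discrepancy being complained about. The sketch says nothing about how BOINC arrives at these credits.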

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 17,509	Message 50065 - Posted: 28 Apr 2024, 6:35:10 UTC - in response to Message 50063. The CMS team is not responsible for these credit points. That is the BOINC system. You can search the forum folders; there are a lot of messages about it.

Dark Angel Joined: 7 Aug 11 Posts: 122 Credit: 34,229,089 RAC: 17,142	Message 50067 - Posted: 28 Apr 2024, 7:05:46 UTC - in response to Message 50065. That's a cop-out. Every other project is awarding points correctly. If it were not CMS related then Atlas and Theory would be showing the same issues. If it were VBox related then other VBox projects would show the same issues. They don't.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 304,196,659 RAC: 114,986	Message 50068 - Posted: 28 Apr 2024, 7:40:25 UTC - in response to Message 50067. Every now and then somebody complains about low credits (weird, nobody ever complains about too many credits, lol). The answer is always the same: Credit calculation is built into BOINC. LHC@home does not change the relevant parts of the code, as they also affect other things, e.g. work fetch calculation. To understand how it works, see: https://boinc.berkeley.edu/trac/wiki/CreditNew Change requests have to be made here: https://github.com/BOINC/boinc

Dark Angel Joined: 7 Aug 11 Posts: 122 Credit: 34,229,089 RAC: 17,142	Message 50069 - Posted: 28 Apr 2024, 8:49:23 UTC Last modified: 28 Apr 2024, 8:50:25 UTC So people are getting punished (host punishment, discussed on the linked page) for having to abort huge numbers of single core tasks that were never going to do actual work because the CMS team couldn't be arsed clearing out the work caches. People punished because you lot didn't do your jobs. And you're sneering down on people who don't appreciate being screwed around. Real nice. And the credit system used is a CHOICE made by each project. There are options. Bleating that the users need to complain elsewhere is another cop-out.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 304,196,659 RAC: 114,986	Message 50070 - Posted: 28 Apr 2024, 9:21:57 UTC - in response to Message 50069. Stop that kind of comment! You have been told the facts. Blaming people here for things you don't understand or accept is not respectful, nor does it solve your complaint.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 50105 - Posted: 2 May 2024, 12:29:15 UTC - in response to Message 50025. Since this morning we're getting tasks of a new batch created by Ivan. For me it seems that these are single core jobs. At least the first job this morning in that Virtual Machine was using 4 threads and this afternoon the second job uses only 1 thread - cmsRun 100%

maeax Joined: 2 May 07 Posts: 2304 Credit: 179,727,092 RAC: 17,509	Message 50107 - Posted: 3 May 2024, 8:16:23 UTC - in response to Message 50105. Last modified: 3 May 2024, 8:41:33 UTC 2024-05-03 10:06:33 (21200): Setting CPU Count for VM. (4) So it is 4-core for me. The differential .vdi uses 2 MByte of 20 GByte. After half an hour: "Running job output should appear here." No job is visible inside the task. The properties of the BOINC task show this: CPU time 01:01:30, CPU time since last checkpoint 00:52:12. It seems to work.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 304,196,659 RAC: 114,986	Message 50136 - Posted: 6 May 2024, 17:51:19 UTC @Ivan Please explain what's going on. The last CMS batch before the WMAgent update was a 4-core batch. Lots of volunteer machines are now configured to run 4-core jobs. First batch after the upgrade is a 1-core batch but 4-core VMs get only 2 jobs per VM. This results in 2 idle cores per VM that can't be given back to BOINC for other work.
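The arithmetic behind the idle-core complaint can be written out as a small Python sketch. The function name `idle_cores` and its parameters are hypothetical and only restate the numbers in the post above (4-core VM, 2 jobs per VM, 1 core per job); nothing here reflects how WMAgent or the CMS pilot actually schedules jobs.

```python
def idle_cores(vm_cores: int, jobs_per_vm: int, cores_per_job: int) -> int:
    """Cores allocated to the VM that no job is using (never negative)."""
    used = jobs_per_vm * cores_per_job
    return max(vm_cores - used, 0)


if __name__ == "__main__":
    vm_cores = 4
    # Situation described in the post: 1-core batch, 2 jobs per 4-core VM.
    idle = idle_cores(vm_cores, jobs_per_vm=2, cores_per_job=1)
    utilisation = (vm_cores - idle) / vm_cores
    print(f"{idle} idle cores, {utilisation:.0%} of the VM in use")
```

Because BOINC has reserved all four cores for the VM, the unused ones stay blocked for the lifetime of the task.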

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 304,196,659 RAC: 114,986	Message 50139 - Posted: 6 May 2024, 18:47:48 UTC Although BOINC shows valid tasks, CERN Grafana shows a 100% failure rate for the current CMS batch.

ivan Volunteer moderator Project tester Volunteer developer Volunteer tester Project scientist Joined: 29 Aug 05 Posts: 1158 Credit: 11,867,259 RAC: 7,384	Message 50142 - Posted: 7 May 2024, 13:30:44 UTC - in response to Message 50139. "Although BOINC shows valid tasks CERN Grafana shows 100% failure rate for the current CMS batch." Unfortunately, an incompatibility (maybe several) was introduced with the WMAgent upgrade. We're trying to understand it/them, so there may not be many jobs submitted until we get a handle on the changes. We've also been trying to understand whether it's possible to have single- and quad-core jobs in the queue simultaneously -- hence the number of small single-core workflows a couple of days ago. I think it's going to be hard to come to a consensus on this, but we are racking our brains...

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 304,196,659 RAC: 114,986	Message 50151 - Posted: 9 May 2024, 7:53:54 UTC - in response to Message 50139. "Although BOINC shows valid tasks CERN Grafana shows 100% failure rate for the current CMS batch." Still 100% failure rate. Even with the recent 4-core batch.

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 162,123,273 RAC: 86,746	Message 50193 - Posted: 17 May 2024, 13:35:45 UTC Last modified: 17 May 2024, 13:37:31 UTC Tasks now fail after about 23 minutes. Excerpt from stderr: <message> Die Platzhalterzeichen für Dateinamen (* oder ?) wurden falsch eingegeben, oder es wurden zu viele Platzhalterzeichen angegeben. (0xd0) - exit code 208 (0xd0)</message> [English: The global filename characters, * or ?, are entered incorrectly or too many global filename characters are specified.] https://lhcathome.cern.ch/lhcathome/result.php?resultid=411165871 I remember that we had exactly this kind of error several months ago, but I do not remember the exact reason. Grafana shows that this problem started at about 10:30 today.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 50194 - Posted: 17 May 2024, 14:02:58 UTC - in response to Message 50193. Probably no sub-tasks for CMS available, I suppose. Normally the number of unsent BOINC CMS-envelope tasks should return to zero, but we know that this does not always work perfectly.

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 162,123,273 RAC: 86,746	Message 50195 - Posted: 17 May 2024, 14:22:00 UTC - in response to Message 50194. Last modified: 17 May 2024, 14:26:05 UTC "Probably no sub-tasks for CMS available, I suppose." In case of no sub-tasks, the situation is a little different: the task also finishes after about 25 to 30 minutes, and even yields a few credit points. Plus, in the tasks list, the status does not say "computation error"; also, there is no error message in stderr like the one I cited. So I am sure the problem is a different one.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1311 Credit: 97,630,829 RAC: 106,145	Message 50199 - Posted: 17 May 2024, 20:01:51 UTC - in response to Message 50195. 30 of these as usual in the only hours I actually get to sleep https://lhcathome.cern.ch/lhcathome/result.php?resultid=411172457

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1556 Credit: 10,101,515 RAC: 1,464	Message 50201 - Posted: 18 May 2024, 7:38:31 UTC - in response to Message 50195. Last modified: 18 May 2024, 7:41:01 UTC "So I am sure the problem is a different one." OK..., but in the past no credits were given when the very first sub-job for a task did not make it to the VM.

FanzaFede Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 19 Jul 18 Posts: 7 Credit: 338,972 RAC: 0	Message 50202 - Posted: 18 May 2024, 9:07:24 UTC Hi all, yes, there is a problem with a configuration file we have modified, and the CMS pilot job doesn't allow the VMs to connect to the condor pool. We will let you know as soon as possible. Cheers and sorry, Federica

Erich56 Joined: 18 Dec 15 Posts: 1984 Credit: 162,123,273 RAC: 86,746	Message 50203 - Posted: 18 May 2024, 11:21:14 UTC Federica, thanks for the information. Wouldn't it make sense to stop sending tasks until the problem is solved?