CMS@Home difficulties in attempts to prepare for multi-core jobs
Author | Message |
---|---|
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 677 |
Run time 5 hours 58 min 38 sec. I am seeing the same. The CMS tasks are more important. BOINC has always had this system error (is it an error?). |
Send message Joined: 7 Aug 11 Posts: 104 Credit: 25,221,969 RAC: 25,711 |
Run time: 13 hours 22 min 46 sec; CPU time: 1 day 22 hours 56 min 57 sec; Validate state: Valid; Credit: 7.75. Excuse me, CMS admin? A moment of your time? What the hell?? |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 677 |
The CMS team is not responsible for these credit points; the BOINC system is. You can search the message boards, where there are a lot of messages about it. |
Send message Joined: 7 Aug 11 Posts: 104 Credit: 25,221,969 RAC: 25,711 |
That's a cop-out. Every other project is awarding points correctly. If it were not CMS related then Atlas and Theory would be showing the same issues. If it were VBox related then other VBox projects would show the same issues. They don't. |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545 |
Every now and then somebody complains about low credits (weird, nobody ever complains about too many credits, lol). The answer is always the same: credit calculation is built into BOINC. LHC@home does not change the relevant code parts as they also affect other things, e.g. work fetch calculation. To understand how it works, see: https://boinc.berkeley.edu/trac/wiki/CreditNew Change requests have to be made here: https://github.com/BOINC/boinc |
Send message Joined: 7 Aug 11 Posts: 104 Credit: 25,221,969 RAC: 25,711 |
So people are getting punished (host punishment, discussed on the linked page) for having to abort huge numbers of single core tasks that were never going to do actual work because the CMS team couldn't be arsed clearing out the work caches. People punished because you lot didn't do your jobs. And you're sneering down on people who don't appreciate being screwed around. Real nice. And the credit system used is a CHOICE made by each project. There are options. Bleating that the users need to complain elsewhere is another cop-out. |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545 |
Stop making that kind of comment! You have been told the facts. Blaming people here for things you don't understand or accept is not respectful, nor does it solve your complaint. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,882 |
Since this morning we're getting tasks of a new batch created by Ivan. For me it seems that these are single core jobs. At least the first job this morning in that Virtual Machine was using 4 threads and this afternoon the second job uses only 1 thread - cmsRun 100% |
Send message Joined: 2 May 07 Posts: 2244 Credit: 173,902,375 RAC: 677 |
2024-05-03 10:06:33 (21200): Setting CPU Count for VM. (4) For me it is 4-core. The differencing .vdi uses 2 MByte of 20 GByte. After half an hour: "Running job output should appear here." No job is visible inside the task. The properties of the BOINC task show this: CPU time 01:01:30, CPU time since last checkpoint 00:52:12. It seems to work. |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545 |
@Ivan Please explain what's going on. The last CMS batch before the WMAgent update was a 4-core batch. Lots of volunteer machines are now configured to run 4-core jobs. First batch after the upgrade is a 1-core batch but 4-core VMs get only 2 jobs per VM. This results in 2 idle cores per VM that can't be given back to BOINC for other work. |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545 |
Although BOINC shows valid tasks CERN Grafana shows 100% failure rate for the current CMS batch. |
Send message Joined: 29 Aug 05 Posts: 1061 Credit: 7,737,455 RAC: 443 |
"Although BOINC shows valid tasks CERN Grafana shows 100% failure rate for the current CMS batch." Unfortunately, there has been an incompatibility (maybe several) introduced with the WMAgent upgrade. We're trying to understand it/them so there may not be many jobs submitted until we get a handle on the changes. We've also been trying to understand whether it's possible to have single- and quad-core jobs in the queue simultaneously -- hence the number of small single-core workflows a couple of days ago. I think it's going to be hard to come to a consensus on this, but we are racking our brains... |
Send message Joined: 15 Jun 08 Posts: 2541 Credit: 254,608,838 RAC: 56,545 |
"Although BOINC shows valid tasks CERN Grafana shows 100% failure rate for the current CMS batch." Still 100% failure rate. Even with the recent 4-core batch. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,922,069 RAC: 33,209 |
Tasks now fail after about 23 minutes; excerpt from stderr: <message> The wildcard characters for file names (* or ?) were entered incorrectly, or too many wildcard characters were specified. (0xd0) - exit code 208 (0xd0)</message> https://lhcathome.cern.ch/lhcathome/result.php?resultid=411165871 I remember that we had exactly this kind of error several months ago, but I do not remember what the exact reason was. Grafana shows that this problem started at about 10:30 today. |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,882 |
Probably no sub-tasks for CMS available, I suppose. Normally the unsent number of BOINC CMS-envelope tasks should return to zero, but we know that it's not always working perfectly. |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,922,069 RAC: 33,209 |
"Probably no sub-tasks for CMS available, I suppose." In case of no sub-tasks, the situation is a little different: the task also finishes after about 25 to 30 minutes, and even yields a few credit points. Also, in the task list the status does not say "computation error", and there is no error message in stderr like the one I cited. So I am sure the problem is a different one. |
Send message Joined: 24 Oct 04 Posts: 1174 Credit: 54,887,670 RAC: 8,563 |
30 of these as usual in the only hours I actually get to sleep https://lhcathome.cern.ch/lhcathome/result.php?resultid=411172457 |
Send message Joined: 14 Jan 10 Posts: 1422 Credit: 9,484,585 RAC: 1,882 |
"So I am sure the problem is a different one." OK . . ., but in the past no credits were given when the very first sub-job for a task did not make it to the VM. |
Send message Joined: 19 Jul 18 Posts: 5 Credit: 313,989 RAC: 29 |
Hi all, yes, there is a problem with a configuration file we have modified, and the CMS pilot job doesn't allow the connection of VMs to the condor pool. We will let you know as soon as possible. Cheers and sorry, Federica |
Send message Joined: 18 Dec 15 Posts: 1821 Credit: 118,922,069 RAC: 33,209 |
Federica, thanks for the information. Wouldn't it make sense to stop sending tasks until the problem is solved? |
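
Two figures quoted in the thread can be checked with simple arithmetic: the "2 idle cores per VM" that result when a 4-core VM receives only two single-core jobs, and the average number of busy cores implied by a task's run time and CPU time. The following is a minimal sketch in Python; the function names `idle_cores` and `average_busy_cores` are hypothetical helpers introduced here for illustration and are not part of BOINC or the CMS software.

```python
from datetime import timedelta


def idle_cores(vm_cores: int, jobs_per_vm: int, cores_per_job: int = 1) -> int:
    """Cores assigned to the VM that no job is using (hypothetical helper)."""
    return max(vm_cores - jobs_per_vm * cores_per_job, 0)


def average_busy_cores(run_time: timedelta, cpu_time: timedelta) -> float:
    """Average number of cores kept busy: total CPU time divided by wall-clock run time."""
    return cpu_time / run_time


# 4-core VM running two single-core jobs, as described in the thread.
print(idle_cores(vm_cores=4, jobs_per_vm=2))

# Run time and CPU time reported for the task that was awarded 7.75 credits.
run = timedelta(hours=13, minutes=22, seconds=46)
cpu = timedelta(days=1, hours=22, minutes=56, seconds=57)
print(round(average_busy_cores(run, cpu), 2))
```

With the numbers from the thread, the first call gives 2 idle cores and the second gives roughly 3.5 busy cores on average. This only restates the reported figures; it says nothing about how BOINC's CreditNew scheme turns them into credit.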
©2024 CERN