Message boards : CMS Application : CMS@Home difficulties in attempts to prepare for multi-core jobs

maeax
Joined: 2 May 07
Posts: 2132
Credit: 160,325,691
RAC: 34,538
Message 50061 - Posted: 28 Apr 2024, 5:52:23 UTC - in response to Message 50057.  

Run time 5 hours 58 min 38 sec
CPU time 14 hours 40 min 18 sec
Validate state Valid
Credit 3.56

Run time 3 hours 9 min 48 sec
CPU time 5 hours 46 min 57 sec
Validate state Valid
Credit 1.88

Run time 5 hours 50 min 56 sec
CPU time 14 hours 38 min 39 sec
Validate state Valid
Credit 3.63

Excuse me, but what??

I'm seeing the same. The CMS tasks themselves are more important.
BOINC has always had this system error (if it is an error at all).
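
As a rough cross-check of the figures quoted above, the run time, CPU time and credit of each task can be turned into average busy cores and credit per CPU hour. This is only an illustrative sketch: the helper names `parse_duration` and `summarize` are made up for this example, and it says nothing about how BOINC itself computes credit.

```python
import re

# Seconds per unit, matching the wording used on the task result pages.
UNIT_SECONDS = {"days": 86400, "hours": 3600, "min": 60, "sec": 1}

def parse_duration(text: str) -> int:
    """Convert e.g. '5 hours 58 min 38 sec' into a number of seconds."""
    pairs = re.findall(r"(\d+)\s+(days|hours|min|sec)", text)
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in pairs)

def summarize(run_time: str, cpu_time: str, credit: float) -> tuple[float, float]:
    """Return (average busy cores, credit per CPU hour) for one task."""
    run_s = parse_duration(run_time)
    cpu_s = parse_duration(cpu_time)
    return cpu_s / run_s, credit * 3600 / cpu_s

# The three tasks quoted in this post.
tasks = [
    ("5 hours 58 min 38 sec", "14 hours 40 min 18 sec", 3.56),
    ("3 hours 9 min 48 sec", "5 hours 46 min 57 sec", 1.88),
    ("5 hours 50 min 56 sec", "14 hours 38 min 39 sec", 3.63),
]

for run_time, cpu_time, credit in tasks:
    cores, per_hour = summarize(run_time, cpu_time, credit)
    print(f"{cores:.2f} busy cores on average, {per_hour:.3f} credit per CPU hour")
```

All three tasks come out well below one credit per CPU hour, which is the pattern being questioned here.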
ID: 50061
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 21,938,418
RAC: 5,111
Message 50063 - Posted: 28 Apr 2024, 6:24:55 UTC

Run time 13 hours 22 min 46 sec
CPU time 1 days 22 hours 56 min 57 sec
Validate state Valid
Credit 7.75

Excuse me CMS admin? A moment of your time?

What the hell??
ID: 50063
maeax
Joined: 2 May 07
Posts: 2132
Credit: 160,325,691
RAC: 34,538
Message 50065 - Posted: 28 Apr 2024, 6:35:10 UTC - in response to Message 50063.  

The CMS team is not responsible for these credit points.
That is the BOINC system. You can search the forums;
there are a lot of messages about it.
ID: 50065
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 21,938,418
RAC: 5,111
Message 50067 - Posted: 28 Apr 2024, 7:05:46 UTC - in response to Message 50065.  

That's a cop-out. Every other project is awarding points correctly. If it were not CMS related then Atlas and Theory would be showing the same issues. If it were VBox related then other VBox projects would show the same issues. They don't.
ID: 50067
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2439
Credit: 229,911,272
RAC: 134,230
Message 50068 - Posted: 28 Apr 2024, 7:40:25 UTC - in response to Message 50067.  

Every now and then somebody complains about low credits (weird, nobody ever complains about too many credits, lol).

The answer is always the same:
Credit calculation is built into BOINC.
LHC@home does not change the relevant code parts as they also affect other things, e.g. work fetch calculation.

To understand how it works, see:
https://boinc.berkeley.edu/trac/wiki/CreditNew

Change requests have to be made here:
https://github.com/BOINC/boinc
ID: 50068
Dark Angel
Joined: 7 Aug 11
Posts: 93
Credit: 21,938,418
RAC: 5,111
Message 50069 - Posted: 28 Apr 2024, 8:49:23 UTC
Last modified: 28 Apr 2024, 8:50:25 UTC

So people are getting punished (host punishment, discussed on the linked page) for having to abort huge numbers of single core tasks that were never going to do actual work because the CMS team couldn't be arsed clearing out the work caches.
People punished because you lot didn't do your jobs.
And you're sneering down on people who don't appreciate being screwed around.
Real nice.

And the credit system used is a CHOICE made by each project. There are options. Bleating that the users need to complain elsewhere is another cop-out.
ID: 50069
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2439
Credit: 229,911,272
RAC: 134,230
Message 50070 - Posted: 28 Apr 2024, 9:21:57 UTC - in response to Message 50069.  

Stop making that kind of comment!

You have been told the facts.
Blaming people here for things you don't understand or accept is not respectful, nor does it resolve your complaint.
ID: 50070
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1302
Credit: 8,664,918
RAC: 6,783
Message 50105 - Posted: 2 May 2024, 12:29:15 UTC - in response to Message 50025.  

Since this morning we're getting tasks of a new batch created by Ivan.
To me these seem to be single-core jobs.
At least, the first job this morning in that virtual machine was using 4 threads, while the second job this afternoon uses only 1 thread (cmsRun at 100%).
ID: 50105
maeax
Joined: 2 May 07
Posts: 2132
Credit: 160,325,691
RAC: 34,538
Message 50107 - Posted: 3 May 2024, 8:16:23 UTC - in response to Message 50105.  
Last modified: 3 May 2024, 8:41:33 UTC

2024-05-03 10:06:33 (21200): Setting CPU Count for VM. (4)
For me it is 4-core.
The differential .vdi uses 2 MByte out of 20 GByte.
After half an hour, the console still says:
Running job output should appear here.
No job is visible inside the task.
The properties of the BOINC task show this:
CPU time
01:01:30
CPU time since last checkpoint
00:52:12
It seems to work.
ID: 50107
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2439
Credit: 229,911,272
RAC: 134,230
Message 50136 - Posted: 6 May 2024, 17:51:19 UTC

@Ivan
Please explain what's going on.

The last CMS batch before the WMAgent update was a 4-core batch.
Lots of volunteer machines are now configured to run 4-core jobs.

First batch after the upgrade is a 1-core batch but 4-core VMs get only 2 jobs per VM.
This results in 2 idle cores per VM that can't be given back to BOINC for other work.
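
The arithmetic behind the idle cores can be written out explicitly. This is a minimal sketch; the function name `idle_cores` is invented for illustration and is not part of BOINC or the CMS tooling.

```python
def idle_cores(vm_cores: int, jobs_per_vm: int, cores_per_job: int) -> int:
    """Cores allocated to a VM that no job inside it is using."""
    used = jobs_per_vm * cores_per_job
    if used > vm_cores:
        raise ValueError("jobs need more cores than the VM provides")
    return vm_cores - used

# Situation described above: 4-core VM, 2 single-core jobs.
print(idle_cores(vm_cores=4, jobs_per_vm=2, cores_per_job=1))  # prints 2
```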
ID: 50136
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2439
Credit: 229,911,272
RAC: 134,230
Message 50139 - Posted: 6 May 2024, 18:47:48 UTC

Although BOINC shows valid tasks, CERN Grafana shows a 100% failure rate for the current CMS batch.
ID: 50139
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1009
Credit: 6,313,488
RAC: 1,377
Message 50142 - Posted: 7 May 2024, 13:30:44 UTC - in response to Message 50139.  

Although BOINC shows valid tasks, CERN Grafana shows a 100% failure rate for the current CMS batch.

Unfortunately, an incompatibility (maybe several) was introduced with the WMAgent upgrade. We're trying to understand it/them, so there may not be many jobs submitted until we get a handle on the changes.
We've also been trying to understand whether it's possible to have single- and quad-core jobs in the queue simultaneously -- hence the number of small single-core workflows a couple of days ago. I think it's going to be hard to come to a consensus on this, but we are racking our brains...
ID: 50142
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2439
Credit: 229,911,272
RAC: 134,230
Message 50151 - Posted: 9 May 2024, 7:53:54 UTC - in response to Message 50139.  

Although BOINC shows valid tasks, CERN Grafana shows a 100% failure rate for the current CMS batch.

Still 100% failure rate.
Even with the recent 4-core batch.
ID: 50151
Erich56
Joined: 18 Dec 15
Posts: 1701
Credit: 105,663,600
RAC: 72,874
Message 50193 - Posted: 17 May 2024, 13:35:45 UTC
Last modified: 17 May 2024, 13:37:31 UTC

Tasks now fail after about 23 minutes. Excerpt from stderr:

<message>
The wildcard characters for file names (* or ?) were entered incorrectly, or too many wildcard characters were specified.
(0xd0) - exit code 208 (0xd0)</message>

https://lhcathome.cern.ch/lhcathome/result.php?resultid=411165871

I remember that we had exactly this kind of error several months ago, but I do not remember what the exact reason was.

Grafana shows that this problem started at about 10:30 today.
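
For anyone scanning stderr files for this failure, the exit code can be extracted mechanically. The sketch below assumes the line format shown in the excerpt above; the regular expression and the function name `parse_exit_code` are illustrative and not part of any BOINC tool.

```python
import re
from typing import Optional

# Matches e.g. "exit code 208 (0xd0)" as seen in the stderr excerpt.
EXIT_CODE_RE = re.compile(r"exit code (\d+) \(0x([0-9a-fA-F]+)\)")

def parse_exit_code(line: str) -> Optional[int]:
    """Return the decimal exit code from a stderr line, or None if absent."""
    match = EXIT_CODE_RE.search(line)
    if match is None:
        return None
    decimal, hexadecimal = int(match.group(1)), int(match.group(2), 16)
    # The two notations should agree; a mismatch suggests a corrupted log line.
    if decimal != hexadecimal:
        raise ValueError(f"inconsistent exit code in line: {line!r}")
    return decimal

print(parse_exit_code("(0xd0) - exit code 208 (0xd0)</message>"))  # prints 208
```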
ID: 50193
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1302
Credit: 8,664,918
RAC: 6,783
Message 50194 - Posted: 17 May 2024, 14:02:58 UTC - in response to Message 50193.  

Probably no sub-tasks for CMS available, I suppose.
Normally the number of unsent BOINC CMS-envelope tasks should return to zero, but we know that this does not always work perfectly.
ID: 50194
Erich56
Joined: 18 Dec 15
Posts: 1701
Credit: 105,663,600
RAC: 72,874
Message 50195 - Posted: 17 May 2024, 14:22:00 UTC - in response to Message 50194.  
Last modified: 17 May 2024, 14:26:05 UTC

Probably no sub-tasks for CMS available, I suppose.
In case of no sub-tasks, the situation is a little different: the task also finishes after about 25/30 minutes, and even yields a few credit points. Plus in the tasks list, the status does not say "computation error"; also, there is no error message in stderr like the one I cited.
So I am sure the problem is a different one.
ID: 50195
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1133
Credit: 49,902,686
RAC: 5,088
Message 50199 - Posted: 17 May 2024, 20:01:51 UTC - in response to Message 50195.  

30 of these as usual in the only hours I actually get to sleep
https://lhcathome.cern.ch/lhcathome/result.php?resultid=411172457
ID: 50199
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1302
Credit: 8,664,918
RAC: 6,783
Message 50201 - Posted: 18 May 2024, 7:38:31 UTC - in response to Message 50195.  
Last modified: 18 May 2024, 7:41:01 UTC

So I am sure the problem is a different one.
OK...
but in the past no credits were given when the very first sub-job for a task did not make it to the VM.
ID: 50201
FanzaFede
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 19 Jul 18
Posts: 5
Credit: 130,305
RAC: 52
Message 50202 - Posted: 18 May 2024, 9:07:24 UTC

Hi all,
Yes, there is a problem with a configuration file we modified, and the CMS pilot job does not allow VMs to connect to the condor pool.
We will let you know as soon as possible.
Cheers and sorry,
Federica
ID: 50202
Erich56
Joined: 18 Dec 15
Posts: 1701
Credit: 105,663,600
RAC: 72,874
Message 50203 - Posted: 18 May 2024, 11:21:14 UTC

Federica, thanks for the information.

Wouldn't it make sense to stop sending tasks until the problem is solved?
ID: 50203


©2024 CERN