Message boards : CMS Application : CMS@Home difficulties in attempts to prepare for multi-core jobs



Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1173
Credit: 54,814,776
RAC: 15,804
Message 50204 - Posted: 18 May 2024, 19:22:33 UTC - in response to Message 50203.  


Wouldn't it make sense to stop sending tasks until the problem is solved?

I agree with that, Erich, and hope we get a message from Federica before starting up again, instead of a wild guess.
ID: 50204
FanzaFede
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 19 Jul 18
Posts: 5
Credit: 313,989
RAC: 85
Message 50214 - Posted: 20 May 2024, 7:27:11 UTC - in response to Message 50204.  

Hi all,
Sorry for the late answer. I have just aborted the last workflow, submitted on Friday night.
The problem is that without sending tasks we cannot test changes and verify whether we have fixed the problem or not.
So we will send new tasks when we make a new change to the configuration files. We will let you know.
Federica
ID: 50214
Erich56
Joined: 18 Dec 15
Posts: 1814
Credit: 118,463,293
RAC: 29,862
Message 50271 - Posted: 28 May 2024, 18:59:33 UTC

During the past few hours, the BOINC event log has shown the following when trying to download CMS tasks:

28.05.2024 20:47:49 | LHC@home | Sending scheduler request: To fetch work.
28.05.2024 20:47:49 | LHC@home | Requesting new tasks for CPU
28.05.2024 20:47:50 | LHC@home | Scheduler request completed: got 0 new tasks
28.05.2024 20:47:50 | LHC@home | Didn't resend lost task MqIODm8WbS5nsSi4apGgGQJmABFKDmABFKDm4ySLDmSpSKDmbdCSCo_0 (expired)
28.05.2024 20:47:50 | LHC@home | Didn't resend lost task JbLMDm8fzS5n9Rq4apoT9bVoABFKDmABFKDmlqFKDm1zMKDmc1sO4m_0 (expired)
28.05.2024 20:47:50 | LHC@home | Didn't resend lost task ie1MDm212S5nsSi4apGgGQJmABFKDmABFKDm4ySLDmFpUKDmdsp22n_0 (expired)
28.05.2024 20:47:50 | LHC@home | Didn't resend lost task B6CNDmwJ7S5nsSi4apGgGQJmABFKDmABFKDm4ySLDmlCVKDmRuwkIm_0 (expired)
28.05.2024 20:47:50 | LHC@home | Didn't resend lost task 6EoNDm4z8S5n9Rq4apoT9bVoABFKDmABFKDmlqFKDmtNNKDmsLoG1m_0 (expired)
28.05.2024 20:47:50 | LHC@home | No tasks sent

instead of just saying "No tasks are available for CMS" (at least I guess that is the case, as the project status page shows zero unsent tasks).
ID: 50271
Erich56
Joined: 18 Dec 15
Posts: 1814
Credit: 118,463,293
RAC: 29,862
Message 50276 - Posted: 29 May 2024, 6:09:46 UTC

Re my above posting from yesterday evening:

I am only now realizing that the task names mentioned in the BOINC event log are for ATLAS tasks, NOT CMS tasks.
This makes everything even weirder.
ID: 50276
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1173
Credit: 54,814,776
RAC: 15,804
Message 50336 - Posted: 5 Jun 2024, 8:17:38 UTC

Since I downloaded that new version of BOINC (8.0.2), it has started doing these strange things (only 2 so far), but I have another one running to see if I can get 3 Invalids in a row:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=411570028
https://lhcathome.cern.ch/lhcathome/result.php?resultid=411575125
ID: 50336
Erich56
Joined: 18 Dec 15
Posts: 1814
Credit: 118,463,293
RAC: 29,862
Message 50342 - Posted: 5 Jun 2024, 11:19:38 UTC - in response to Message 50336.  

Since I downloaded that new version of BOINC (8.0.2), it has started doing these strange things (only 2 so far), but I have another one running to see if I can get 3 Invalids in a row:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=411570028
https://lhcathome.cern.ch/lhcathome/result.php?resultid=411575125
Interesting information in stderr:

2024-06-04 22:09:11 (2016): VM is no longer is a running state. It is in 'stuck'.
2024-06-04 22:09:11 (2016): VM state change detected. (old = 'running', new = 'stuck')

whatever this means ... ?
ID: 50342
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 50348 - Posted: 6 Jun 2024, 11:53:55 UTC

Having said that I'm only going to submit 4-core jobs in the near future, today I discovered that at least one of our scripts (the one that updates the website with our logs, under "Show graphics" in boincmgr) doesn't work with multi-core VMs. So I'm going to submit a small batch of single-core jobs in the hope that I can catch a task and check whether the script works with it.
Preferably, don't set your hosts to run single-core tasks while I do these checks.
ID: 50348
maeax
Joined: 2 May 07
Posts: 2243
Credit: 173,902,375
RAC: 2,013
Message 50349 - Posted: 6 Jun 2024, 12:45:39 UTC
Last modified: 6 Jun 2024, 13:15:57 UTC

06.06.2024 14:43:29 | LHC@home | No tasks are available for CMS Simulation
prefs: Max # of tasks 2
Max # of CPUs 4
BOINC 8.0.2 - Win11 Pro
-------------------------------
Another PC:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10797673
prefs: Max # of tasks 2
Max # of CPUs 4
BOINC 7.24.1 - Win11 Pro
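
The same limits can also be set locally, instead of via the web prefs, with an app_config.xml in the LHC@home project folder. A minimal sketch -- the app name "CMS" is an assumption, check client_state.xml on your host for the exact name, and reload with "boinccmd --read_cc_config":

  <app_config>
    <app>
      <!-- app name is an assumption; check client_state.xml -->
      <name>CMS</name>
      <max_concurrent>2</max_concurrent>
    </app>
  </app_config>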
ID: 50349
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 50350 - Posted: 6 Jun 2024, 13:51:23 UTC

Dang! Looks like someone got in ahead of me -- all 100 tasks have been snaffled up before I could get a look in...
However, Laurence has already patched the script, so we'll see how it goes with further new tasks.
ID: 50350
maeax
Joined: 2 May 07
Posts: 2243
Credit: 173,902,375
RAC: 2,013
Message 50351 - Posted: 6 Jun 2024, 14:24:05 UTC - in response to Message 50350.  

Next two started; now 4 multi-core CMS tasks running:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10797673
ID: 50351
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,822,106
RAC: 36,869
Message 50352 - Posted: 6 Jun 2024, 14:52:50 UTC - in response to Message 50350.  

Got 2 tasks.
Each is running 2 cmsRun instances (obviously single-core) side by side in a 4-core VM.

The modified "watch_logs" script does not work.
Most likely the logs it tries to link into the Apache folder do not exist in the job folder.
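
A quick way to see the mismatch from the VM console (a sketch, using only paths already mentioned in this thread):

  # List the StarterLogs the glidein actually wrote ...
  find /tmp/glide_* -name 'StarterLog*' 2>/dev/null
  # ... and compare with what is linked into the Apache folder:
  ls -l /var/www/html/logs/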
ID: 50352
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 50354 - Posted: 6 Jun 2024, 22:17:16 UTC - in response to Message 50352.  

Got 2 tasks.
Each is running 2 cmsRun instances (obviously single-core) side by side in a 4-core VM.

The modified "watch_logs" script does not work.
Most likely the logs it tries to link into the Apache folder do not exist in the job folder.

Exactly right; the script was looking for StarterLog.slot1, but the multi-core jobs use StarterLog.slot1_1. A simple wildcard addition cleared that hurdle at least.
ID: 50354
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2534
Credit: 253,822,106
RAC: 36,869
Message 50355 - Posted: 7 Jun 2024, 7:06:03 UTC - in response to Message 50354.  

Got 2 tasks.
Each is running 2 cmsRun instances (obviously single-core) side by side in a 4-core VM.

The modified "watch_logs" script does not work.
Most likely the logs it tries to link into the Apache folder do not exist in the job folder.

Exactly right; the script was looking for StarterLog.slot1, but the multi-core jobs use StarterLog.slot1_1. A simple wildcard addition cleared that hurdle at least.


Right, the script can now find a log like "StarterLog.slot1_1" (the pattern should be quoted so the shell doesn't expand it before find sees it):
  log_file=$(find /tmp/glide_* -name 'StarterLog.slot1*' | head -n 1)

But it still tries to link "${log_dir}/StarterLog", which does not exist:
  ln -Pf ${log_dir}/StarterLog /var/www/html/logs/StarterLog

The responsible script line should be changed to something like this:
  ln -Pf ${log_dir}/$(basename ${log_file}) /var/www/html/logs/StarterLog
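
Putting those pieces together, the patched sequence could look roughly like this -- only a sketch built from the lines above; the real watch_logs script has more around it, e.g. how log_dir is actually set:

  # Match both single-core (StarterLog.slot1) and multi-core (StarterLog.slot1_1) names.
  log_file=$(find /tmp/glide_* -name 'StarterLog.slot1*' 2>/dev/null | head -n 1)
  [ -n "${log_file}" ] || exit 0                # no StarterLog yet; try again later
  log_dir=$(dirname "${log_file}")              # assumption: log_dir is the log's own folder
  # Hard-link the slot log under the fixed name the web page expects.
  ln -Pf "${log_dir}/$(basename "${log_file}")" /var/www/html/logs/StarterLog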
ID: 50355
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1418
Credit: 9,460,759
RAC: 2,399
Message 50356 - Posted: 7 Jun 2024, 11:55:47 UTC - in response to Message 50355.  

The running.log (ALT-F2) of the active sub-task stays empty until the job has finished. Then it is suddenly filled with the whole ~144,105 lines of text of the finished job.
When a new sub-task starts, the running.log is copied to finished_#.log and emptied.
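
One way to watch for this from the VM console (a polling sketch; the path is an assumption based on the Apache log folder mentioned earlier in this thread):

  # Report once a minute whether running.log has received any text yet.
  while sleep 60; do
      printf '%s: ' "$(date '+%H:%M:%S')"
      wc -l /var/www/html/logs/running.log 2>/dev/null || echo 'no running.log yet'
  done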
ID: 50356
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 50357 - Posted: 7 Jun 2024, 18:16:59 UTC - in response to Message 50356.  

The running.log (ALT-F2) of the active sub-task stays empty until the job has finished. Then it is suddenly filled with the whole ~144,105 lines of text of the finished job.
When a new sub-task starts, the running.log is copied to finished_#.log and emptied.

Yes, it didn't use to be like that. My best guess at the moment is that cmsRun behaviour has changed and the logs aren't written until the job ends.

Are we actually getting more than one job per task nowadays? From my computer times, I'm thinking not. I could halve the job time to check, but that may get up some people's noses.
ID: 50357
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1418
Credit: 9,460,759
RAC: 2,399
Message 50358 - Posted: 7 Jun 2024, 20:28:48 UTC - in response to Message 50357.  
Last modified: 8 Jun 2024, 5:55:13 UTC

Are we actually getting more than one job per task nowadays? From my computer times, I'm thinking not. I could halve the job time to check, but that may get up some people's noses.
This morning I started 2 multi-core tasks on hostid 10690380, and both tasks have run 4 jobs inside the VM.
The first finished task did 12 sequences of 4 jobs: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411700593
The second will finish soon, having done 13 sequences of 4 jobs: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411701063
At least that's my understanding of 4-core multi-core: the task runs 4 jobs concurrently, not 1 job divided over 4 threads.

Edit: Even after 13.5 hours, the second task requested a new sub-task and was not shut down by the application, which normally happens after >12 hours of runtime.
So I decided to shut down the task gracefully, to be able to shut down the host for its overnight rest.
ID: 50358
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 50359 - Posted: 8 Jun 2024, 9:24:02 UTC - in response to Message 50358.  

Are we actually getting more than one job per task nowadays? From my computer times, I'm thinking not. I could halve the job time to check, but that may get up some people's noses.
This morning I started 2 multi-core tasks on hostid 10690380, and both tasks have run 4 jobs inside the VM.
The first finished task did 12 sequences of 4 jobs: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411700593
The second will finish soon, having done 13 sequences of 4 jobs: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411701063
At least that's my understanding of 4-core multi-core: the task runs 4 jobs concurrently, not 1 job divided over 4 threads.
No, the 4-core jobs run one process with four threads. You can see this in the "top" page (Alt-F3) in the console view -- in the midst of a job you should see the master cmsRun process and four instances of "external generator" or some-such.
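
Besides the "top" page, a quick check from the console while a job is active (a sketch, assuming pgrep and ps are available inside the VM):

  # Find the master cmsRun process and count its threads.
  pid=$(pgrep -o cmsRun)      # -o: oldest matching process, i.e. the master
  ps -o nlwp= -p "${pid}"     # number of threads; expect ~4 for a 4-core job
  top -H -p "${pid}"          # or watch the individual threads live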

Edit: Even after 13.5 hours, the second task requested a new sub-task and was not shut down by the application, which normally happens after >12 hours of runtime.
So I decided to shut down the task gracefully, to be able to shut down the host for its overnight rest.

However, that seems to answer my concern that tasks were finishing after just one job.
ID: 50359
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1418
Credit: 9,460,759
RAC: 2,399
Message 50360 - Posted: 8 Jun 2024, 12:36:53 UTC - in response to Message 50359.  

No, the 4-core jobs run one process with four threads. You can see this in the "top" page (Alt-F3) in the console view -- in the midst of a job you should see the master cmsRun process and four instances of "external generator" or some-such.
Thanks Ivan. Yes, I've seen the cmsRun process popping up every now and then.
I caught a running.log . . . So the 4-core multi-core tasks process 120,000 events with 4 workers.
The 'old' single-core tasks did 10,000 events IIRC, or has that been increased meanwhile for the newer single-core jobs?
ID: 50360
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 50361 - Posted: 8 Jun 2024, 13:07:00 UTC - in response to Message 50360.  
Last modified: 8 Jun 2024, 13:18:53 UTC

No, the 4-core jobs run one process with four threads. You can see this in the "top" page (Alt-F3) in the console view -- in the midst of a job you should see the master cmsRun process and four instances of "external generator" or some-such.
Thanks Ivan. Yes, I've seen the cmsRun process popping up every now and then.
I caught a running.log . . . So the 4-core multi-core tasks process 120,000 events with 4 workers.
The 'old' single-core tasks did 10,000 events IIRC, or has that been increased meanwhile for the newer single-core jobs?

The jobs are not really comparable; we are taking whatever config files we can pry from people who are not that interested in providing them. The "old" jobs had no physics filtering: each of the 10,000 events produced in each job was written to output. You can see the difference in the result files. The "old" jobs produced ~80 MB in ~2 hours for their 10,000 events. The latest workflow has 12,000 generated events per job (according to the grafana Jobs graphs, averaging ~10 hours of CPU time[1]), but after the required physics cuts are applied only 0.18% of the events are saved to output, giving about 2-3 MB per job.
[1] If you go to one of the job graphs, say the "Completed Jobs" one at https://lhcathome.cern.ch/lhcathome/cms_job.php, and click on the CMS Job Monitoring - ES agg data header, you can see in the Memory Usage/CPU Efficiency/Cores used section the average number of cores being used -- the default is the last seven days, which encompasses a few of the small single-core workflows I submitted for debugging.
ID: 50361
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 1060
Credit: 7,737,455
RAC: 1,317
Message 50365 - Posted: 9 Jun 2024, 14:05:38 UTC

Sorry, Sunday.... :-/
New workflow started.
ID: 50365