Message boards :
CMS Application :
CMS@Home difficulties in attempts to prepare for multi-core jobs
Send message Joined: 24 Oct 04 Posts: 1173 Credit: 54,814,776 RAC: 15,804 |
I agree with that, Erich, and hope that we get a message from Federica before starting up again, instead of a wild guess. |
Send message Joined: 19 Jul 18 Posts: 5 Credit: 313,989 RAC: 85 |
Hi all, sorry for the late answer. I have just aborted the last workflow, submitted on Friday night. The problem is that without sending tasks we cannot test changes and verify whether we have fixed the problem or not. So we will send new tasks when we make a new change in the configuration files. We'll let you know. Federica |
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,463,293 RAC: 29,862 |
During the past hours, the BOINC event log showed the following when trying to download CMS tasks:

```
28.05.2024 20:47:49 | LHC@home | Sending scheduler request: To fetch work.
28.05.2024 20:47:49 | LHC@home | Requesting new tasks for CPU
28.05.2024 20:47:50 | LHC@home | Scheduler request completed: got 0 new tasks
28.05.2024 20:47:50 | LHC@home | Didn't resend lost task MqIODm8WbS5nsSi4apGgGQJmABFKDmABFKDm4ySLDmSpSKDmbdCSCo_0 (expired)
28.05.2024 20:47:50 | LHC@home | Didn't resend lost task JbLMDm8fzS5n9Rq4apoT9bVoABFKDmABFKDmlqFKDm1zMKDmc1sO4m_0 (expired)
28.05.2024 20:47:50 | LHC@home | Didn't resend lost task ie1MDm212S5nsSi4apGgGQJmABFKDmABFKDm4ySLDmFpUKDmdsp22n_0 (expired)
28.05.2024 20:47:50 | LHC@home | Didn't resend lost task B6CNDmwJ7S5nsSi4apGgGQJmABFKDmABFKDm4ySLDmlCVKDmRuwkIm_0 (expired)
28.05.2024 20:47:50 | LHC@home | Didn't resend lost task 6EoNDm4z8S5n9Rq4apoT9bVoABFKDmABFKDmlqFKDmtNNKDmsLoG1m_0 (expired)
28.05.2024 20:47:50 | LHC@home | No tasks sent
```

instead of just saying "no tasks are available for CMS" (at least I guess that this is the case, as the project status page shows zero unsent tasks). |
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,463,293 RAC: 29,862 |
Re my above posting from yesterday evening: I am only now realizing that the task names mentioned in the BOINC event log are for ATLAS tasks, NOT CMS tasks. This makes everything even weirder. |
Send message Joined: 24 Oct 04 Posts: 1173 Credit: 54,814,776 RAC: 15,804 |
Since I installed that new version of BOINC (8.0.2), it has started doing these strange things (only 2 so far), but I have another one running to see if I can get 3 invalids in a row: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411570028 https://lhcathome.cern.ch/lhcathome/result.php?resultid=411575125 |
Send message Joined: 18 Dec 15 Posts: 1814 Credit: 118,463,293 RAC: 29,862 |
> Since I did that new version of Boinc d/l (8.0.2) it started (only 2 so far) doing these strange things but I have another one running to see if I can get 3 Invalids in a row

Interesting information in stderr:

```
2024-06-04 22:09:11 (2016): VM is no longer is a running state. It is in 'stuck'.
2024-06-04 22:09:11 (2016): VM state change detected. (old = 'running', new = 'stuck')
```

whatever this means ... ? |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
Having said that I'm only going to submit 4-core jobs in the near future, today I discovered that at least one of our scripts (the one that updates the website with our logs, under "Show graphics" in boincmgr) doesn't work with multicore VMs. So I'm going to submit a small batch of single-core jobs in the hope that I can catch a task and check if the script works with it. For preference, don't set your hosts to run single-core tasks while I do these checks. |
Send message Joined: 2 May 07 Posts: 2243 Credit: 173,902,375 RAC: 2,013 |
```
06.06.2024 14:43:29 | LHC@home | No tasks are available for CMS Simulation
```

prefs: Max # of tasks 2, Max # of CPUs 4
BOINC 8.0.2 - Win11 Pro

-------------------------------

Another PC: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10797673
prefs: Max # of tasks 2, Max # of CPUs 4
BOINC 7.24.1 - Win11 Pro |
Send message Joined: 2 May 07 Posts: 2243 Credit: 173,902,375 RAC: 2,013 |
Next two started, now 4 multicore CMS running: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10797673 |
Send message Joined: 15 Jun 08 Posts: 2534 Credit: 253,822,106 RAC: 36,869 |
Got 2 tasks. Each is running 2 cmsRun instances (obviously single-core) side by side in a 4-core VM. The modified "watch_logs" script does not work. Most likely the logs it tries to link into the Apache folder do not exist in the job folder. |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
> Got 2 task.

Exactly right; the script was looking for StarterLog.slot1, but the multicore jobs use StarterLog.slot1_1. A simple wildcard addition cleared that hurdle, at least. |
Send message Joined: 15 Jun 08 Posts: 2534 Credit: 253,822,106 RAC: 36,869 |
> Got 2 task.

Right, the script can now find a log like "StarterLog.slot1_1":

```
log_file=$(find /tmp/glide_* -name StarterLog.slot1* | head -n 1)
```

But it still creates a link to "StarterLog", which does not exist:

```
ln -Pf ${log_dir}/StarterLog /var/www/html/logs/StarterLog
```

The responsible script line should be changed to something like this:

```
ln -Pf ${log_dir}/$(basename ${log_file}) /var/www/html/logs/StarterLog
```
|
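[Editor's note] Putting the wildcard find and the corrected link line from this exchange together, here is a minimal runnable sketch of the fix. The temporary directories are stand-ins for the real /tmp/glide_* job folder and /var/www/html/logs (neither exists outside the VM), so the paths here are assumptions, not the actual watch_logs script:

```shell
#!/bin/bash
# Sketch only: mktemp directories stand in for the glide_* job folder
# and the Apache logs folder used inside the CMS VM.
log_dir=$(mktemp -d)    # plays the role of the /tmp/glide_* job folder
web_dir=$(mktemp -d)    # plays the role of /var/www/html/logs

# Multicore jobs write StarterLog.slot1_1 instead of StarterLog.slot1
touch "${log_dir}/StarterLog.slot1_1"

# The wildcard matches either naming scheme (slot1 or slot1_1)
log_file=$(find "${log_dir}" -name 'StarterLog.slot1*' | head -n 1)

# Hard-link the file that was actually found, not a hard-coded "StarterLog"
ln -Pf "${log_dir}/$(basename "${log_file}")" "${web_dir}/StarterLog"

ls "${web_dir}"
```

The same two lines also cover the old single-core case, since a folder containing StarterLog.slot1 matches the identical wildcard.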
Send message Joined: 14 Jan 10 Posts: 1418 Credit: 9,460,759 RAC: 2,399 |
The running.log (ALT-F2) of the active sub-task is empty until the job has finished. Then it is suddenly filled with the whole ~144,105 lines of text of the finished job. When a new sub-task starts, the running.log is copied to finished_#.log and emptied. |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
> The running.log (ALT-F2) of the active sub-task is empty until the job has finished. Then it's filled suddenly with the whole ~144.105 lines of text of the finished job.

Yes, it didn't used to be like that. My best guess at the moment is that cmsRun behaviour has changed and the logs aren't written until the job ends. Are we actually getting more than one job per task nowadays? From my computer times, I'm thinking not. I could halve the job time to check, but that may get up some people's noses. |
Send message Joined: 14 Jan 10 Posts: 1418 Credit: 9,460,759 RAC: 2,399 |
> Are we actually getting more than one job per task nowadays? From my computer times, I'm thinking not. I could halve the job time to check, but that may get up some people's noses.

I started 2 multi-core tasks this morning on hostid 10690380, and both tasks have run 4 jobs inside the VM. The first finished task did 12 sequences of 4 jobs: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411700593 The second will finish soon, having done 13 sequences of 4 jobs: https://lhcathome.cern.ch/lhcathome/result.php?resultid=411701063 At least that's my understanding of a 4-core multi-core task: the task is doing 4 jobs concurrently, not 1 job divided over 4 threads.

Edit: The second task, even after 13.5 hours, requested a new sub-task and was not shut down by the application, which normally happens after > 12 hours runtime. So I decided to shut the task down gracefully, to be able to shut down the host for its overnight rest. |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
> Are we actually getting more than one job per task nowadays? From my computer times, I'm thinking not. I could halve the job time to check, but that may get up some people's noses.

> I started this morning 2 multi-core tasks on hostid 10690380 and both tasks have ran 4 jobs inside the VM.

No, the 4-core jobs run one process with four threads. You can see this in the "top" page (Alt-F3) in the console view -- in the midst of a job you should see the master cmsRun process and four instances of "external generator" or some such.

However, that seems to answer my concern that tasks were finishing after just one job. |
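[Editor's note] For anyone wanting to check this from a shell inside the VM rather than eyeballing top, one way is to compare the process count with the thread count for cmsRun. The helper below is a generic sketch (cmsRun itself only exists inside the VM, so the function takes any command name):

```shell
# Sketch: count processes vs. kernel threads for a given command name.
# One 4-core job should show a single cmsRun process with several threads;
# four single-core jobs would show four separate cmsRun processes.
count_procs()   { ps -e  -o comm= | grep -cx "$1"; }
count_threads() { ps -eL -o comm= | grep -cx "$1"; }

count_procs cmsRun     # number of cmsRun processes
count_threads cmsRun   # total cmsRun threads across those processes
```

If count_procs reports 1 while count_threads reports several, the task is running one multi-threaded job, matching Ivan's description.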
Send message Joined: 14 Jan 10 Posts: 1418 Credit: 9,460,759 RAC: 2,399 |
> No, the 4-core jobs run one process with four threads. You can see this in the "top" page (Alt-F3) in the console view -- in the midst of a job you should see the master cmsRun process and four instances of "external generator" or some-such.

Thanks Ivan. Yes, I've seen the cmsRun process popping up every now and then. I caught a running.log . . . So the multi 4-core tasks do process 120,000 events by 4 workers. The 'old' single-core tasks in the past did 10,000 events IIRC, or has that been increased meanwhile for the newer single-core jobs? |
Send message Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
> > No, the 4-core jobs run one process with four threads. You can see this in the "top" page (Alt-F3) in the console view -- in the midst of a job you should see the master cmsRun process and four instances of "external generator" or some-such.
>
> Thanks Ivan. Yes I've seen the cmsRun process popping up every now and then.

The jobs are not really comparable; we are taking whatever config files we can pry from people who are not that interested in providing them. The "old" jobs had no physics filtering: each of the 10,000 events produced in each job was written to output. You can see the difference in the result files. The "old" jobs produced ~80 MB in ~2 hours for their 10,000 events. The latest workflow has 12,000 generated events per job (according to the Grafana Jobs graphs, averaging ~10 hours CPU time[1]), but after the required physics cuts are applied only 0.18% of the events are saved to output, giving about 2-3 MB per job.

[1] If you go to one of the job graphs, say the "Completed Jobs" one at https://lhcathome.cern.ch/lhcathome/cms_job.php, and click on the "CMS Job Monitoring - ES agg data" header, you can see in the Memory Usage/CPU Efficiency/Cores used section the average number of cores being used -- the default is the last seven days, which encompasses a few of the small single-core workflows I submitted for debugging. |
©2024 CERN