CMS jobs are becoming available again
Author | Message |
---|---|
Joined: 29 Aug 05 Posts: 998 Credit: 6,264,307 RAC: 71 |
Hi, I'm back (as my colleague Daniela would say, "from the slough of despair"). You may have noticed some small batches of CMS jobs running at LHC@Home-dev over the last few months as we ironed out several technical and a few more political issues. As it stands, the status is that:
Laurence's VMs, running as T3_CH_CMSAtHome, can merge the result files into much larger (~3 TB) files and write these onto central CMS storage; they can also merge the logs of the MC jobs onto CMS storage, and similarly write the logs of the merge jobs themselves.
|
Joined: 15 Jun 08 Posts: 2386 Credit: 223,015,645 RAC: 136,238 |
Great news! Good to have you back, Ivan. Let's see if my mothballed CMS-BOINC-client can be reactivated. I guess a local proxy could be very helpful. ;-D |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
Welcome back ivan and CMS. In https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4875#37189 it is mentioned that the RAM requirement for single-core CMS tasks is 2,048 MB (2 GB). Has that changed? |
Joined: 15 Jun 08 Posts: 2386 Credit: 223,015,645 RAC: 136,238 |
... report any problems here ... especially if the mix of CPU to network usage is out of balance. After this line: 2019-03-06 23:08:13 (126584): Guest Log: [INFO] CMS application starting. Check log files. the VM starts a couple of downloads at a very poor speed of 10-15 kb/s, e.g. primary_db, security, security/primary_db. <edit> Looks like the setup finally completes. Accidentally also on a second client :-D But it looks like there are no subtasks and the machines idle. ?? </edit> |
Joined: 18 Dec 15 Posts: 1686 Credit: 100,464,909 RAC: 103,972 |
I started crunching a task some 5 1/2 hours ago, and so far it seems to work well (the progress bar shows around 30% at this point). BTW: Ivan, welcome back :-))) |
Joined: 15 Jun 08 Posts: 2386 Credit: 223,015,645 RAC: 136,238 |
I started crunching a task some 5 1/2 hours ago, and so far it seems to work well (the progress bar shows around 30% at this point). The progress bar usually shows the fraction that is done until the watchdog will shut down the VM. 30 % of 18 h => 5.4 h Did you check the VM's console logs? Do they show jobs processing (ALT-F2, ALT-F4)? Does the top on ALT-F2 show a process running close to 100 % CPU or does it show an idle VM? |
Joined: 18 Dec 15 Posts: 1686 Credit: 100,464,909 RAC: 103,972 |
Did you check the VM's console logs? Hm, console_2 only says "Running job output should appear here", and console_4 says "Output of the job wrapper may appear here". That's all :-( The Windows Task Manager shows that 1 CPU core is busy with CMS at certain times, but NOT all the time. So, does this all mean that the task is NOT being processed properly? |
Joined: 14 Jan 10 Posts: 1268 Credit: 8,421,637 RAC: 1,939 |
So, does this all mean that the task is NOT being processed properly? ALT-F3 shows the 'top' command output. When a job is actively running, you'll see at the top a process called cmsRun using the most CPU cycles. |
Joined: 18 Dec 15 Posts: 1686 Credit: 100,464,909 RAC: 103,972 |
ALT-F3 shows the 'top' command output. Thanks for the information. The entries in console_3 seem okay. |
Joined: 29 Aug 05 Posts: 998 Credit: 6,264,307 RAC: 71 |
Welcome back ivan and CMS. I don't think so; that's the CMS guideline, but it will vary from workflow to workflow. |
Joined: 29 Aug 05 Posts: 998 Credit: 6,264,307 RAC: 71 |
... report any problems here ... especially if the mix of CPU to network usage is out of balance. Yes, I've noticed that the first job in a task is very slow to start up. There must be a bottleneck somewhere in downloading files. Subsequent jobs seem to start faster, so I presume that the files are then being fetched from the local cvmfs cache. |
Joined: 29 Aug 05 Posts: 998 Credit: 6,264,307 RAC: 71 |
I started crunching a task some 5 1/2 hours ago, and so far it seems to work well (the progress bar shows around 30% at this point). Those windows may not be properly updated at present -- we now run jobs within a singularity container (no, not a black hole!), at CMS's insistence, and the script hadn't been adjusted to reflect that. We tried to correct that this afternoon but it seems there was a typo in the patch; now the rundown script (in the VM console, ALT-F1) complains about an unmatched quote character and hangs. I'm waiting for a patch for the patch... [Edit] Actually that patch was for the task ending with the message that condor hadn't processed any jobs. The lack of stdout/stderr in the console windows is a related problem but I don't think it's been addressed yet. It will be soon... [/Edit] [Edit2] Things are unstable now with the failed patch. Probably best to set No New Tasks overnight, and hopefully we can get it fixed tomorrow. [/Edit2] |
Joined: 29 Aug 05 Posts: 998 Credit: 6,264,307 RAC: 71 |
|
Joined: 18 Dec 15 Posts: 1686 Credit: 100,464,909 RAC: 103,972 |
CMS is running okay now. Ivan, thanks again for your efforts. However, something seems to be strange (not to say "wrong") with the credit points: CMS tasks earn only about a third of what is earned for Theory tasks. How come? |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
CMS is running okay now. Erich, I have completed three, with five more now running OK. But the CPU load is only about 55%. Does that indicate that they are not getting enough work? I have not tried to open them up to check in detail. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
But the last five that I had running all failed with "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT". They ran from about 15 to 17 1/2 hours. I did not reboot, or have any communications problems at my end. And this is a dedicated machine, with nothing running but those work units on BOINC (and a GPU work unit also). https://lhcathome.cern.ch/lhcathome/result.php?resultid=218876493 https://lhcathome.cern.ch/lhcathome/result.php?resultid=218876492 https://lhcathome.cern.ch/lhcathome/result.php?resultid=218876490 https://lhcathome.cern.ch/lhcathome/result.php?resultid=218874938 https://lhcathome.cern.ch/lhcathome/result.php?resultid=218876193 |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
CMS is running okay now. What do you mean by "CPU load"? Do you mean top's %cpu or do you mean the ratio of cpu time to run time? The only CMS task I've completed since the last update reported by Ivan showed a fairly constant ~99 %cpu in top, but the ratio of cpu time to run time is 45,764.33/64,179.97 = 71%. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
What do you mean by "CPU load"? Do you mean top's %cpu or do you mean the ratio of cpu time to run time? The only CMS task I've completed since the last update reported by Ivan showed a fairly constant ~99 %cpu in top, but the ratio of cpu time to run time is 45,764.33/64,179.97 = 71%. I mean the latter: the ratio of cpu time to run time. It seems to me that it should be higher, but maybe it is normal for this project. |
Joined: 18 Dec 15 Posts: 1686 Credit: 100,464,909 RAC: 103,972 |
Here too, the ratio of total run time to CPU time was not too good: 64,861.49 vs. 39,351.53 secs. |
Joined: 26 Oct 18 Posts: 90 Credit: 4,188,598 RAC: 0 |
My first task went 29,890.44 / 64,890.29 = 0.46. Feels like a task comes with a load of shielding gas built into it. |
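
For anyone wanting to put the figures quoted in this thread side by side: the "CPU load" being discussed is just CPU time divided by run time. Below is a minimal Python sketch of that arithmetic, using only the three pairs of numbers posted above. The function name `cpu_efficiency` is made up for illustration and is not part of BOINC or any CMS tooling.

```python
def cpu_efficiency(cpu_seconds: float, run_seconds: float) -> float:
    """Return CPU time as a fraction of wall-clock run time."""
    if run_seconds <= 0:
        raise ValueError("run time must be positive")
    return cpu_seconds / run_seconds


# (CPU time, run time) in seconds, copied from the posts in this thread.
reported = [
    (45_764.33, 64_179.97),
    (39_351.53, 64_861.49),
    (29_890.44, 64_890.29),
]

efficiencies = [cpu_efficiency(cpu, run) for cpu, run in reported]

for (cpu, run), eff in zip(reported, efficiencies):
    # Print each pair with its ratio as a whole-number percentage.
    print(f"{cpu:>10,.2f} s CPU / {run:>10,.2f} s run = {eff:.0%}")
```

For the three tasks reported here this works out to roughly 71%, 61% and 46%. The ratio says nothing by itself about why the VM was idle for the rest of the run time.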