Message boards : CMS Application : CMS jobs are becoming available again
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,693,390
RAC: 1,876
Message 38122 - Posted: 6 Mar 2019, 16:57:08 UTC

Hi, I'm back (as my colleague Daniela would say, "from the slough of despair").
You may have noticed some small batches of CMS jobs running at LHC@Home-dev over the last few months as we ironed out several technical and a few more political issues. As it stands, the status is that:
    Volunteers (pseudo site T3_CH_Volunteer) can run CMS Monte-Carlo jobs and store them on the DataBridge;
    Laurence's VMs, running as T3_CH_CMSAtHome, can merge the result files into much larger (~3 TB) files and write these onto central CMS storage;
    They can also merge the logs of the MC jobs onto CMS storage, and similarly write the logs of the merge jobs themselves.


There are a few small book-keeping jobs that aren't being done yet, but they are not showstoppers in the greater scheme of things. So, I've been asked to stress test the system with much larger batches of jobs while the finer political points are tidied up.
Consequently, there are now jobs flowing for whoever wants to run them. Laurence has opened up the CMS queue at production LHC@Home as well. Remember that you can run multi-job tasks (i.e. multi-core) on -dev, but still only single-core tasks on production.
I'm currently slurping down a rather large number of the present batch (look at my machines to see what new toy I've snaffled up while I've been away), so I'm going to have to submit a much larger batch again tonight.
So, feel free to feed on the new jobs, and also to report any problems here or on the -dev site, especially if the mix of CPU to network usage is out of balance.


ID: 38122
computezrmle
Joined: 15 Jun 08
Posts: 1140
Credit: 56,057,989
RAC: 95,934
Message 38125 - Posted: 6 Mar 2019, 17:14:32 UTC - in response to Message 38122.  

Great news!
Good to have you back, Ivan.

Let's see if my mothballed CMS-BOINC-client can be reactivated.


I guess a local proxy could be very helpful.
;-D
ID: 38125
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,244,158
RAC: 10,266
Message 38126 - Posted: 6 Mar 2019, 18:04:46 UTC

Welcome back ivan and CMS.
In https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4875#37189 it is mentioned that the RAM requirement for single-core CMS tasks is 2,048 MB (2 GB). Has that changed?
ID: 38126
computezrmle
Joined: 15 Jun 08
Posts: 1140
Credit: 56,057,989
RAC: 95,934
Message 38133 - Posted: 6 Mar 2019, 22:25:16 UTC - in response to Message 38122.  
Last modified: 6 Mar 2019, 22:53:35 UTC

... report any problems here ... especially if the mix of CPU to network usage is out of balance

After this line:
2019-03-06 23:08:13 (126584): Guest Log: [INFO] CMS application starting. Check log files.

the VM starts a couple of downloads at a very poor speed of 10-15 kb/s, e.g. primary_db, security, security/primary_db.

<edit>
Looks like the setup finally completes.
Accidentally also on a second client :-D
But it looks like there are no subtasks and the machines idle.
??
</edit>
ID: 38133
Erich56
Joined: 18 Dec 15
Posts: 1127
Credit: 21,764,403
RAC: 28,941
Message 38140 - Posted: 7 Mar 2019, 11:43:04 UTC
Last modified: 7 Mar 2019, 11:43:25 UTC

I started crunching a task some 5 1/2 hours ago, and so far it seems to work well (the progress bar shows around 30% at this point).

BTW: Ivan, welcome back :-)))
ID: 38140
computezrmle
Joined: 15 Jun 08
Posts: 1140
Credit: 56,057,989
RAC: 95,934
Message 38141 - Posted: 7 Mar 2019, 11:53:49 UTC - in response to Message 38140.  

I started crunching a task some 5 1/2 hours ago, and so far it seems to work well (the progress bar shows around 30% at this point).

BTW: Ivan, welcome back :-)))

The progress bar usually shows the fraction that is done until the watchdog will shut down the VM.
30 % of 18 h => 5.4 h

Did you check the VM's console logs?
Do they show jobs processing (ALT-F2, ALT-F4)?
Does the top on ALT-F2 show a process running close to 100 % CPU or does it show an idle VM?
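
The progress-bar arithmetic above can be reproduced with a short sketch. It assumes, as the post suggests, that the bar tracks elapsed wall time against the 18 h watchdog limit rather than actual job progress; the function name and the constant are illustrative only and are not read from any BOINC or CMS configuration.

```python
# Assumption: the progress bar of a CMS task is elapsed wall time divided by
# the watchdog limit (18 h, as quoted in the post), not physics progress.
WATCHDOG_LIMIT_HOURS = 18.0  # hypothetical constant taken from the post


def elapsed_hours(progress_fraction: float,
                  limit_hours: float = WATCHDOG_LIMIT_HOURS) -> float:
    """Estimate elapsed wall-clock hours from the progress-bar fraction."""
    if not 0.0 <= progress_fraction <= 1.0:
        raise ValueError("progress_fraction must be between 0 and 1")
    return progress_fraction * limit_hours


if __name__ == "__main__":
    # 30 % of 18 h, the example from the post
    print(f"{elapsed_hours(0.30):.1f} h")
```

If that assumption holds, a bar at 30 % after roughly 5 1/2 hours says nothing about whether jobs are actually running inside the VM, which is why the console check is still needed.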
ID: 38141
Erich56
Joined: 18 Dec 15
Posts: 1127
Credit: 21,764,403
RAC: 28,941
Message 38143 - Posted: 7 Mar 2019, 17:06:04 UTC - in response to Message 38141.  

Did you check the VM's console logs?
Do they show jobs processing (ALT-F2, ALT-F4)?
Does the top on ALT-F2 show a process running close to 100 % CPU or does it show an idle VM?
Hm, console_2 only says "Running job output should appear here", and console_4 says "Output of the job wrapper may appear here". That's all :-(

The Windows Task Manager shows that 1 CPU core is busy with CMS at certain times, but NOT all the time.

So, does this all mean that the task is NOT being processed properly?
ID: 38143
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 749
Credit: 6,028,092
RAC: 700
Message 38144 - Posted: 7 Mar 2019, 17:39:47 UTC - in response to Message 38143.  

So, does this all mean that the task is NOT being processed properly?

ALT-F3 shows the 'top' command output.
When a job is actively running, you'll see at the top a process called cmsRun using the most CPU cycles.
ID: 38144
Erich56
Joined: 18 Dec 15
Posts: 1127
Credit: 21,764,403
RAC: 28,941
Message 38146 - Posted: 7 Mar 2019, 19:05:33 UTC - in response to Message 38144.  

ALT-F3 shows the 'top' command output.
When a job is actively running, you'll see at the top a process called cmsRun using the most CPU cycles.
Thanks for the information. The entries in console_3 seem okay.
ID: 38146
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,693,390
RAC: 1,876
Message 38148 - Posted: 7 Mar 2019, 19:38:23 UTC - in response to Message 38126.  

Welcome back ivan and CMS.
In https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4875#37189 it is mentioned that the RAM requirement for single-core CMS tasks is 2,048 MB (2 GB). Has that changed?

I don't think so; that's the CMS guideline, but it will vary from workflow to workflow.
ID: 38148
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,693,390
RAC: 1,876
Message 38149 - Posted: 7 Mar 2019, 19:40:47 UTC - in response to Message 38133.  

... report any problems here ... especially if the mix of CPU to network usage is out of balance

After this line:
2019-03-06 23:08:13 (126584): Guest Log: [INFO] CMS application starting. Check log files.

the VM starts a couple of downloads at a very poor speed of 10-15 kb/s, e.g. primary_db, security, security/primary_db.


Looks like the setup finally completes.
Accidentally also on a second client :-D
But it looks like there are no subtasks and the machines idle.
??

Yes, I've noticed that the first job in a task is very slow to start up. There must be a bottleneck somewhere in downloading files. Subsequent jobs seem to start faster so I presume that the files are then being fetched from the local cvmfs cache.
ID: 38149
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,693,390
RAC: 1,876
Message 38150 - Posted: 7 Mar 2019, 19:45:11 UTC - in response to Message 38141.  
Last modified: 7 Mar 2019, 19:55:16 UTC

I started crunching a task some 5 1/2 hours ago, and so far it seems to work well (the progress bar shows around 30% at this point).

BTW: Ivan, welcome back :-)))

The progress bar usually shows the fraction that is done until the watchdog will shut down the VM.
30 % of 18 h => 5.4 h

Did you check the VM's console logs?
Do they show jobs processing (ALT-F2, ALT-F4)?
Does the top on ALT-F2 show a process running close to 100 % CPU or does it show an idle VM?

Those windows may not be properly updated at present -- we now run jobs within a singularity container (no, not a black hole!), at CMS's insistence, and the script hadn't been adjusted to reflect that. We tried to correct that this afternoon but it seems there was a typo in the patch; now the rundown script (in the VM console, ALT-F1) complains about an unmatched quote character and hangs. I'm waiting for a patch for the patch...
[Edit] Actually that patch was for the task ending with the message that condor hadn't processed any jobs. The lack of stdout/stderr in the console windows is a related problem but I don't think it's been addressed yet. It will be soon...
[/Edit]
[Edit2] Things are unstable now with the failed patch. Probably best to set No New Tasks overnight, and hopefully we can get it fixed tomorrow. [/Edit2]
ID: 38150
ivan
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Project scientist
Joined: 29 Aug 05
Posts: 621
Credit: 4,693,390
RAC: 1,876
Message 38177 - Posted: 9 Mar 2019, 11:40:00 UTC

There was a hiatus due to our wmagent developing a fault. It took a while to get a message to the right people, but we are up and running again.
ID: 38177
Erich56
Joined: 18 Dec 15
Posts: 1127
Credit: 21,764,403
RAC: 28,941
Message 38208 - Posted: 10 Mar 2019, 12:53:39 UTC

CMS is running okay now.
Ivan, thanks again for your efforts.

However, something seems to be strange (not to say "wrong") with the credit points: CMS tasks earn only about a third of what is earned for Theory tasks. How come?
ID: 38208
Jim1348
Joined: 15 Nov 14
Posts: 329
Credit: 10,752,912
RAC: 15,242
Message 38209 - Posted: 10 Mar 2019, 13:17:07 UTC - in response to Message 38208.  

CMS is running okay now.

Erich,

I have completed three, with five more now running OK. But the CPU load is only about 55%. Does that indicate that they are not getting enough work?
I have not tried to open them up to check in detail.
ID: 38209
Jim1348
Joined: 15 Nov 14
Posts: 329
Credit: 10,752,912
RAC: 15,242
Message 38211 - Posted: 10 Mar 2019, 15:31:36 UTC - in response to Message 38209.  

But the last five that I had running all failed with "194 (0x000000C2) EXIT_ABORTED_BY_CLIENT".
They ran from about 15 to 17 1/2 hours. I did not reboot, or have any communications problems at my end.
And this is a dedicated machine, with nothing running but those work units on BOINC (and a GPU work unit also).

https://lhcathome.cern.ch/lhcathome/result.php?resultid=218876493
https://lhcathome.cern.ch/lhcathome/result.php?resultid=218876492
https://lhcathome.cern.ch/lhcathome/result.php?resultid=218876490
https://lhcathome.cern.ch/lhcathome/result.php?resultid=218874938
https://lhcathome.cern.ch/lhcathome/result.php?resultid=218876193
ID: 38211
bronco
Joined: 13 Apr 18
Posts: 443
Credit: 8,244,158
RAC: 10,266
Message 38217 - Posted: 10 Mar 2019, 21:17:09 UTC - in response to Message 38209.  

CMS is running okay now.

Erich,

I have completed three, with five more now running OK. But the CPU load is only about 55%. Does that indicate that they are not getting enough work?
I have not tried to open them up to check in detail.


What do you mean by "CPU load"? Do you mean top's %cpu, or do you mean the ratio of CPU time to run time? The only CMS task I've completed since the last update reported by Ivan showed a fairly constant ~99 %cpu in top, but the ratio of CPU time to run time is 45,764.33/64,179.97 = 71%.
ID: 38217
Jim1348
Joined: 15 Nov 14
Posts: 329
Credit: 10,752,912
RAC: 15,242
Message 38218 - Posted: 10 Mar 2019, 21:34:52 UTC - in response to Message 38217.  

What do you mean by "CPU load"? Do you mean top's %cpu, or do you mean the ratio of CPU time to run time? The only CMS task I've completed since the last update reported by Ivan showed a fairly constant ~99 %cpu in top, but the ratio of CPU time to run time is 45,764.33/64,179.97 = 71%.

I mean the latter; the ratio of cpu time to run time. It seems to me that it should be higher, but maybe it is normal for this project.
ID: 38218
Erich56
Joined: 18 Dec 15
Posts: 1127
Credit: 21,764,403
RAC: 28,941
Message 38219 - Posted: 11 Mar 2019, 5:55:49 UTC

Here too, the total run time versus CPU time ratio was not too good: 64,861.49 vs. 39,351.53 secs.
ID: 38219
Richie_unstable
Joined: 26 Oct 18
Posts: 33
Credit: 778,722
RAC: 12
Message 38220 - Posted: 11 Mar 2019, 7:01:37 UTC

My first task went 29,890.44 / 64,890.29 = 0.46. Feels like a task comes with a load of shielding gas built into it.
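
For anyone who wants to compare their own tasks, the ratio discussed in the last few posts is simply CPU time divided by run time. Here is a minimal sketch using the three pairs of figures reported above; the function name and the dictionary are purely illustrative, and the numbers are copied from the posts rather than fetched from the project's result pages.

```python
def cpu_efficiency(cpu_seconds: float, run_seconds: float) -> float:
    """Return the ratio of CPU time to wall-clock run time for one task."""
    if run_seconds <= 0:
        raise ValueError("run_seconds must be positive")
    return cpu_seconds / run_seconds


# (CPU time, run time) in seconds, as quoted by each poster in this thread.
reported = {
    "bronco": (45_764.33, 64_179.97),
    "Erich56": (39_351.53, 64_861.49),
    "Richie_unstable": (29_890.44, 64_890.29),
}

if __name__ == "__main__":
    for name, (cpu, run) in reported.items():
        print(f"{name}: {cpu_efficiency(cpu, run):.0%}")
```

The run times are all close to 18 hours (64,800 s), so the spread comes almost entirely from how much CPU time each task accumulated.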
ID: 38220


©2019 CERN