New version 263.80
Author | Message |
---|---|
Joined: 20 Jun 14 Posts: 380 Credit: 238,712 RAC: 0 |
Updated the CVMFS configuration for openhtc.io. |
Joined: 18 Dec 15 Posts: 1748 Credit: 115,253,557 RAC: 90,553 |
On one of my PCs, where I have two new Theory v263.80 tasks running (both 1-core), I made the following observation with one of the two tasks: total runtime: 14:51 hrs; total processor time: 6:01 hrs. For the other one, the processor time is close to the runtime. |
Joined: 9 Dec 14 Posts: 202 Credit: 2,533,875 RAC: 0 |
"total runtime: 14:51 hrs; total processor time: 6:01 hrs." Maybe a dead Sherpa job? |
Joined: 2 May 07 Posts: 2189 Credit: 173,020,582 RAC: 47,865 |
"total runtime: 14:51 hrs; total processor time: 6:01 hrs." "Maybe a dead Sherpa job?" I have often seen this with a Sherpa as the last job (looping, or unable to finish because of the 18-hour time limit). If it is possible to stop the looping from the project side, then a Sherpa can also be running as the first job! https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4044#28084 |
Joined: 18 Dec 15 Posts: 1748 Credit: 115,253,557 RAC: 90,553 |
Finally, the task finished after 18 hours. CPU time was 6 hours. Credit points: 132,03 :-( For details, please see here: https://lhcathome.cern.ch/lhcathome/result.php?resultid=206972998 |
Joined: 14 Jan 10 Posts: 1371 Credit: 9,130,392 RAC: 3,601 |
"Finally, the task got finished after 18 hours. CPU time was 6 hours. Credit points: 132,03 :-(" For 6 CPU-hours the credit seems to be OK, but it's not your fault that your machine was occupied for another 12 (idle) hours. Normally the VM should be shut down when no new job arrives within ~10 minutes. |
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0 |
For Pythia jobs, the time required to process the first 10K events can be used to extrapolate the time required for the entire run with surprising accuracy, at least so far. That means a babysitter script running on the host can decide, from the task time remaining and the forecast time to completion, whether to terminate the task gracefully or let it continue. Jobs for the Herwig, EPOS, Sherpa and Phojet generators come so infrequently, and my record keeping is so lacking, that I don't yet know whether time to completion can be extrapolated accurately from the first 10K events for them. But I'm working on it. Looping Sherpas are easy to detect and terminate gracefully; I have been doing it for months. Other features that are working, are helpful, and apply to LHCb as well: (1) if Condor doesn't start a new job within 10 minutes, gracefully terminate the task; (2) any job that starts after the 10-hour mark triggers graceful task termination (optional). There are features for ATLAS too. |
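The decision logic described in the post above could look roughly like the following. This is only a minimal sketch, not the poster's actual babysitter script: the names `JobProgress`, `forecast_remaining_s`, `should_terminate`, `idle_too_long` and `started_too_late` are hypothetical, the extrapolation is assumed to be linear in the event count, and how the event counts and timings are read from a running task is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the post; everything else here is an assumption.
CALIBRATION_EVENTS = 10_000      # "first 10K events"
IDLE_LIMIT_S = 10 * 60           # no new Condor job within 10 minutes
LATE_START_LIMIT_S = 10 * 3600   # optional rule: job starts after the 10-hour mark


@dataclass
class JobProgress:
    """Hypothetical snapshot of one generator job inside a task."""
    events_done: int
    events_total: int
    elapsed_s: float  # seconds since the job started


def forecast_remaining_s(job: JobProgress) -> Optional[float]:
    """Linearly extrapolate the remaining time from the observed event rate.

    Returns None until the calibration sample (first 10K events) is complete.
    """
    if job.events_done < CALIBRATION_EVENTS or job.elapsed_s <= 0:
        return None
    events_per_second = job.events_done / job.elapsed_s
    return (job.events_total - job.events_done) / events_per_second


def should_terminate(job: JobProgress, task_time_left_s: float,
                     margin_s: float = 0.0) -> bool:
    """True if the forecast (plus a safety margin) exceeds the task time left."""
    remaining = forecast_remaining_s(job)
    if remaining is None:
        return False  # not enough data yet, let the job continue
    return remaining + margin_s > task_time_left_s


def idle_too_long(idle_s: float) -> bool:
    """Rule (1): no new job has started for more than 10 minutes."""
    return idle_s > IDLE_LIMIT_S


def started_too_late(job_start_offset_s: float) -> bool:
    """Rule (2), optional: the job started after the 10-hour mark of the task."""
    return job_start_offset_s > LATE_START_LIMIT_S


if __name__ == "__main__":
    job = JobProgress(events_done=10_000, events_total=100_000, elapsed_s=1_000.0)
    print(forecast_remaining_s(job))
    print(should_terminate(job, task_time_left_s=8_000.0))
```

A linear forecast is the simplest choice consistent with the post; whether it holds for generators other than Pythia is exactly the open question raised there.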
Joined: 18 Dec 15 Posts: 1748 Credit: 115,253,557 RAC: 90,553 |
What caught my eye is that the credit points under version 263.80 are markedly lower than under 263.70 (same PC, same settings). 263.70: total runtime 46.518 secs; CPU time 44.684 secs; points: 1.645,80. 263.80: total runtime 65.219 secs; CPU time 64.027 secs; points: 153,57. Is there any logical explanation for this big discrepancy? |
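To put the two results above on a common scale, the credit can be normalised per CPU-hour. The sketch below assumes the figures use European number formatting (a period as thousands separator, a comma as decimal mark), so that "44.684 secs" means 44,684 seconds and "1.645,80" means 1645.80 points; the function name `credit_per_cpu_hour` is made up for illustration.

```python
def credit_per_cpu_hour(credit: float, cpu_seconds: float) -> float:
    """Credit granted per hour of CPU time."""
    return credit / (cpu_seconds / 3600.0)


# Figures from the post, read with European number formatting (assumption).
rate_263_70 = credit_per_cpu_hour(credit=1645.80, cpu_seconds=44_684)
rate_263_80 = credit_per_cpu_hour(credit=153.57, cpu_seconds=64_027)

print(f"263.70: {rate_263_70:.1f} credits per CPU-hour")
print(f"263.80: {rate_263_80:.1f} credits per CPU-hour")
print(f"ratio:  {rate_263_70 / rate_263_80:.1f}")
```

Under that reading, the per-CPU-hour rate for the 263.80 task is lower by a factor of roughly 15, more than the raw point totals alone suggest, because the 263.80 task also used more CPU time.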
Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0 |
No idea why, but you're lucky: mine seem to have dropped by a factor of 100! See e.g. hostid=10414406 |
Joined: 18 Dec 15 Posts: 1748 Credit: 115,253,557 RAC: 90,553 |
"No idea why, but you're lucky: mine seem to have dropped by a factor of 100!" OMG, this is really strange :-( It seems that credit points here are like a lottery! Someone should look into this, I guess. |
Joined: 16 Jul 05 Posts: 24 Credit: 35,251,537 RAC: 0 |
"No idea why, but you're lucky: mine seem to have dropped by a factor of 100!" ... and then this morning the credit rates have come back up again, but only by a factor of ten... |
Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0 |
It is a lottery, and it is strange that these tasks hand out credits so differently. ~300 up to ~32k is a big gap. |
Joined: 24 Oct 04 Posts: 1156 Credit: 52,567,052 RAC: 60,579 |
I have my usual 30+ Valids per day with this new version (glad the vdi was smaller than some we get). I also checked the tasks from another member who does lots of these tasks, and it looks the same there. The Valids do have about 60% less as far as credits, and once in a while we still get "[ERROR] Could not connect to Condor server on port .....", "[ERROR] Condor exited after 10164s without running a job" and the same old "VM Heartbeat file specified, but missing." So I hope there was something good about the Valids as far as actual work being done, and that this version is better in some way than the previous one. (I imagine those RACs will be dropping for everyone running this version.) I also noticed the CMS tasks are crashing, but I haven't tried any myself yet. (I just watched one of my Theory 2-core tasks finish, and the credit dropped about 1600 from the previous version, 263.70.) |
Joined: 2 May 07 Posts: 2189 Credit: 173,020,582 RAC: 47,865 |
MCProd isn't working well: lost ratio 100% http://mcplots-dev.cern.ch/production.php?view=status&plots=hourly#plots |
©2024 CERN