Message boards : Theory Application : Extreme Overload caused by a Theory Task
Joined: 15 Jun 08 Posts: 2519 Credit: 250,934,990 RAC: 127,970
Just killed a task that caused an extreme overload caused by lots of child processes. Could this be a configuration error? 08:02:55 (71261): wrapper (7.15.26016): starting 08:02:55 (71261): wrapper (7.15.26016): starting 08:02:55 (71261): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.31 () 08:02:55 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Detected Theory App 08:02:55 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Checking CVMFS. 08:03:04 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Checking runc. 08:03:04 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Creating the filesystem. 08:03:05 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3 08:03:05 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Updating config.json. 08:03:05 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Running Container 'runc'. 08:03:08 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] ===> [runRivet] Thu Feb 27 07:03:07 UTC 2020 [boinc pp zinclusive 7000 -,-,50,130 - madgraph5amc 2.6.2.atlas nlo2jet 100000 42] Output from pstree: runc(72593)─┬─job(72615)───runRivet.sh(72672)─┬─rivetvm.exe(74932) │ ├─rungen.sh(74931)───python(76440)───python(76512)─┬─ajob1(362)───madevent_mintMC(508) │ │ ├─ajob1(796)───madevent_mintMC(1001) │ │ ├─ajob1(5993)───madevent_mintMC(6482) │ │ ├─ajob1(6099)───madevent_mintMC(6790) │ │ ├─ajob1(8332)───madevent_mintMC(8587) │ │ ├─ajob1(8983)───madevent_mintMC(9247) │ │ ├─ajob1(11338)───madevent_mintMC(11651) │ │ ├─ajob1(11367)───madevent_mintMC(11593) │ │ ├─ajob1(12360)───madevent_mintMC(12644) │ │ ├─ajob1(12519)───madevent_mintMC(12767) │ │ ├─ajob1(13000)───madevent_mintMC(13308) │ │ ├─ajob1(13192)───madevent_mintMC(13377) │ │ ├─ajob1(13865)───madevent_mintMC(14117) │ │ ├─ajob1(14173)───madevent_mintMC(14419) │ │ ├─ajob1(14576)───madevent_mintMC(14937) │ │ ├─ajob1(16732)───madevent_mintMC(16967) │ │ ├─ajob1(16975)───madevent_mintMC(17237) │ │ ├─ajob1(18824)───madevent_mintMC(19014) │ │ ├─ajob1(38347)───madevent_mintMC(38702) │ │ ├─ajob1(40479)───madevent_mintMC(40723) │ │ ├─ajob1(74262)───madevent_mintMC(74533) │ │ ├─ajob1(77566)───madevent_mintMC(77734) │ │ ├─ajob1(99253)───madevent_mintMC(99488) │ │ ├─ajob1(101708)───madevent_mintMC(101995) │ │ ├─ajob1(102192)───madevent_mintMC(102431) │ │ ├─ajob1(103858)───madevent_mintMC(104120) │ │ ├─ajob1(107084)───madevent_mintMC(107228) │ │ ├─ajob1(117794)───madevent_mintMC(118021) │ │ ├─ajob1(118254)───madevent_mintMC(118547) │ │ ├─ajob1(121948)───madevent_mintMC(122287) │ │ ├─ajob1(129084)───madevent_mintMC(129321) │ │ ├─ajob1(130472)───madevent_mintMC(130704) │ │ ├─{python}(77931) │ │ ├─{python}(114243) │ │ ├─{python}(114244) │ │ ├─{python}(114245) │ │ ├─{python}(114246) │ │ ├─{python}(114247) │ │ ├─{python}(114248) │ │ ├─{python}(114250) │ │ ├─{python}(114252) │ │ ├─{python}(114253) │ │ ├─{python}(114254) │ │ ├─{python}(114255) │ │ ├─{python}(114256) │ │ ├─{python}(114257) │ │ ├─{python}(114258) │ │ ├─{python}(114259) │ │ ├─{python}(114261) │ │ ├─{python}(114262) │ │ ├─{python}(114267) │ │ ├─{python}(114268) │ │ ├─{python}(114269) │ │ ├─{python}(114272) │ │ ├─{python}(114274) │ │ ├─{python}(114276) │ │ ├─{python}(114283) │ │ ├─{python}(114284) │ │ ├─{python}(114285) │ │ ├─{python}(114286) │ │ ├─{python}(114287) │ │ ├─{python}(114288) │ │ ├─{python}(114289) │ │ ├─{python}(114291) │ │ ├─{python}(114294) │ │ ├─{python}(114295) │ │ ├─{python}(114296) │ │ ├─{python}(114297) │ │ ├─{python}(114298) │ │ ├─{python}(114300) │ │ ├─{python}(114301) │ │ ├─{python}(114302) │ │ ├─{python}(114303) │ │ ├─{python}(114304) │ │ 
├─{python}(114307) │ │ ├─{python}(114308) │ │ ├─{python}(114309) │ │ ├─{python}(114311) │ │ ├─{python}(114313) │ │ ├─{python}(114314) │ │ ├─{python}(114316) │ │ ├─{python}(114317) │ │ ├─{python}(114318) │ │ ├─{python}(114319) │ │ ├─{python}(114320) │ │ ├─{python}(114321) │ │ ├─{python}(114322) │ │ ├─{python}(114323) │ │ ├─{python}(114324) │ │ ├─{python}(114325) │ │ ├─{python}(114326) │ │ ├─{python}(114327) │ │ ├─{python}(114328) │ │ ├─{python}(114329) │ │ ├─{python}(114330) │ │ ├─{python}(114331) │ │ ├─{python}(114332) │ │ ├─{python}(12377) │ │ ├─{python}(12379) │ │ ├─{python}(12380) │ │ ├─{python}(12381) │ │ ├─{python}(12382) │ │ ├─{python}(12383) │ │ ├─{python}(12386) │ │ ├─{python}(12387) │ │ ├─{python}(12388) │ │ ├─{python}(12389) │ │ ├─{python}(12390) │ │ ├─{python}(12391) │ │ ├─{python}(12392) │ │ ├─{python}(12393) │ │ ├─{python}(12394) │ │ ├─{python}(12396) │ │ ├─{python}(12397) │ │ ├─{python}(12398) │ │ ├─{python}(12399) │ │ ├─{python}(12400) │ │ ├─{python}(12407) │ │ ├─{python}(12409) │ │ ├─{python}(12410) │ │ ├─{python}(12411) │ │ ├─{python}(12412) │ │ ├─{python}(12415) │ │ ├─{python}(12419) │ │ ├─{python}(12420) │ │ ├─{python}(12421) │ │ ├─{python}(12425) │ │ ├─{python}(12426) │ │ ├─{python}(12428) │ │ ├─{python}(76120) │ │ ├─{python}(76121) │ │ ├─{python}(76122) │ │ ├─{python}(76125) │ │ ├─{python}(76126) │ │ ├─{python}(76127) │ │ ├─{python}(76132) │ │ ├─{python}(76135) │ │ ├─{python}(76136) │ │ ├─{python}(76137) │ │ ├─{python}(76138) │ │ ├─{python}(76140) │ │ ├─{python}(76145) │ │ ├─{python}(76148) │ │ ├─{python}(76149) │ │ ├─{python}(76151) │ │ ├─{python}(76153) │ │ ├─{python}(76155) │ │ ├─{python}(76156) │ │ ├─{python}(76157) │ │ ├─{python}(76158) │ │ ├─{python}(76160) │ │ ├─{python}(76161) │ │ ├─{python}(76164) │ │ ├─{python}(76166) │ │ ├─{python}(76167) │ │ ├─{python}(76169) │ │ ├─{python}(76174) │ │ ├─{python}(76175) │ │ ├─{python}(76181) │ │ ├─{python}(76183) │ │ └─{python}(76185) │ └─sleep(19147) ├─{runc}(72596) ├─{runc}(72597) ├─{runc}(72598) ├─{runc}(72599) ├─{runc}(72600) ├─{runc}(72602) ├─{runc}(72607) └─{runc}(72627) |
Joined: 14 Jan 10 Posts: 1409 Credit: 9,325,730 RAC: 9,392
"Just killed a task that caused an extreme overload caused by lots of child processes."
Before the disk_bound extension this type of task caused exceeded-disk-limit errors. See also: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5266
168 of all tasks use this madgraph5amc generator, of which 73 have never succeeded so far.
Joined: 15 Jun 08 Posts: 2519 Credit: 250,934,990 RAC: 127,970
Got another one of those "mad" Theory tasks:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=269060049
ppbar zinclusive 1800 -,-,50,130 - madgraph5amc 2.6.0.atlas nlo2jet

The following line can be found in runRivet.log:
INFO: Generated 232 subprocesses with 36320 real emission diagrams, 2560 born diagrams and 47392 virtual diagrams

It looks like all subprocesses are running concurrently, which puts an extreme load on the host. Like other Theory tasks, this type should also respect the 1-core behavior and avoid running that many processes concurrently.
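A quick way to spot such a task while it is still running is to grep its runRivet.log for that summary line (a sketch; the slot number below is only an example):

# slot 2 is just an example - adjust the path to the slot the task actually occupies
grep -i 'generated .* subprocesses' /var/lib/boinc/slots/2/cernvm/shared/runRivet.log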
Joined: 15 Jun 08 Posts: 2519 Credit: 250,934,990 RAC: 127,970
Had to kill the next one. https://lhcathome.cern.ch/lhcathome/result.php?resultid=272649942 ===> [runRivet] Sat May 9 20:00:42 UTC 2020 [boinc pp zinclusive 7000 20,-,50,200 - madgraph5amc 2.6.5.atlas nlo2jet 100000 2] Output from pstree: runc(29966)─┬─job(29981)───runRivet.sh(30020)─┬─rivetvm.exe(35057) │ ├─rungen.sh(35056)───python(38664)───python(38994)─┬─ajob1(2788)───madevent_mintMC(3078) │ │ ├─ajob1(3053)───madevent_mintMC(3609) │ │ ├─ajob1(3802)───madevent_mintMC(4288) │ │ ├─ajob1(5711)───madevent_mintMC(6044) │ │ ├─ajob1(6618)───madevent_mintMC(6971) │ │ ├─ajob1(8404)───madevent_mintMC(8774) │ │ ├─ajob1(8425)───madevent_mintMC(8764) │ │ ├─ajob1(8859)───madevent_mintMC(9381) │ │ ├─ajob1(10892)───madevent_mintMC(11141) │ │ ├─ajob1(10980)───madevent_mintMC(11225) │ │ ├─ajob1(14602)───madevent_mintMC(15066) │ │ ├─ajob1(15041)───madevent_mintMC(15428) │ │ ├─ajob1(22774)───madevent_mintMC(23151) │ │ ├─ajob1(25830)───madevent_mintMC(26082) │ │ ├─ajob1(26749)───madevent_mintMC(26981) │ │ ├─ajob1(71074)───madevent_mintMC(71273) │ │ ├─ajob1(74652)───madevent_mintMC(75090) │ │ ├─ajob1(77463)───madevent_mintMC(77716) │ │ ├─ajob1(83957)───madevent_mintMC(84239) │ │ ├─ajob1(85033)───madevent_mintMC(85257) │ │ ├─ajob1(85063)───madevent_mintMC(85275) │ │ ├─ajob1(85411)───madevent_mintMC(85646) │ │ ├─ajob1(93345)───madevent_mintMC(93760) │ │ ├─ajob1(98350)───madevent_mintMC(98648) │ │ ├─ajob1(99279)───madevent_mintMC(99542) │ │ ├─ajob1(105494)───madevent_mintMC(105773) │ │ ├─ajob1(114285)───madevent_mintMC(114609) │ │ ├─ajob1(122438)───madevent_mintMC(122825) │ │ ├─ajob1(124210)───madevent_mintMC(124616) │ │ ├─ajob1(124648)───madevent_mintMC(124896) │ │ ├─ajob1(125653)───madevent_mintMC(125842) │ │ ├─ajob1(129154)───madevent_mintMC(129457) │ │ ├─{python}(42664) │ │ ├─{python}(57884) │ │ ├─{python}(57885) │ │ ├─{python}(57889) │ │ ├─{python}(57893) │ │ ├─{python}(57894) │ │ ├─{python}(57895) │ │ ├─{python}(57896) │ │ ├─{python}(57901) │ │ ├─{python}(57902) │ │ ├─{python}(57903) │ │ ├─{python}(57906) │ │ ├─{python}(57913) │ │ ├─{python}(57916) │ │ ├─{python}(57919) │ │ ├─{python}(57922) │ │ ├─{python}(57923) │ │ ├─{python}(57924) │ │ ├─{python}(57925) │ │ ├─{python}(57926) │ │ ├─{python}(57927) │ │ ├─{python}(57928) │ │ ├─{python}(57929) │ │ ├─{python}(57930) │ │ ├─{python}(57931) │ │ ├─{python}(57932) │ │ ├─{python}(57934) │ │ ├─{python}(57935) │ │ ├─{python}(57936) │ │ ├─{python}(57937) │ │ ├─{python}(57938) │ │ ├─{python}(57939) │ │ ├─{python}(57940) │ │ ├─{python}(81110) │ │ ├─{python}(81113) │ │ ├─{python}(81114) │ │ ├─{python}(81115) │ │ ├─{python}(81116) │ │ ├─{python}(81117) │ │ ├─{python}(81118) │ │ ├─{python}(81119) │ │ ├─{python}(81120) │ │ ├─{python}(81121) │ │ ├─{python}(81122) │ │ ├─{python}(81123) │ │ ├─{python}(81124) │ │ ├─{python}(81125) │ │ ├─{python}(81126) │ │ ├─{python}(81127) │ │ ├─{python}(81128) │ │ ├─{python}(81129) │ │ ├─{python}(81130) │ │ ├─{python}(81131) │ │ ├─{python}(81132) │ │ ├─{python}(81133) │ │ ├─{python}(81134) │ │ ├─{python}(81135) │ │ ├─{python}(81136) │ │ ├─{python}(81137) │ │ ├─{python}(81138) │ │ ├─{python}(81139) │ │ ├─{python}(81140) │ │ ├─{python}(81141) │ │ ├─{python}(81142) │ │ ├─{python}(81143) │ │ ├─{python}(130813) │ │ ├─{python}(130815) │ │ ├─{python}(130816) │ │ ├─{python}(130817) │ │ ├─{python}(130818) │ │ ├─{python}(130819) │ │ ├─{python}(130820) │ │ ├─{python}(130822) │ │ ├─{python}(130823) │ │ ├─{python}(130825) │ │ ├─{python}(130826) │ │ ├─{python}(130828) │ │ ├─{python}(130829) │ │ ├─{python}(130830) │ │ ├─{python}(130833) │ │ 
├─{python}(130835) │ │ ├─{python}(130836) │ │ ├─{python}(130837) │ │ ├─{python}(130838) │ │ ├─{python}(130839) │ │ ├─{python}(130840) │ │ ├─{python}(130841) │ │ ├─{python}(130842) │ │ ├─{python}(130844) │ │ ├─{python}(130845) │ │ ├─{python}(130846) │ │ ├─{python}(130848) │ │ ├─{python}(130849) │ │ ├─{python}(130850) │ │ ├─{python}(130851) │ │ ├─{python}(130852) │ │ └─{python}(130853) │ └─sleep(28012) ├─{runc}(29970) ├─{runc}(29971) ├─{runc}(29972) ├─{runc}(29973) ├─{runc}(29975) ├─{runc}(29977) ├─{runc}(29979) └─{runc}(29980) |
Joined: 17 Oct 06 Posts: 82 Credit: 56,784,235 RAC: 18,354
Could everything running concurrently be the reason that my computer crashes when it's running 32 Theory tasks at once?
Joined: 15 Jun 08 Posts: 2519 Credit: 250,934,990 RAC: 127,970
The error mentioned in this thread so far affected only madgraph tasks. Only 0.25 % of all Theory runs currently listed as "active" in mcplots are configured to use madgraph. I guess vbox tasks are not affected as the processes inside the VM see only 1 core. To get help regarding more general issues you may post a more detailed description, preferably in a fresh thread.
Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 547
It looks like all subprocesses are running concurrently which puts an extreme load on the host.I've stumbled on 272677122 which has ATM taken over all 8 cores on the host (but for a loadaverage ~8); so not overwhelming the host completely but certainly pushing out all other BOINC tasks. top reports the master python process is taking up 69.3% of the memory! ===> [runRivet] Sun May 10 16:33:34 UTC 2020 [boinc pp zinclusive 7000 20,-,50,200 - madgraph5amc 2.6.6.atlas nlo2jet 100000 2] > grep subprocess /var/lib/boinc/slots/2/cernvm/shared/runRivet.log INFO: Generated 16 subprocesses with 192 real emission diagrams, 32 born diagrams and 32 virtual diagrams INFO: Generated 48 subprocesses with 2944 real emission diagrams, 192 born diagrams and 1440 virtual diagrams INFO: Generated 232 subprocesses with 36320 real emission diagrams, 2560 born diagrams and 47392 virtual diagrams > pstree -c 28778 wrapper_2019_03─┬─cranky-0.0.31───runc─┬─job───runRivet.sh─┬─rivetvm.exe │ │ ├─rungen.sh───python───python─┬─ajob1───madevent_mintMC │ │ │ ├─ajob1───madevent_mintMC │ │ │ ├─ajob1───madevent_mintMC │ │ │ ├─ajob1───madevent_mintMC │ │ │ ├─ajob1───madevent_mintMC │ │ │ ├─ajob1───madevent_mintMC │ │ │ ├─ajob1───madevent_mintMC │ │ │ ├─ajob1───madevent_mintMC │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ └─{python} │ │ └─sleep │ ├─{runc} │ ├─{runc} │ ├─{runc} │ ├─{runc} │ ├─{runc} │ ├─{runc} │ ├─{runc} │ └─{runc} └─{wrapper_2019_03} Log file presently ends with NFO: Idle: 0, Running: 3, Completed: 444 [ 22m 15s ] INFO: Idle: 0, Running: 2, Completed: 445 [ 22m 15s ] INFO: Idle: 0, Running: 0, Completed: 447 [ 22m 15s ] INFO: Doing reweight INFO: Idle: 0, Running: 2, Completed: 445 [ current time: 13h32 ] INFO: Idle: 0, Running: 1, Completed: 446 [ 0.12s ] INFO: Idle: 0, Running: 0, Completed: 447 [ 0.74s ] INFO: Collecting eventsand doesn't give any clear indication on progress - it's completed 447 out of... 232? 2560? what? |
Joined: 15 Jun 08 Posts: 2519 Credit: 250,934,990 RAC: 127,970
I guess the BOINC client still treats the task as 1-core. Worst case (on an 8 core CPU) would be that BOINC starts 8 of them concurrently and the load average jumps to 8*8=64 (plus normal work). |
Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 547
"I guess the BOINC client still treats the task as 1-core."
Yes: boinccmd --get_tasks reported

name: Theory_2390-1152716-2_0
WU name: Theory_2390-1152716-2
project URL: https://lhcathome.cern.ch/lhcathome/
received: Sun May 10 13:17:48 2020
report deadline: Wed May 20 13:17:47 2020
ready to report: no
state: downloaded
scheduler state: scheduled
active_task_state: EXECUTING
app version num: 30005
resources: 1 CPU
estimated CPU time remaining: 1.650947
CPU time at last checkpoint: 0.000000
current CPU time: 20120.640000
fraction done: 0.999978
swap size: 8667 MB
working set size: 5558 MB

but it started (eventually) 8 active processes, and the BOINC client was sensible and didn't start any more tasks as the existing ones finished off, until the madgraph task completed:

15835: 11-May-2020 15:04:21 (low) [LHC@home] Starting task RAfMDmsbmrwnsSi4apGgGQJmABFKDmABFKDmBVrYDmABFKDmcy862n_0
15836: 11-May-2020 15:04:21 (low) [LHC@home] Starting task Theory_2390-1102868-2_0
15837: 11-May-2020 15:04:21 (low) [LHC@home] Starting task Theory_2390-1146431-2_1
15838: 11-May-2020 15:04:22 (low) [LHC@home] Starting task Theory_2390-1087074-2_0
15839: 11-May-2020 15:26:22 (low) [LHC@home] Computation for task Theory_2390-1152716-2_0 finished
15840: 11-May-2020 15:26:22 (low) [LHC@home] Starting task Theory_2390-1113717-2_0
15841: 11-May-2020 15:26:24 (low) [LHC@home] Started upload of Theory_2390-1152716-2_0_r1750715384_result
15842: 11-May-2020 15:26:29 (low) [LHC@home] Finished upload of Theory_2390-1152716-2_0_r1750715384_result

"Worst case (on an 8 core CPU) would be that BOINC starts 8 of them concurrently and the load average jumps to 8*8=64 (plus normal work)."
Maybe: I didn't check the load last night when it would still have been fighting with multi-core Atlas tasks. It looks like the BOINC client is trying to do the right thing, but - as you've pointed out below - the tasks themselves should be running the subtasks in series, not parallel.
Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 547
IMO it's also cheating us on the credit. For single-threaded 272136516:
02:58:17 (13728): cranky exited; CPU time 674129.550630
and 6,326.46 credit, i.e. 6.3k cr. for ~670k s CPU time. While for "multi-core" 272677122:
15:26:19 (28778): cranky exited; CPU time 447204.364571
and 657.71 credit, i.e. 0.7k cr. for 447k s CPU. I don't think my machines are a factor 5 different, but the run times are!
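For comparison, the credit per CPU second from the two figures above works out roughly like this (just a quick sketch of the arithmetic):

awk 'BEGIN {
    a = 6326.46 / 674129.55   # single-threaded 272136516
    b =  657.71 / 447204.36   # "multi-core" 272677122
    printf "%.5f cr/s vs %.5f cr/s, ratio ~%.1f\n", a, b, a/b
}'

i.e. the madgraph result earned only about a sixth of the credit per CPU second.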
Joined: 13 Jul 05 Posts: 169 Credit: 15,000,737 RAC: 547
Another one: 273087449. boinccmd --get_tasks reports 2) ----------- name: Theory_2390-1153380-3_0 WU name: Theory_2390-1153380-3 project URL: https://lhcathome.cern.ch/lhcathome/ received: Thu May 14 00:45:03 2020 report deadline: Sun May 24 00:45:02 2020 ready to report: no state: downloaded scheduler state: scheduled active_task_state: EXECUTING app version num: 30006 resources: 1 CPU estimated CPU time remaining: 0.017439 slot: 1 PID: 8741 CPU time at last checkpoint: 0.000000 current CPU time: 10407.640000 fraction done: 1.000000 swap size: 7842 MB working set size: 6124 MBand pstree -c 8741 reports wrapper_2019_03─┬─cranky-0.0.32───runc─┬─job───runRivet.sh─┬─rivetvm.exe │ │ ├─rungen.sh───python───python─┬─ajob1───madevent_mintMC │ │ │ ├─ajob1───madevent_mintMC │ │ │ ├─ajob1───madevent_mintMC │ │ │ ├─ajob1───madevent_mintMC │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ ├─{python} │ │ │ └─{python} │ │ └─sleep │ ├─{runc} │ ├─{runc} │ ├─{runc} │ ├─{runc} │ ├─{runc} │ ├─{runc} │ ├─{runc} │ ├─{runc} │ └─{runc} └─{wrapper_2019_03} Again looks like the BOINC client is trying to do the right thing by not starting any new tasks to keep the load down to 4, but - as you've pointed out below - the tasks themselves should be running the subtasks in series, not parallel. This task has been hogging the entire machine for 12+ hours now. |
Joined: 15 Jun 08 Posts: 2519 Credit: 250,934,990 RAC: 127,970
In your example, "wrapper_2019_03" is the process started by your BOINC client, and everything below "runc" is hidden in a container controlled by runc. Hence your BOINC client treats it as a 1-core task.
Other Theory tasks also run 2 main processes inside a runc container, rivetvm.exe and (e.g.) pythia, and some of them also cause a minor overload. But as the pythia output is used as rivetvm input, they automatically throttle each other.
In the case of madevent it looks like the scripts inside the container do their own test of the CPU capabilities and set up as many threads as cores are reported. I also don't see that the madevent processes get throttled.
It's a job for the team maintaining the scientific app to investigate and correct this behavior.
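To illustrate that last point (this is only a sketch, not the actual madevent code): a core-count probe executed inside the runc container still reports every host core, so a script that sizes its worker pool from such a probe will start one worker per host core, regardless of the 1 CPU budgeted by BOINC:

# illustrative only - not the project's code; run inside the runc container
nproc                              # prints the full host core count, e.g. 8
grep -c ^processor /proc/cpuinfo   # same figure, taken from /proc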
Joined: 14 Jan 10 Posts: 1409 Credit: 9,325,730 RAC: 9,392
For BOINC it seems OK, but not for science.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=273089431
===> [runRivet] Sat May 16 10:50:01 UTC 2020 [boinc pp zinclusive 7000 -,-,50,130 - madgraph5amc 2.7.2.atlas3 nlo1jet 100000 3]
after ~100 minutes run time on VBox:
Joined: 26 Nov 10 Posts: 11 Credit: 1,435,923 RAC: 0
Hi All,
Indeed, the madgraph code's default is to use all CPU cores. This has been corrected and the limit is now set to 2 cores max. (It will take a few days until new jobs arrive.)
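For reference, a cap like this is usually expressed via MadGraph5_aMC@NLO's multicore settings, roughly like the sketch below (the exact change in our scripts may differ):

# assumption: the cap is applied via MadGraph5_aMC@NLO's standard multicore
# options, e.g. in the me5_configuration.txt of the generated process folder
# run_mode: 0 = single core, 1 = cluster, 2 = multicore
run_mode = 2
# maximum number of worker processes used in multicore mode
nb_core = 2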
Joined: 15 Jun 08 Posts: 2519 Credit: 250,934,990 RAC: 127,970
Thanks a lot. |
Joined: 8 Dec 19 Posts: 37 Credit: 7,587,438 RAC: 0
I got one of these MadGraph tasks today. I was perplexed as to why BOINC was running just 1 native Theory task (as opposed to 8 concurrently) for no reason I could find; I'd never seen that before.

"Indeed, the madgraph code's default is to use all CPU cores."
It seems like it has not been corrected. Any way to tell how long it'll take to finish? Right now it's at 99.920% and 11h 42m elapsed time.
Joined: 15 Jun 08 Posts: 2519 Credit: 250,934,990 RAC: 127,970
Would be good to get more information. Please post a link to the computer it is running on.

If the task is still running, please post the output of the following command:
pstree -p $(awk '{print $2}' <(grep runc <(grep stderr.txt <(lsof +D <path_to_the_slot_number_the_task_is_running_in> 2>/dev/null))))

Replace "<path_to...>" with the path to the slot the task is running in.
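If the one-liner gives trouble, a simpler fall-back might also work, assuming only one native Theory task (and therefore only one runc process) is running on the host:

# fall-back: print the process tree of the oldest runc process (assumes a single native Theory task)
pstree -p "$(pgrep -xo runc)"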
Joined: 8 Dec 19 Posts: 37 Credit: 7,587,438 RAC: 0
The computer is: https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10816268 The task is (still running): https://lhcathome.cern.ch/lhcathome/result.php?resultid=368455572 The output is: init(1)─┬─init(10)─┬─automount(97)─┬─{automount}(98) │ │ ├─{automount}(99) │ │ └─{automount}(102) │ ├─boinc(135)─┬─wrapper_2019_03(360)─┬─cranky-0.0.32(376)───runc(6647)─┬─job(6771)───runRivet.sh(6912)─┬─rivetvm.exe(10046) │ │ │ │ │ ├─rungen.sh(10045)───python(10405)───python(10409)─┬─ajob1(2185)───madevent_mintMC(21+ │ │ │ │ │ │ ├─ajob1(2544)───madevent_mintMC(25+ │ │ │ │ │ │ ├─{python}(10476) │ │ │ │ │ │ ├─{python}(23250) │ │ │ │ │ │ ├─{python}(23251) │ │ │ │ │ │ ├─{python}(25059) │ │ │ │ │ │ ├─{python}(25060) │ │ │ │ │ │ ├─{python}(24020) │ │ │ │ │ │ └─{python}(24021) │ │ │ │ │ └─sleep(2746) │ │ │ │ ├─{runc}(6676) │ │ │ │ ├─{runc}(6678) │ │ │ │ ├─{runc}(6679) │ │ │ │ ├─{runc}(6687) │ │ │ │ ├─{runc}(6692) │ │ │ │ ├─{runc}(6706) │ │ │ │ ├─{runc}(6715) │ │ │ │ └─{runc}(6721) │ │ │ └─{wrapper_2019_03}(370) │ │ └─{boinc}(359) │ ├─cvmfs2(938) │ ├─cvmfs2(941) │ ├─cvmfs2(945)─┬─{cvmfs2}(950) │ │ ├─{cvmfs2}(951) │ │ ├─{cvmfs2}(952) │ │ ├─{cvmfs2}(953) │ │ ├─{cvmfs2}(954) │ │ ├─{cvmfs2}(955) │ │ ├─{cvmfs2}(956) │ │ ├─{cvmfs2}(957) │ │ ├─{cvmfs2}(958) │ │ ├─{cvmfs2}(959) │ │ ├─{cvmfs2}(960) │ │ ├─{cvmfs2}(961) │ │ ├─{cvmfs2}(962) │ │ ├─{cvmfs2}(963) │ │ ├─{cvmfs2}(964) │ │ ├─{cvmfs2}(965) │ │ ├─{cvmfs2}(966) │ │ ├─{cvmfs2}(967) │ │ ├─{cvmfs2}(968) │ │ └─{cvmfs2}(971) │ ├─cvmfs2(949) │ ├─cvmfs2(1120)─┬─{cvmfs2}(1125) │ │ ├─{cvmfs2}(1126) │ │ ├─{cvmfs2}(1127) │ │ ├─{cvmfs2}(1128) │ │ ├─{cvmfs2}(1129) │ │ ├─{cvmfs2}(1130) │ │ ├─{cvmfs2}(1131) │ │ ├─{cvmfs2}(1132) │ │ ├─{cvmfs2}(1133) │ │ ├─{cvmfs2}(1134) │ │ ├─{cvmfs2}(10200) │ │ ├─{cvmfs2}(10248) │ │ ├─{cvmfs2}(10311) │ │ ├─{cvmfs2}(10313) │ │ ├─{cvmfs2}(10315) │ │ ├─{cvmfs2}(10407) │ │ ├─{cvmfs2}(10408) │ │ ├─{cvmfs2}(10412) │ │ ├─{cvmfs2}(10422) │ │ └─{cvmfs2}(10427) │ ├─cvmfs2(1124) │ ├─cvmfs2(2477)─┬─{cvmfs2}(2482) │ │ ├─{cvmfs2}(2483) │ │ ├─{cvmfs2}(2484) │ │ ├─{cvmfs2}(2485) │ │ ├─{cvmfs2}(2486) │ │ ├─{cvmfs2}(2487) │ │ ├─{cvmfs2}(2488) │ │ ├─{cvmfs2}(2489) │ │ ├─{cvmfs2}(2490) │ │ ├─{cvmfs2}(2491) │ │ ├─{cvmfs2}(2492) │ │ ├─{cvmfs2}(2494) │ │ ├─{cvmfs2}(2495) │ │ ├─{cvmfs2}(2496) │ │ ├─{cvmfs2}(2497) │ │ ├─{cvmfs2}(2498) │ │ ├─{cvmfs2}(2499) │ │ ├─{cvmfs2}(2500) │ │ ├─{cvmfs2}(2501) │ │ └─{cvmfs2}(6094) │ ├─cvmfs2(2481) │ ├─cvmfs2(3828)─┬─{cvmfs2}(3833) │ │ ├─{cvmfs2}(3834) │ │ ├─{cvmfs2}(3835) │ │ ├─{cvmfs2}(3836) │ │ ├─{cvmfs2}(3837) │ │ ├─{cvmfs2}(3838) │ │ ├─{cvmfs2}(3839) │ │ ├─{cvmfs2}(3840) │ │ ├─{cvmfs2}(3841) │ │ ├─{cvmfs2}(3842) │ │ ├─{cvmfs2}(3843) │ │ ├─{cvmfs2}(3844) │ │ ├─{cvmfs2}(3845) │ │ ├─{cvmfs2}(3846) │ │ ├─{cvmfs2}(3847) │ │ ├─{cvmfs2}(3848) │ │ ├─{cvmfs2}(3849) │ │ ├─{cvmfs2}(3850) │ │ ├─{cvmfs2}(3851) │ │ └─{cvmfs2}(6841) │ ├─cvmfs2(3832) │ └─squid(77)───squid(81)─┬─pinger(110) │ ├─{squid}(10183) │ ├─{squid}(10184) │ ├─{squid}(10185) │ ├─{squid}(10186) │ ├─{squid}(10187) │ ├─{squid}(10188) │ ├─{squid}(10189) │ ├─{squid}(10190) │ ├─{squid}(10191) │ ├─{squid}(10192) │ ├─{squid}(10193) │ ├─{squid}(10194) │ ├─{squid}(10195) │ ├─{squid}(10196) │ ├─{squid}(10197) │ └─{squid}(10198) ├─init(30523)───init(30524)───bash(30525)───pstree(2755) └─{init}(7) |
Joined: 15 Jun 08 Posts: 2519 Credit: 250,934,990 RAC: 127,970
OK. Looks like the one-liner didn't return the expected output.

If you are currently running only 1 Theory task, you'll get its process tree with a reduced command:
pstree -p pid

From your last post, pid might be 6647. Please post what you get from this command:
pstree -p 6647

What I'd like to check is how many madgraph processes are in this (sub)tree. They should be limited to 2, but at least not more than 8 (see below).

Your "host" appears to be an 8-core WSL guest on a Windows computer. The i7-4790 CPU also reports 8 cores. It is usually not recommended to give all available cores to a single guest.
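If you just want to count the madevent workers in that subtree, something like this should do (using the runc PID from your output; adjust it if the task has restarted):

# counts the lines containing a madevent worker in that subtree (6647 = runc PID from above)
pstree -lp 6647 | grep -c madevent_mintMC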
Joined: 8 Dec 19 Posts: 37 Credit: 7,587,438 RAC: 0
Yes, BOINC is only running 1 Theory task even though it should be running 8. The output is:

runc(6647)─┬─job(6771)───runRivet.sh(6912)─┬─rivetvm.exe(10046)
           │                               ├─rungen.sh(10045)───python(10405)───python(10409)─┬─ajob1(8326)───madevent_mintMC(8335)
           │                               │                                                  ├─ajob1(8351)───madevent_mintMC(8360)
           │                               │                                                  ├─{python}(10476)
           │                               │                                                  ├─{python}(23250)
           │                               │                                                  ├─{python}(23251)
           │                               │                                                  ├─{python}(25059)
           │                               │                                                  ├─{python}(25060)
           │                               │                                                  ├─{python}(24020)
           │                               │                                                  └─{python}(24021)
           │                               └─sleep(8418)
           ├─{runc}(6676)
           ├─{runc}(6678)
           ├─{runc}(6679)
           ├─{runc}(6687)
           ├─{runc}(6692)
           ├─{runc}(6706)
           ├─{runc}(6715)
           └─{runc}(6721)

Are the 2 madevent_mintMC entries at the far right near the top of the tree what you're looking for? They were a bit truncated in the first output.
I haven't had issues with stability or noticeable slowdowns from allowing WSL to use all cores, except when RAM fills up, e.g. when running a bunch of concurrent ATLAS tasks. I pretty much only use the i7 PC for BOINC.