Message boards : Theory Application : Extreme Overload caused by a Theory Task
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1428
Credit: 73,025,506
RAC: 105,776
Message 41736 - Posted: 27 Feb 2020, 12:35:00 UTC

Just killed a task that put an extreme load on the host by spawning lots of child processes.
Could this be a configuration error?


08:02:55 (71261): wrapper (7.15.26016): starting
08:02:55 (71261): wrapper (7.15.26016): starting
08:02:55 (71261): wrapper: running ../../projects/lhcathome.cern.ch_lhcathome/cranky-0.0.31 ()
08:02:55 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Detected Theory App
08:02:55 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Checking CVMFS.
08:03:04 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Checking runc.
08:03:04 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Creating the filesystem.
08:03:05 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
08:03:05 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Updating config.json.
08:03:05 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] Running Container 'runc'.
08:03:08 CET +01:00 2020-02-27: cranky-0.0.31: [INFO] ===> [runRivet] Thu Feb 27 07:03:07 UTC 2020 [boinc pp zinclusive 7000 -,-,50,130 - madgraph5amc 2.6.2.atlas nlo2jet 100000 42]




Output from pstree:
runc(72593)─┬─job(72615)───runRivet.sh(72672)─┬─rivetvm.exe(74932)
            │                                 ├─rungen.sh(74931)───python(76440)───python(76512)─┬─ajob1(362)───madevent_mintMC(508)
            │                                 │                                                  ├─ajob1(796)───madevent_mintMC(1001)
            │                                 │                                                  ├─ajob1(5993)───madevent_mintMC(6482)
            │                                 │                                                  ├─ajob1(6099)───madevent_mintMC(6790)
            │                                 │                                                  ├─ajob1(8332)───madevent_mintMC(8587)
            │                                 │                                                  ├─ajob1(8983)───madevent_mintMC(9247)
            │                                 │                                                  ├─ajob1(11338)───madevent_mintMC(11651)
            │                                 │                                                  ├─ajob1(11367)───madevent_mintMC(11593)
            │                                 │                                                  ├─ajob1(12360)───madevent_mintMC(12644)
            │                                 │                                                  ├─ajob1(12519)───madevent_mintMC(12767)
            │                                 │                                                  ├─ajob1(13000)───madevent_mintMC(13308)
            │                                 │                                                  ├─ajob1(13192)───madevent_mintMC(13377)
            │                                 │                                                  ├─ajob1(13865)───madevent_mintMC(14117)
            │                                 │                                                  ├─ajob1(14173)───madevent_mintMC(14419)
            │                                 │                                                  ├─ajob1(14576)───madevent_mintMC(14937)
            │                                 │                                                  ├─ajob1(16732)───madevent_mintMC(16967)
            │                                 │                                                  ├─ajob1(16975)───madevent_mintMC(17237)
            │                                 │                                                  ├─ajob1(18824)───madevent_mintMC(19014)
            │                                 │                                                  ├─ajob1(38347)───madevent_mintMC(38702)
            │                                 │                                                  ├─ajob1(40479)───madevent_mintMC(40723)
            │                                 │                                                  ├─ajob1(74262)───madevent_mintMC(74533)
            │                                 │                                                  ├─ajob1(77566)───madevent_mintMC(77734)
            │                                 │                                                  ├─ajob1(99253)───madevent_mintMC(99488)
            │                                 │                                                  ├─ajob1(101708)───madevent_mintMC(101995)
            │                                 │                                                  ├─ajob1(102192)───madevent_mintMC(102431)
            │                                 │                                                  ├─ajob1(103858)───madevent_mintMC(104120)
            │                                 │                                                  ├─ajob1(107084)───madevent_mintMC(107228)
            │                                 │                                                  ├─ajob1(117794)───madevent_mintMC(118021)
            │                                 │                                                  ├─ajob1(118254)───madevent_mintMC(118547)
            │                                 │                                                  ├─ajob1(121948)───madevent_mintMC(122287)
            │                                 │                                                  ├─ajob1(129084)───madevent_mintMC(129321)
            │                                 │                                                  ├─ajob1(130472)───madevent_mintMC(130704)
            │                                 │                                                  ├─{python}(77931)
            │                                 │                                                  ├─{python}(114243)
            │                                 │                                                  ├─{python}(114244)
            │                                 │                                                  ├─{python}(114245)
            │                                 │                                                  ├─{python}(114246)
            │                                 │                                                  ├─{python}(114247)
            │                                 │                                                  ├─{python}(114248)
            │                                 │                                                  ├─{python}(114250)
            │                                 │                                                  ├─{python}(114252)
            │                                 │                                                  ├─{python}(114253)
            │                                 │                                                  ├─{python}(114254)
            │                                 │                                                  ├─{python}(114255)
            │                                 │                                                  ├─{python}(114256)
            │                                 │                                                  ├─{python}(114257)
            │                                 │                                                  ├─{python}(114258)
            │                                 │                                                  ├─{python}(114259)
            │                                 │                                                  ├─{python}(114261)
            │                                 │                                                  ├─{python}(114262)
            │                                 │                                                  ├─{python}(114267)
            │                                 │                                                  ├─{python}(114268)
            │                                 │                                                  ├─{python}(114269)
            │                                 │                                                  ├─{python}(114272)
            │                                 │                                                  ├─{python}(114274)
            │                                 │                                                  ├─{python}(114276)
            │                                 │                                                  ├─{python}(114283)
            │                                 │                                                  ├─{python}(114284)
            │                                 │                                                  ├─{python}(114285)
            │                                 │                                                  ├─{python}(114286)
            │                                 │                                                  ├─{python}(114287)
            │                                 │                                                  ├─{python}(114288)
            │                                 │                                                  ├─{python}(114289)
            │                                 │                                                  ├─{python}(114291)
            │                                 │                                                  ├─{python}(114294)
            │                                 │                                                  ├─{python}(114295)
            │                                 │                                                  ├─{python}(114296)
            │                                 │                                                  ├─{python}(114297)
            │                                 │                                                  ├─{python}(114298)
            │                                 │                                                  ├─{python}(114300)
            │                                 │                                                  ├─{python}(114301)
            │                                 │                                                  ├─{python}(114302)
            │                                 │                                                  ├─{python}(114303)
            │                                 │                                                  ├─{python}(114304)
            │                                 │                                                  ├─{python}(114307)
            │                                 │                                                  ├─{python}(114308)
            │                                 │                                                  ├─{python}(114309)
            │                                 │                                                  ├─{python}(114311)
            │                                 │                                                  ├─{python}(114313)
            │                                 │                                                  ├─{python}(114314)
            │                                 │                                                  ├─{python}(114316)
            │                                 │                                                  ├─{python}(114317)
            │                                 │                                                  ├─{python}(114318)
            │                                 │                                                  ├─{python}(114319)
            │                                 │                                                  ├─{python}(114320)
            │                                 │                                                  ├─{python}(114321)
            │                                 │                                                  ├─{python}(114322)
            │                                 │                                                  ├─{python}(114323)
            │                                 │                                                  ├─{python}(114324)
            │                                 │                                                  ├─{python}(114325)
            │                                 │                                                  ├─{python}(114326)
            │                                 │                                                  ├─{python}(114327)
            │                                 │                                                  ├─{python}(114328)
            │                                 │                                                  ├─{python}(114329)
            │                                 │                                                  ├─{python}(114330)
            │                                 │                                                  ├─{python}(114331)
            │                                 │                                                  ├─{python}(114332)
            │                                 │                                                  ├─{python}(12377)
            │                                 │                                                  ├─{python}(12379)
            │                                 │                                                  ├─{python}(12380)
            │                                 │                                                  ├─{python}(12381)
            │                                 │                                                  ├─{python}(12382)
            │                                 │                                                  ├─{python}(12383)
            │                                 │                                                  ├─{python}(12386)
            │                                 │                                                  ├─{python}(12387)
            │                                 │                                                  ├─{python}(12388)
            │                                 │                                                  ├─{python}(12389)
            │                                 │                                                  ├─{python}(12390)
            │                                 │                                                  ├─{python}(12391)
            │                                 │                                                  ├─{python}(12392)
            │                                 │                                                  ├─{python}(12393)
            │                                 │                                                  ├─{python}(12394)
            │                                 │                                                  ├─{python}(12396)
            │                                 │                                                  ├─{python}(12397)
            │                                 │                                                  ├─{python}(12398)
            │                                 │                                                  ├─{python}(12399)
            │                                 │                                                  ├─{python}(12400)
            │                                 │                                                  ├─{python}(12407)
            │                                 │                                                  ├─{python}(12409)
            │                                 │                                                  ├─{python}(12410)
            │                                 │                                                  ├─{python}(12411)
            │                                 │                                                  ├─{python}(12412)
            │                                 │                                                  ├─{python}(12415)
            │                                 │                                                  ├─{python}(12419)
            │                                 │                                                  ├─{python}(12420)
            │                                 │                                                  ├─{python}(12421)
            │                                 │                                                  ├─{python}(12425)
            │                                 │                                                  ├─{python}(12426)
            │                                 │                                                  ├─{python}(12428)
            │                                 │                                                  ├─{python}(76120)
            │                                 │                                                  ├─{python}(76121)
            │                                 │                                                  ├─{python}(76122)
            │                                 │                                                  ├─{python}(76125)
            │                                 │                                                  ├─{python}(76126)
            │                                 │                                                  ├─{python}(76127)
            │                                 │                                                  ├─{python}(76132)
            │                                 │                                                  ├─{python}(76135)
            │                                 │                                                  ├─{python}(76136)
            │                                 │                                                  ├─{python}(76137)
            │                                 │                                                  ├─{python}(76138)
            │                                 │                                                  ├─{python}(76140)
            │                                 │                                                  ├─{python}(76145)
            │                                 │                                                  ├─{python}(76148)
            │                                 │                                                  ├─{python}(76149)
            │                                 │                                                  ├─{python}(76151)
            │                                 │                                                  ├─{python}(76153)
            │                                 │                                                  ├─{python}(76155)
            │                                 │                                                  ├─{python}(76156)
            │                                 │                                                  ├─{python}(76157)
            │                                 │                                                  ├─{python}(76158)
            │                                 │                                                  ├─{python}(76160)
            │                                 │                                                  ├─{python}(76161)
            │                                 │                                                  ├─{python}(76164)
            │                                 │                                                  ├─{python}(76166)
            │                                 │                                                  ├─{python}(76167)
            │                                 │                                                  ├─{python}(76169)
            │                                 │                                                  ├─{python}(76174)
            │                                 │                                                  ├─{python}(76175)
            │                                 │                                                  ├─{python}(76181)
            │                                 │                                                  ├─{python}(76183)
            │                                 │                                                  └─{python}(76185)
            │                                 └─sleep(19147)
            ├─{runc}(72596)
            ├─{runc}(72597)
            ├─{runc}(72598)
            ├─{runc}(72599)
            ├─{runc}(72600)
            ├─{runc}(72602)
            ├─{runc}(72607)
            └─{runc}(72627)
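The listing above shows 32 ajob1 processes, each with one madevent_mintMC child. Counting by hand gets tedious, so here is a rough sketch of how such a tally could be scripted. It assumes output in the `pstree -p` layout shown above, where thread entries are wrapped in braces; the function name `tally_pstree` and the shortened `SAMPLE` text are made up for illustration.

```python
import re
from collections import Counter

# Matches "name(pid)" entries; pstree wraps thread names in braces, e.g. {python}(77931).
ENTRY = re.compile(r"(\{?[\w.\-]+\}?)\((\d+)\)")


def tally_pstree(text):
    """Count processes and threads by name in `pstree -p` style output."""
    processes, threads = Counter(), Counter()
    for name, _pid in ENTRY.findall(text):
        if name.startswith("{") and name.endswith("}"):
            threads[name[1:-1]] += 1  # thread of the named process
        else:
            processes[name] += 1      # regular process
    return processes, threads


# Shortened excerpt of the tree above, for illustration only.
SAMPLE = """\
runc(72593)─┬─job(72615)───runRivet.sh(72672)─┬─rivetvm.exe(74932)
            │                                 ├─rungen.sh(74931)───python(76440)───python(76512)─┬─ajob1(362)───madevent_mintMC(508)
            │                                 │                                                  ├─ajob1(796)───madevent_mintMC(1001)
            │                                 │                                                  ├─{python}(77931)
            │                                 │                                                  └─{python}(76185)
            │                                 └─sleep(19147)
            ├─{runc}(72596)
            └─{runc}(72627)
"""

processes, threads = tally_pstree(SAMPLE)
print("ajob1:", processes["ajob1"])
print("madevent_mintMC:", processes["madevent_mintMC"])
print("python threads:", threads["python"])
```

Feeding the full pstree text into `tally_pstree` instead of the excerpt gives the per-name totals for the whole container.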
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 934
Credit: 6,286,290
RAC: 664
Message 41737 - Posted: 27 Feb 2020, 13:44:25 UTC - in response to Message 41736.  

computezrmle wrote:
Just killed a task that put an extreme load on the host by spawning lots of child processes.
Could this be a configuration error?

Before disk_bound was extended, tasks of this type exceeded the disk limit.
See also: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5266

Of all tasks, 168 use this madgraph5amc generator, and 73 of those have never succeeded so far.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1428
Credit: 73,025,506
RAC: 105,776
Message 41986 - Posted: 23 Mar 2020, 7:58:30 UTC

Got one of those "mad" Theory tasks again:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=269060049
ppbar zinclusive 1800 -,-,50,130 - madgraph5amc 2.6.0.atlas nlo2jet

The following line can be found in runRivet.log:
INFO: Generated 232 subprocesses with 36320 real emission diagrams, 2560 born diagrams and 47392 virtual diagrams

It looks like all subprocesses are running concurrently, which puts an extreme load on the host.

Like other Theory tasks, this type should respect the single-core behavior and avoid running that many processes concurrently.
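To illustrate what "respect the single-core behavior" would mean in practice, here is a minimal sketch of a scheduler that caps how many jobs are active at once. It is not how MadGraph or the Theory app actually schedules its subprocesses. MadGraph5_aMC@NLO has its own run_mode and nb_core settings, and nothing in this thread shows how they are set for these tasks. The names `ConcurrencyMeter` and `run_jobs` are hypothetical, and the jobs are just short sleeps standing in for real work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyMeter:
    """Records the highest number of jobs that were active at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        return self

    def __exit__(self, *exc_info):
        with self._lock:
            self.active -= 1
        return False  # do not swallow exceptions


def run_jobs(job_count, max_parallel, meter):
    """Run job_count stand-in jobs with at most max_parallel active at once."""

    def job(index):
        with meter:
            time.sleep(0.01)  # stand-in for one subprocess worth of work
            return index

    # The pool size is the cap: jobs beyond it wait in the queue.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(job, range(job_count)))


meter = ConcurrencyMeter()
results = run_jobs(job_count=8, max_parallel=1, meter=meter)
print("finished jobs:", results)
print("peak concurrency:", meter.peak)
```

With `max_parallel=1` the jobs run strictly one after another, which is the behavior expected from a single-core task. Raising the cap to the number of detected cores would match the pattern seen in the process trees.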
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1428
Credit: 73,025,506
RAC: 105,776
Message 42393 - Posted: 10 May 2020, 6:43:00 UTC

Had to kill the next one.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=272649942
===> [runRivet] Sat May  9 20:00:42 UTC 2020 [boinc pp zinclusive 7000 20,-,50,200 - madgraph5amc 2.6.5.atlas nlo2jet 100000 2]


Output from pstree:
runc(29966)─┬─job(29981)───runRivet.sh(30020)─┬─rivetvm.exe(35057)
            │                                 ├─rungen.sh(35056)───python(38664)───python(38994)─┬─ajob1(2788)───madevent_mintMC(3078)
            │                                 │                                                  ├─ajob1(3053)───madevent_mintMC(3609)
            │                                 │                                                  ├─ajob1(3802)───madevent_mintMC(4288)
            │                                 │                                                  ├─ajob1(5711)───madevent_mintMC(6044)
            │                                 │                                                  ├─ajob1(6618)───madevent_mintMC(6971)
            │                                 │                                                  ├─ajob1(8404)───madevent_mintMC(8774)
            │                                 │                                                  ├─ajob1(8425)───madevent_mintMC(8764)
            │                                 │                                                  ├─ajob1(8859)───madevent_mintMC(9381)
            │                                 │                                                  ├─ajob1(10892)───madevent_mintMC(11141)
            │                                 │                                                  ├─ajob1(10980)───madevent_mintMC(11225)
            │                                 │                                                  ├─ajob1(14602)───madevent_mintMC(15066)
            │                                 │                                                  ├─ajob1(15041)───madevent_mintMC(15428)
            │                                 │                                                  ├─ajob1(22774)───madevent_mintMC(23151)
            │                                 │                                                  ├─ajob1(25830)───madevent_mintMC(26082)
            │                                 │                                                  ├─ajob1(26749)───madevent_mintMC(26981)
            │                                 │                                                  ├─ajob1(71074)───madevent_mintMC(71273)
            │                                 │                                                  ├─ajob1(74652)───madevent_mintMC(75090)
            │                                 │                                                  ├─ajob1(77463)───madevent_mintMC(77716)
            │                                 │                                                  ├─ajob1(83957)───madevent_mintMC(84239)
            │                                 │                                                  ├─ajob1(85033)───madevent_mintMC(85257)
            │                                 │                                                  ├─ajob1(85063)───madevent_mintMC(85275)
            │                                 │                                                  ├─ajob1(85411)───madevent_mintMC(85646)
            │                                 │                                                  ├─ajob1(93345)───madevent_mintMC(93760)
            │                                 │                                                  ├─ajob1(98350)───madevent_mintMC(98648)
            │                                 │                                                  ├─ajob1(99279)───madevent_mintMC(99542)
            │                                 │                                                  ├─ajob1(105494)───madevent_mintMC(105773)
            │                                 │                                                  ├─ajob1(114285)───madevent_mintMC(114609)
            │                                 │                                                  ├─ajob1(122438)───madevent_mintMC(122825)
            │                                 │                                                  ├─ajob1(124210)───madevent_mintMC(124616)
            │                                 │                                                  ├─ajob1(124648)───madevent_mintMC(124896)
            │                                 │                                                  ├─ajob1(125653)───madevent_mintMC(125842)
            │                                 │                                                  ├─ajob1(129154)───madevent_mintMC(129457)
            │                                 │                                                  ├─{python}(42664)
            │                                 │                                                  ├─{python}(57884)
            │                                 │                                                  ├─{python}(57885)
            │                                 │                                                  ├─{python}(57889)
            │                                 │                                                  ├─{python}(57893)
            │                                 │                                                  ├─{python}(57894)
            │                                 │                                                  ├─{python}(57895)
            │                                 │                                                  ├─{python}(57896)
            │                                 │                                                  ├─{python}(57901)
            │                                 │                                                  ├─{python}(57902)
            │                                 │                                                  ├─{python}(57903)
            │                                 │                                                  ├─{python}(57906)
            │                                 │                                                  ├─{python}(57913)
            │                                 │                                                  ├─{python}(57916)
            │                                 │                                                  ├─{python}(57919)
            │                                 │                                                  ├─{python}(57922)
            │                                 │                                                  ├─{python}(57923)
            │                                 │                                                  ├─{python}(57924)
            │                                 │                                                  ├─{python}(57925)
            │                                 │                                                  ├─{python}(57926)
            │                                 │                                                  ├─{python}(57927)
            │                                 │                                                  ├─{python}(57928)
            │                                 │                                                  ├─{python}(57929)
            │                                 │                                                  ├─{python}(57930)
            │                                 │                                                  ├─{python}(57931)
            │                                 │                                                  ├─{python}(57932)
            │                                 │                                                  ├─{python}(57934)
            │                                 │                                                  ├─{python}(57935)
            │                                 │                                                  ├─{python}(57936)
            │                                 │                                                  ├─{python}(57937)
            │                                 │                                                  ├─{python}(57938)
            │                                 │                                                  ├─{python}(57939)
            │                                 │                                                  ├─{python}(57940)
            │                                 │                                                  ├─{python}(81110)
            │                                 │                                                  ├─{python}(81113)
            │                                 │                                                  ├─{python}(81114)
            │                                 │                                                  ├─{python}(81115)
            │                                 │                                                  ├─{python}(81116)
            │                                 │                                                  ├─{python}(81117)
            │                                 │                                                  ├─{python}(81118)
            │                                 │                                                  ├─{python}(81119)
            │                                 │                                                  ├─{python}(81120)
            │                                 │                                                  ├─{python}(81121)
            │                                 │                                                  ├─{python}(81122)
            │                                 │                                                  ├─{python}(81123)
            │                                 │                                                  ├─{python}(81124)
            │                                 │                                                  ├─{python}(81125)
            │                                 │                                                  ├─{python}(81126)
            │                                 │                                                  ├─{python}(81127)
            │                                 │                                                  ├─{python}(81128)
            │                                 │                                                  ├─{python}(81129)
            │                                 │                                                  ├─{python}(81130)
            │                                 │                                                  ├─{python}(81131)
            │                                 │                                                  ├─{python}(81132)
            │                                 │                                                  ├─{python}(81133)
            │                                 │                                                  ├─{python}(81134)
            │                                 │                                                  ├─{python}(81135)
            │                                 │                                                  ├─{python}(81136)
            │                                 │                                                  ├─{python}(81137)
            │                                 │                                                  ├─{python}(81138)
            │                                 │                                                  ├─{python}(81139)
            │                                 │                                                  ├─{python}(81140)
            │                                 │                                                  ├─{python}(81141)
            │                                 │                                                  ├─{python}(81142)
            │                                 │                                                  ├─{python}(81143)
            │                                 │                                                  ├─{python}(130813)
            │                                 │                                                  ├─{python}(130815)
            │                                 │                                                  ├─{python}(130816)
            │                                 │                                                  ├─{python}(130817)
            │                                 │                                                  ├─{python}(130818)
            │                                 │                                                  ├─{python}(130819)
            │                                 │                                                  ├─{python}(130820)
            │                                 │                                                  ├─{python}(130822)
            │                                 │                                                  ├─{python}(130823)
            │                                 │                                                  ├─{python}(130825)
            │                                 │                                                  ├─{python}(130826)
            │                                 │                                                  ├─{python}(130828)
            │                                 │                                                  ├─{python}(130829)
            │                                 │                                                  ├─{python}(130830)
            │                                 │                                                  ├─{python}(130833)
            │                                 │                                                  ├─{python}(130835)
            │                                 │                                                  ├─{python}(130836)
            │                                 │                                                  ├─{python}(130837)
            │                                 │                                                  ├─{python}(130838)
            │                                 │                                                  ├─{python}(130839)
            │                                 │                                                  ├─{python}(130840)
            │                                 │                                                  ├─{python}(130841)
            │                                 │                                                  ├─{python}(130842)
            │                                 │                                                  ├─{python}(130844)
            │                                 │                                                  ├─{python}(130845)
            │                                 │                                                  ├─{python}(130846)
            │                                 │                                                  ├─{python}(130848)
            │                                 │                                                  ├─{python}(130849)
            │                                 │                                                  ├─{python}(130850)
            │                                 │                                                  ├─{python}(130851)
            │                                 │                                                  ├─{python}(130852)
            │                                 │                                                  └─{python}(130853)
            │                                 └─sleep(28012)
            ├─{runc}(29970)
            ├─{runc}(29971)
            ├─{runc}(29972)
            ├─{runc}(29973)
            ├─{runc}(29975)
            ├─{runc}(29977)
            ├─{runc}(29979)
            └─{runc}(29980)
ID: 42393
CloverField
Joined: 17 Oct 06
Posts: 39
Credit: 9,663,717
RAC: 27,400
Message 42404 - Posted: 11 May 2020, 1:37:07 UTC

Could everything running concurrently be the reason that my computer crashes when it's running 32 Theory tasks at once?
ID: 42404
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1428
Credit: 73,025,506
RAC: 105,776
Message 42406 - Posted: 11 May 2020, 6:30:33 UTC - in response to Message 42404.  

The error mentioned in this thread so far affected only madgraph tasks.
Only 0.25 % of all Theory runs currently listed "active" in mcplots are configured to use madgraph.
I guess vbox tasks are not affected as the processes inside the VM see only 1 core.

To get help with more general issues, you may post a more detailed description, preferably in a fresh thread.
ID: 42406
Henry Nebrensky
Joined: 13 Jul 05
Posts: 116
Credit: 12,913,111
RAC: 20,396
Message 42414 - Posted: 11 May 2020, 13:41:44 UTC - in response to Message 41986.  

It looks like all subprocesses are running concurrently which puts an extreme load on the host.

Like other Theory tasks this type should also respect the 1 core behavior and avoid running that many processes concurrently.
I've stumbled on 272677122, which has at the moment taken over all 8 cores on the host (though with a load average of only ~8), so it is not overwhelming the host completely but is certainly pushing out all other BOINC tasks. top reports that the master python process is taking up 69.3% of the memory!

===> [runRivet] Sun May 10 16:33:34 UTC 2020 [boinc pp zinclusive 7000 20,-,50,200 - madgraph5amc 2.6.6.atlas nlo2jet 100000 2]

> grep subprocess /var/lib/boinc/slots/2/cernvm/shared/runRivet.log
INFO: Generated 16 subprocesses with 192 real emission diagrams, 32 born diagrams and 32 virtual diagrams 
INFO: Generated 48 subprocesses with 2944 real emission diagrams, 192 born diagrams and 1440 virtual diagrams 
INFO: Generated 232 subprocesses with 36320 real emission diagrams, 2560 born diagrams and 47392 virtual diagrams 

> pstree -c 28778
wrapper_2019_03─┬─cranky-0.0.31───runc─┬─job───runRivet.sh─┬─rivetvm.exe
                │                      │                   ├─rungen.sh───python───python─┬─ajob1───madevent_mintMC
                │                      │                   │                             ├─ajob1───madevent_mintMC
                │                      │                   │                             ├─ajob1───madevent_mintMC
                │                      │                   │                             ├─ajob1───madevent_mintMC
                │                      │                   │                             ├─ajob1───madevent_mintMC
                │                      │                   │                             ├─ajob1───madevent_mintMC
                │                      │                   │                             ├─ajob1───madevent_mintMC
                │                      │                   │                             ├─ajob1───madevent_mintMC
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             └─{python}
                │                      │                   └─sleep
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      └─{runc}
                └─{wrapper_2019_03}

Log file presently ends with
INFO:  Idle: 0,  Running: 3,  Completed: 444 [  22m 15s  ] 
INFO:  Idle: 0,  Running: 2,  Completed: 445 [  22m 15s  ] 
INFO:  Idle: 0,  Running: 0,  Completed: 447 [  22m 15s  ] 
INFO:    Doing reweight 
INFO:  Idle: 0,  Running: 2,  Completed: 445 [ current time: 13h32 ] 
INFO:  Idle: 0,  Running: 1,  Completed: 446 [  0.12s  ] 
INFO:  Idle: 0,  Running: 0,  Completed: 447 [  0.74s  ] 
INFO: Collecting events 
and doesn't give any clear indication on progress - it's completed 447 out of... 232? 2560? what?
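
As an aside, the counters in those status lines are easy to pull out with a script, which at least makes it possible to watch them over time. A minimal Python sketch, assuming the `Idle: ..., Running: ..., Completed: ...` layout shown above; the name `parse_status` is only illustrative, and since the log never states a total, no percentage can be derived from it:

```python
import re

# Matches lines such as "INFO:  Idle: 0,  Running: 3,  Completed: 444 [  22m 15s  ]"
STATUS_RE = re.compile(r"Idle:\s*(\d+),\s*Running:\s*(\d+),\s*Completed:\s*(\d+)")


def parse_status(line):
    """Return (idle, running, completed) for a status line, or None otherwise."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    idle, running, completed = (int(value) for value in match.groups())
    return idle, running, completed


if __name__ == "__main__":
    sample = [
        "INFO:  Idle: 0,  Running: 2,  Completed: 445 [  22m 15s  ]",
        "INFO:    Doing reweight",
        "INFO:  Idle: 0,  Running: 0,  Completed: 447 [  0.74s  ]",
    ]
    for line in sample:
        print(parse_status(line))
```

Lines without counters (such as "Doing reweight") simply yield None, so the same function can be run over the whole runRivet.log.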
ID: 42414
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1428
Credit: 73,025,506
RAC: 105,776
Message 42415 - Posted: 11 May 2020, 14:27:18 UTC - in response to Message 42414.  

I guess the BOINC client still treats the task as 1-core.
Worst case (on an 8 core CPU) would be that BOINC starts 8 of them concurrently and the load average jumps to 8*8=64 (plus normal work).
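
The arithmetic behind that worst case, as a small Python sketch. It assumes every task spawns one busy process per core and that all of them are runnable at the same time; the function name is just for illustration:

```python
def worst_case_load(concurrent_tasks, processes_per_task):
    """Load if every process of every task is runnable at the same time."""
    return concurrent_tasks * processes_per_task


cores = 8
load = worst_case_load(concurrent_tasks=cores, processes_per_task=cores)
print(load)          # 64
print(load / cores)  # 8.0 runnable processes per core
```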
ID: 42415
Henry Nebrensky
Joined: 13 Jul 05
Posts: 116
Credit: 12,913,111
RAC: 20,396
Message 42416 - Posted: 11 May 2020, 15:45:32 UTC - in response to Message 42415.  
Last modified: 11 May 2020, 16:01:59 UTC

I guess the BOINC client still treats the task as 1-core.
Yes: boinccmd --get_tasks reported
   name: Theory_2390-1152716-2_0
   WU name: Theory_2390-1152716-2
   project URL: https://lhcathome.cern.ch/lhcathome/
   received: Sun May 10 13:17:48 2020
   report deadline: Wed May 20 13:17:47 2020
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 30005
   resources: 1 CPU
   estimated CPU time remaining: 1.650947
   CPU time at last checkpoint: 0.000000
   current CPU time: 20120.640000
   fraction done: 0.999978
   swap size: 8667 MB
   working set size: 5558 MB
but it started (eventually) 8 active processes, and the BOINC client was sensible and didn't start any more tasks as the existing ones finished off until the madgraph task completed:
15835: 11-May-2020 15:04:21 (low) [LHC@home] Starting task RAfMDmsbmrwnsSi4apGgGQJmABFKDmABFKDmBVrYDmABFKDmcy862n_0
15836: 11-May-2020 15:04:21 (low) [LHC@home] Starting task Theory_2390-1102868-2_0
15837: 11-May-2020 15:04:21 (low) [LHC@home] Starting task Theory_2390-1146431-2_1
15838: 11-May-2020 15:04:22 (low) [LHC@home] Starting task Theory_2390-1087074-2_0
15839: 11-May-2020 15:26:22 (low) [LHC@home] Computation for task Theory_2390-1152716-2_0 finished
15840: 11-May-2020 15:26:22 (low) [LHC@home] Starting task Theory_2390-1113717-2_0
15841: 11-May-2020 15:26:24 (low) [LHC@home] Started upload of Theory_2390-1152716-2_0_r1750715384_result
15842: 11-May-2020 15:26:29 (low) [LHC@home] Finished upload of Theory_2390-1152716-2_0_r1750715384_result

Worst case (on an 8 core CPU) would be that BOINC starts 8 of them concurrently and the load average jumps to 8*8=64 (plus normal work).
Maybe: I didn't check the load last night when it would still have been fighting with multi-core Atlas tasks. It looks like the BOINC client is trying to do the right thing, but - as you've pointed out below - the tasks themselves should be running the subtasks in series, not parallel.
ID: 42416
Henry Nebrensky
Joined: 13 Jul 05
Posts: 116
Credit: 12,913,111
RAC: 20,396
Message 42417 - Posted: 11 May 2020, 16:36:46 UTC - in response to Message 42416.  

IMO it's also cheating us on the credit:

For single-threaded 272136516
02:58:17 (13728): cranky exited; CPU time 674129.550630
and 6,326.46 credit, i.e. 6.3k cr. for ~674k s CPU time. While for "multi-core" 272677122,
15:26:19 (28778): cranky exited; CPU time 447204.364571
and 657.71 credit, i.e. 0.7k cr. for 447k s CPU. I don't think my machines are a factor 5 different, but the Run times are!
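
To put a number on it, here is a quick Python sketch that divides the credit by the CPU time for the two results quoted above. The figures are copied from this post; the function name is only illustrative, and this is not how BOINC computes credit, just a comparison of what was granted:

```python
def credit_per_cpu_second(credit, cpu_seconds):
    """Granted credit divided by the CPU time reported by cranky."""
    return credit / cpu_seconds


# Values quoted above for the two results.
single = credit_per_cpu_second(6326.46, 674129.550630)  # single-threaded 272136516
multi = credit_per_cpu_second(657.71, 447204.364571)    # "multi-core" 272677122

print(f"single-threaded: {single:.5f} credit per CPU second")
print(f"multi-core:      {multi:.5f} credit per CPU second")
print(f"ratio:           {single / multi:.1f}")
```

Per CPU second, the single-threaded result was granted roughly 6 times as much credit as the "multi-core" one.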
ID: 42417
Henry Nebrensky
Joined: 13 Jul 05
Posts: 116
Credit: 12,913,111
RAC: 20,396
Message 42483 - Posted: 14 May 2020, 14:29:51 UTC - in response to Message 42416.  

Another one: 273087449. boinccmd --get_tasks reports
2) -----------
   name: Theory_2390-1153380-3_0
   WU name: Theory_2390-1153380-3
   project URL: https://lhcathome.cern.ch/lhcathome/
   received: Thu May 14 00:45:03 2020
   report deadline: Sun May 24 00:45:02 2020
   ready to report: no
   state: downloaded
   scheduler state: scheduled
   active_task_state: EXECUTING
   app version num: 30006
   resources: 1 CPU
   estimated CPU time remaining: 0.017439
   slot: 1
   PID: 8741
   CPU time at last checkpoint: 0.000000
   current CPU time: 10407.640000
   fraction done: 1.000000
   swap size: 7842 MB
   working set size: 6124 MB
and pstree -c 8741 reports
wrapper_2019_03─┬─cranky-0.0.32───runc─┬─job───runRivet.sh─┬─rivetvm.exe
                │                      │                   ├─rungen.sh───python───python─┬─ajob1───madevent_mintMC
                │                      │                   │                             ├─ajob1───madevent_mintMC
                │                      │                   │                             ├─ajob1───madevent_mintMC
                │                      │                   │                             ├─ajob1───madevent_mintMC
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             ├─{python}
                │                      │                   │                             └─{python}
                │                      │                   └─sleep
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      ├─{runc}
                │                      └─{runc}
                └─{wrapper_2019_03}

Again, it looks like the BOINC client is trying to do the right thing by not starting any new tasks, which keeps the load down to 4, but - as you've pointed out below - the tasks themselves should be running the subtasks in series, not in parallel. This task has been hogging the entire machine for 12+ hours now.
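
For anyone who wants to spot such tasks automatically, the mismatch between "resources: 1 CPU" and the number of workers can be checked with a few lines of Python. This is only a sketch: it assumes `pstree -c` output without PIDs (as pasted above), and `count_processes` is a made-up helper name:

```python
def count_processes(pstree_output, name):
    """Count lines of `pstree -c` output whose last entry is the process `name`."""
    return sum(
        1 for line in pstree_output.splitlines()
        if line.rstrip().endswith(name)
    )


if __name__ == "__main__":
    # Shortened excerpt of the tree shown above.
    sample = """\
rungen.sh───python───python─┬─ajob1───madevent_mintMC
                            ├─ajob1───madevent_mintMC
                            ├─ajob1───madevent_mintMC
                            ├─ajob1───madevent_mintMC
                            ├─{python}
                            └─{python}
"""
    workers = count_processes(sample, "madevent_mintMC")
    declared_cpus = 1  # from "resources: 1 CPU" in the boinccmd output
    print(workers, declared_cpus)  # 4 1
```

With `pstree -p` the names carry a "(PID)" suffix, so the match would have to be adjusted for that.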
ID: 42483
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1428
Credit: 73,025,506
RAC: 105,776
Message 42484 - Posted: 14 May 2020, 15:19:17 UTC - in response to Message 42483.  

In your example "wrapper_2019_03" is the process that is started by your BOINC client and everything below "runc" is hidden in a container controlled by runc.
Hence your BOINC client treats it as a 1 core task.

Other Theory tasks also run 2 main processes inside a runc container, rivetvm.exe and (e.g.) pythia and some of them also cause a minor overload. But as the pythia output is used as rivetvm input they automatically throttle each other.

In case of madevent it looks like the scripts inside the container do their own test regarding the CPU capabilities and set up as many threads as cores are reported.
I also don't see that the madevents get throttled.

It's a job for the team maintaining the scientific app to investigate and correct this behavior.
ID: 42484
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 934
Credit: 6,286,290
RAC: 664
Message 42515 - Posted: 16 May 2020, 13:01:46 UTC
Last modified: 16 May 2020, 13:14:33 UTC

For BOINC it seems OK, but not for science. https://lhcathome.cern.ch/lhcathome/result.php?resultid=273089431

===> [runRivet] Sat May 16 10:50:01 UTC 2020 [boinc pp zinclusive 7000 -,-,50,130 - madgraph5amc 2.7.2.atlas3 nlo1jet 100000 3]
after ~100 minutes run time on VBox:

ID: 42515
Anton
Joined: 26 Nov 10
Posts: 8
Credit: 1,435,923
RAC: 0
Message 42519 - Posted: 16 May 2020, 15:18:07 UTC

Hi All,
Indeed, by default the madgraph code uses all CPU cores.
This has been corrected and the limit is now set to 2 cores max.
(It will take a few days until new jobs arrive.)
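
The effect of such a cap can be sketched in a few lines of Python. This is not the madgraph code itself, only an illustration of the idea; `worker_count` and `max_cores` are hypothetical names:

```python
import os


def worker_count(reported_cores, max_cores=None):
    """Number of parallel workers: all reported cores unless a cap is given."""
    if max_cores is None:
        return reported_cores
    return min(reported_cores, max_cores)


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    print("uncapped:", worker_count(cores))
    print("capped:  ", worker_count(cores, max_cores=2))
```

On an 8-core host the uncapped variant starts 8 workers, while the capped one never starts more than 2.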
ID: 42519
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 15 Jun 08
Posts: 1428
Credit: 73,025,506
RAC: 105,776
Message 42523 - Posted: 16 May 2020, 17:29:48 UTC - in response to Message 42519.  

Thanks a lot.
ID: 42523