MadGraph5
Author | Message |
---|---|
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
I have a MadGraph5 task that has been running for more than 40 hours so far: madgraph5amc 2.6.5.atlas nlo2jet - zinclusive 7000 -,-,50,130 MC Production matrix: 0+4/18. Is there a chance of it finishing? runRivet.log is still growing; the last line so far is: INFO: Idle:185, Running: 2, Completed: 413 [ 35h 51 min] https://launchpad.net/mg5amcnlo |
Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 13 |
Quote: "Is there a chance of finishing, runRivet.log is still growing, last line so far:" My guess is that the "Idle" number slowly falls until "Completed" reaches 600, at which point the task either completes or starts a whole new phase... Can you leave it for ~20hrs and see what happens? As long as the log file is growing there are some grounds for optimism that it'll finish OK. Stepping back through the log file to the start of the current phase should tell you what it's trying to do in this phase. My experience with madgraph hasn't been good - running native, it will use 2 cores, forcing other tasks off the machine. It also has significant stretches of not actually using CPU at all. We did have a thread about it some months back. |
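The "leave it for ~20hrs" advice can be sanity-checked with a rough linear extrapolation from the status line in the first post. This is only a sketch: it assumes that the bracketed "[ 35h 51 min]" is the time elapsed in the current phase, that jobs complete at a steady rate, and that Idle + Running is everything still outstanding. None of that is confirmed by the MadGraph documentation, and the function name is made up for illustration.

```python
import re

# Patterns for the status line format quoted in the thread.
STATUS_RE = re.compile(r"Idle:\s*(\d+),\s*Running:?\s*(\d+),\s*Completed:\s*(\d+)")
ELAPSED_RE = re.compile(r"\[\s*(\d+)h\s*(\d+)\s*min\s*\]")


def estimate_remaining_hours(status_line):
    """Linear estimate of hours left in the current phase (hypothetical helper)."""
    status = STATUS_RE.search(status_line)
    elapsed = ELAPSED_RE.search(status_line)
    if status is None or elapsed is None:
        raise ValueError("line does not look like a MadGraph status line")
    idle, running, completed = (int(x) for x in status.groups())
    if completed == 0:
        raise ValueError("no completed jobs yet, cannot estimate a rate")
    hours = int(elapsed.group(1)) + int(elapsed.group(2)) / 60
    rate = completed / hours          # jobs finished per hour so far
    return (idle + running) / rate    # assumes the rate stays constant


line = "INFO: Idle:185, Running: 2, Completed: 413 [ 35h 51 min]"
remaining = estimate_remaining_hours(line)
print(f"about {remaining:.1f} h left in this phase")
```

For the line above this gives roughly 16 hours, which is in the same range as the ~20 hours suggested. It says nothing about any further phases that may follow.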
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
I found the thread you wrote, "Extreme overload": https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5323#41736 I have native Linux with ONE CPU, but the log contains an entry to use two CPUs (set nb_core 2). How can this second CPU be used? "Running" is 2. Now Completed is 548 and Idle is 50 (it seems 600 is the max). If the Theory task reaches the 10-day limit, it would be nice to get some points for this pain ;-). Edit: There are quite a lot of "Fontconfig error: Cannot load default config file" messages. |
Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 13 |
Quote: "Have found this thread you wrote - Extreme overload:" Same way as it used all 232 cores on computezrmle's machine! :( It'll just chuck processes at the OS and see what happens - isn't there a rivetvm.exe as well, or is that idle while madgraph does its multiprocessing thing? Quote: "The running: is 2. Now 548 Completed and Idle: 50 (seem 600 is the max.)" Looking back at that thread, you might want to run grep subprocess /var/lib/boinc/slots/?/cernvm/shared/runRivet.log to check that 600 is indeed the correct number (edit: just in case "idle" doesn't mean what I think it does). |
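A quick cross-check on the "600" figure, alongside the grep, is to add up Idle + Running + Completed for each status line and see whether the total stays constant. The sketch below does that. The first sample line is the one quoted in the opening post; the second is rebuilt from the numbers given in the post above (548 completed, 50 idle, 2 running) and is not a verbatim log line. The function name is hypothetical.

```python
import re

# Tolerates "Running: 2" as well as "Running 2".
STATUS_RE = re.compile(r"Idle:\s*(\d+),\s*Running:?\s*(\d+),\s*Completed:\s*(\d+)")


def job_totals(lines):
    """Return Idle + Running + Completed for every status line found."""
    totals = []
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            totals.append(sum(int(x) for x in match.groups()))
    return totals


sample = [
    "INFO: Idle:185, Running: 2, Completed: 413 [ 35h 51 min]",
    # Rebuilt from the figures in the post, not copied from a log:
    "INFO: Idle:50, Running: 2, Completed: 548",
    "some unrelated log line",
]
totals = job_totals(sample)
print(totals, "constant" if len(set(totals)) == 1 else "changing")
```

Both sample lines add up to 600, which fits the guess that 600 is the number of jobs in this phase. To run it against a real log, pass the open file object instead of `sample`.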
Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 13 |
Quote: "It also has significant stretches of not actually using CPU at all." E.g. I recently killed task 281349801 precisely because it was holding two cores but sitting idle - it's reported as using just 50 mins of CPU in 20 hours :( Quote: "We did have a thread (which maeax has kindly tracked down) about it some months back." This does remind me that I was going to complain there that even hard-wiring the core count to two isn't really good enough - it should either be one, or else the WUs should be submitted to BOINC with a consistent #cores requirement. |
Joined: 24 Oct 04 Posts: 1234 Credit: 79,783,863 RAC: 76,928 |
I have never had any problems with MadGraph5 event generator tasks, but it would take all day to go through all of the Valids to find just how long any of them ran, and most of mine are probably from the other version of Theory, 300.06. But I do watch ALL of mine start running and check the finished tasks, and I have saved examples of all the different event generator versions (I have some epos, herwig7 and herwig++ running), plus a few sherpa, but mostly the pythia versions. I will get on one of my desktops and see what I have saved there later today if I get a chance. |
Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 13 |
True - there's a sampling feature in that I only check in rarely and follow up on tasks that look to be misbehaving. I also wonder if madgraph behaves better within a VM where it can't see any other cores. |
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
The first round ended after 600 events (49h 2m). The second is running at the moment. Its first lines: Computing upper envelope INFO: Idle:598, Running 2, Completed: 0 current time 16h17 (I think this is the time to finish the second round). I'll look at the next status tomorrow morning - good night. |
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
Magic, you are right. After a run time of 3 days 5 hours 48 min 40 sec and a CPU time of 6 days 2 hours 0 min, the task finished successfully. I don't understand why a second CPU was used in a VM with one CPU defined. |
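The run time and CPU time reported above show how many cores the task kept busy on average: CPU time divided by run time. A small worked calculation, using only the two durations from the post (the helper name is made up for illustration):

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a days/hours/minutes/seconds duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Values reported in the post above.
wall = to_seconds(days=3, hours=5, minutes=48, seconds=40)  # run time
cpu = to_seconds(days=6, hours=2)                           # CPU time

avg_cores = cpu / wall
print(f"run time {wall} s, CPU time {cpu} s, average cores busy {avg_cores:.2f}")
```

The ratio comes out at about 1.88, so across the whole run the task kept close to two cores busy. That matches the "Running: 2" in the log and the set nb_core 2 entry mentioned earlier in the thread.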
Joined: 24 Oct 04 Posts: 1234 Credit: 79,783,863 RAC: 76,928 |
Thanks maeax |
Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 13 |
Quote: "It also has significant stretches of not actually using CPU at all. e.g I recently killed task 281349801 precisely because it was holding two cores but idle - it's reported as using just 50 mins in 20 hours :(" Just aborted task 411417290, as runRivet.log shows: ===> [runRivet] Tue May 28 13:26:22 UTC 2024 [boinc pp zinclusive 13000 - - madgraph5amc 2.6.0.atlas nlo 100000 160] and then: INFO: Result for check_poles: I'd have left it running to see what happens, but the machine will get powered off soon. It looks like internal fault detection isn't being passed up the chain to stop the task, or did I just get too impatient? Edit: upshot was that the task was taking up a slot but using zero CPU - hence my revisiting an old thread about madgraph doing exactly that. |
Joined: 15 Jun 08 Posts: 2683 Credit: 286,885,049 RAC: 55,674 |
Quote: "... or did I just get too impatient?" No. It was OK to cancel that task. |
Joined: 13 Jul 05 Posts: 170 Credit: 15,020,549 RAC: 13 |
Thanks... I wasn't sure if there was some clean-up check that would catch it, apart from the 10-day mop-up. I should have said earlier: upshot was that the task was taking up a slot but using zero CPU - hence my dredging up an old thread about madgraph doing exactly that. |
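The rule of thumb used in this thread, that a task is probably still alive while runRivet.log keeps growing, can be turned into a simple check on when the log was last written. This is a sketch only. The file's modification time stands in for "still growing", the six-hour threshold is an arbitrary assumption, and both function names are hypothetical. The demo runs on a throwaway file, not a real BOINC slot.

```python
import os
import tempfile
import time


def seconds_since_last_write(path, now=None):
    """Seconds elapsed since the file was last modified."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


def looks_stalled(path, max_quiet_seconds=6 * 3600, now=None):
    """True if the log has been quiet for longer than the (assumed) threshold."""
    return seconds_since_last_write(path, now) > max_quiet_seconds


# Demo on a temporary file instead of a real runRivet.log.
with tempfile.TemporaryDirectory() as tmp:
    demo_log = os.path.join(tmp, "runRivet.log")
    with open(demo_log, "w") as fh:
        fh.write("INFO: Result for check_poles:\n")
    fresh = looks_stalled(demo_log)      # just written

    # Pretend the last write happened 20 hours ago.
    old = time.time() - 20 * 3600
    os.utime(demo_log, (old, old))
    stale = looks_stalled(demo_log)

print("fresh file stalled?", fresh, "| 20 h old file stalled?", stale)
```

A quiet log is only a hint. The task in the opening post took long stretches between updates and still finished, so the threshold needs to be generous.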
Joined: 14 Jan 10 Posts: 1461 Credit: 9,859,151 RAC: 2,540 |
While running the problematic Herwig7 7.2.1 jobs, I came across this one: [INFO] ===> [runRivet] Thu Oct 17 08:05:45 UTC 2024 [boinc pp zinclusive 13000 -,-,350,- - madgraph5amc 2.6.1.atlas nlo2jet 100000 478] Because of the Herwig jobs I had set up the Theory VMs to use 1536MB RAM (default 630MB). Monitoring the madgraph5 job, I noticed that the initial python process allocated up to 1.2GB of non-swapped physical memory:
KiB Mem : 1519120 total, 63052 free, 1394524 used, 61544 buff/cache
KiB Swap: 1048572 total, 192140 free, 856432 used. 772 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7266 boinc 20 0 1974224 1.1g 276 D 0.0 78.2 26:16.72 python
Edit: It finally ended with exit code 1: https://lhcathome.cern.ch/lhcathome/result.php?resultid=415015101 Guest Log: job: diskusage=23956 ???? Is this more than the 20GB maximum allowed for the Theory VMs? |
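To put the top output quoted above into numbers, the sketch below parses the two summary lines and works out how full RAM and swap were. It also converts the python process's %MEM figure (78.2) back into a resident size. The input strings are copied from the post. The helper name is made up, and the only outside fact relied on is that top's %MEM is the process's share of physical memory.

```python
import re


def parse_top_summary(line):
    """Extract the total/free/used figures (KiB) from a top summary line."""
    pairs = re.findall(r"(\d+)\s+(total|free|used)", line)
    return {name: int(value) for value, name in pairs}


mem = parse_top_summary(
    "KiB Mem : 1519120 total, 63052 free, 1394524 used, 61544 buff/cache")
swap = parse_top_summary(
    "KiB Swap: 1048572 total, 192140 free, 856432 used. 772 avail Mem")

mem_used_pct = 100 * mem["used"] / mem["total"]
swap_used_pct = 100 * swap["used"] / swap["total"]

# %MEM for PID 7266 was 78.2; convert that share of RAM to GiB.
python_res_gib = 0.782 * mem["total"] / (1024 * 1024)

print(f"RAM used {mem_used_pct:.1f}%, swap used {swap_used_pct:.1f}%, "
      f"python resident about {python_res_gib:.2f} GiB")
```

This works out to roughly 92% of RAM and 82% of swap in use, with the python process holding about 1.1 GiB resident. So even the enlarged 1536MB VM was close to exhausted, and the process's D state with 0.0 %CPU is what one would expect from a job waiting on swap I/O.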
Joined: 2 May 07 Posts: 2277 Credit: 178,709,076 RAC: 100,489 |
This is the same problem as with Sherpa (still without a solution), and now Herwig. We need a parameter to select these generators selectively. For multi-core use and for running other projects in parallel, we need a control on our side (for Sherpa and Herwig). |
©2025 CERN