Message boards : Theory Application : MadGraph5

maeax
Joined: 2 May 07
Posts: 2151
Credit: 160,937,570
RAC: 52,740
Message 43261 - Posted: 24 Aug 2020, 3:21:40 UTC

I have a MadGraph5 task that has been running for more than 40 hours so far.
madgraph5amc 2.6.5.atlas nlo2jet - zinclusive 7000 -,-,50,130
MC Production matrix: 0+4/18

Is there a chance of it finishing? runRivet.log is still growing; the last line so far:
INFO: Idle:185, Running: 2, Completed: 413 [ 35h 51 min]

https://launchpad.net/mg5amcnlo
Henry Nebrensky
Joined: 13 Jul 05
Posts: 169
Credit: 14,965,342
RAC: 1,411
Message 43268 - Posted: 24 Aug 2020, 12:26:35 UTC - in response to Message 43261.  

My guess is that the "idle" number is slowly reducing until "Completed" reaches 600 when either the task completes, or starts a whole new phase...
Can you leave it for ~20hrs and see what happens? As long as the log file is growing then there's some grounds for optimism it'll finish OK.
Back-stepping through the log file to the start of the current phase should tell you what it's trying to do in this phase.
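
As a rough aid for judging progress, the status line quoted at the top of the thread can be parsed and extrapolated. This is only a minimal Python sketch, not something shipped with MadGraph or BOINC: the function names are made up for illustration, the regular expression is written purely against the status lines quoted in this thread, and the linear extrapolation assumes every job takes about the same time, which may well not hold.

```python
import re

# Matches lines such as: INFO: Idle:185, Running: 2, Completed: 413 [ 35h 51 min]
# The pattern is an assumption based only on the lines quoted in this thread.
STATUS_RE = re.compile(
    r"Idle:\s*(\d+),\s*Running:?\s*(\d+),\s*Completed:\s*(\d+)"
    r"(?:\s*\[\s*(\d+)h\s*(\d+)\s*min\s*\])?"
)

def parse_status(line):
    """Return (idle, running, completed, elapsed_hours or None), or None if no match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    idle, running, completed = (int(g) for g in m.group(1, 2, 3))
    elapsed = None
    if m.group(4) is not None:
        elapsed = int(m.group(4)) + int(m.group(5)) / 60.0
    return idle, running, completed, elapsed

def estimate_remaining_hours(line):
    """Linear extrapolation of the time left in the current phase (hypothetical helper)."""
    parsed = parse_status(line)
    if parsed is None:
        return None
    idle, running, completed, elapsed = parsed
    if elapsed is None or completed == 0:
        return None
    total = idle + running + completed
    return elapsed * (total - completed) / completed

line = "INFO: Idle:185, Running: 2, Completed: 413 [ 35h 51 min]"
idle, running, completed, elapsed = parse_status(line)
print("total jobs in this phase:", idle + running + completed)
print("estimated hours left: %.1f" % estimate_remaining_hours(line))
```

The three counters summing to a constant is what makes "Completed reaches 600" a plausible end point for the phase; whether another phase follows cannot be read from this line alone.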

My experience with madgraph hasn't been good - run natively, it will use 2 cores, forcing other tasks off the machine.

It also has significant stretches of not actually using CPU at all. We did have a thread about it some months back.
maeax
Joined: 2 May 07
Posts: 2151
Credit: 160,937,570
RAC: 52,740
Message 43269 - Posted: 24 Aug 2020, 12:55:55 UTC
Last modified: 24 Aug 2020, 13:03:48 UTC

I have found this thread you wrote - Extreme overload:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5323#41736
I have native Linux with ONE CPU, but the log has an entry to use two CPUs (set nb_core 2).
How can this second CPU be used?
Running is 2. Now 548 Completed and Idle: 50 (it seems 600 is the max).
It would be nice, if the Theory task reaches the 10-day limit, to get some points for this pain ;-).
Edit: There are quite a lot of "Fontconfig error: Cannot load default config file" messages.
Henry Nebrensky
Joined: 13 Jul 05
Posts: 169
Credit: 14,965,342
RAC: 1,411
Message 43270 - Posted: 24 Aug 2020, 13:35:39 UTC - in response to Message 43269.  
Last modified: 24 Aug 2020, 13:43:01 UTC

I have found this thread you wrote - Extreme overload:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5323#41736
I have native Linux with ONE CPU, but the log has an entry to use two CPUs (set nb_core 2).
How can this second CPU be used?

Same way as it used all 232 cores on computezrmle's machine! :(
It'll just chuck processes at the OS and see what happens - isn't there a rivetvm.exe as well, or is that idle while madgraph does its multiprocessing thing?

Running is 2. Now 548 Completed and Idle: 50 (it seems 600 is the max).

Looking back at that thread you might want to do a
grep subprocess /var/lib/boinc/slots/?/cernvm/shared/runRivet.log
to check that 600 is indeed the correct number (edit: just in case "idle" doesn't mean what I think it does).
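
For anyone who would rather do the same search from a script, here is a minimal Python sketch of that grep. The path pattern and the search word are the ones given above; the function name is made up for illustration, and what the matching lines look like depends on the log, so nothing is assumed about their format.

```python
import glob

def grep_logs(pattern, needle):
    """Return (path, line) pairs for every line containing `needle`
    in the files matching the shell-style `pattern`."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if the log holds odd bytes
        with open(path, errors="replace") as fh:
            for line in fh:
                if needle in line:
                    hits.append((path, line.rstrip("\n")))
    return hits

if __name__ == "__main__":
    log_pattern = "/var/lib/boinc/slots/?/cernvm/shared/runRivet.log"
    for path, line in grep_logs(log_pattern, "subprocess"):
        print(path + ":" + line)
```

As in the shell, the `?` matches exactly one character, so slot directories numbered 10 and above would need a wider pattern such as `*`.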
Henry Nebrensky
Joined: 13 Jul 05
Posts: 169
Credit: 14,965,342
RAC: 1,411
Message 43271 - Posted: 24 Aug 2020, 16:06:32 UTC - in response to Message 43268.  
Last modified: 24 Aug 2020, 16:07:27 UTC

It also has significant stretches of not actually using CPU at all.
E.g. I recently killed task 281349801 precisely because it was holding two cores but idle - it's reported as using just 50 mins in 20 hours :(

We did have a thread (which maeax has kindly tracked down) about it some months back.
This does remind me that I was going to complain there that even hard-wiring the coreness to two isn't really good enough - it should either be one, or else the WUs should be submitted to BOINC with a consistent #cores requirement.
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1134
Credit: 49,918,951
RAC: 3,274
Message 43272 - Posted: 24 Aug 2020, 16:51:31 UTC

I have never had any problems with MadGraph5 event generator tasks, but it would take all day to go through all of the Valids to find just how long any of them ran, and most of mine are probably from the other version of Theory, 300.06.

But I do watch ALL of mine start running and check the finished tasks, and I have saved examples of all the different event generator versions (I have some epos, herwig7 and herwig++ running, and a few sherpa, but mostly the pythia versions).

I will get on one of my desktops and see what I have saved there later today if I get a chance.
Henry Nebrensky
Joined: 13 Jul 05
Posts: 169
Credit: 14,965,342
RAC: 1,411
Message 43273 - Posted: 24 Aug 2020, 17:13:42 UTC - in response to Message 43272.  
Last modified: 24 Aug 2020, 17:15:39 UTC

True - there's a sampling feature in that I only check in rarely and follow up on tasks that look to be misbehaving. I also wonder if madgraph behaves better within a VM where it can't see any other cores.
maeax
Joined: 2 May 07
Posts: 2151
Credit: 160,937,570
RAC: 52,740
Message 43274 - Posted: 24 Aug 2020, 17:31:15 UTC

The first round ended after 600 events (49h 2m).
The second is running at the moment. Its first line: Computing upper envelope
INFO: Idle:598, Running 2, Completed: 0 current time 16h17 (I think this is the time to finish the second round.)
I'll look at the next point tomorrow morning - good night.
maeax
Joined: 2 May 07
Posts: 2151
Credit: 160,937,570
RAC: 52,740
Message 43277 - Posted: 25 Aug 2020, 16:48:21 UTC - in response to Message 43274.  
Last modified: 25 Aug 2020, 16:48:45 UTC

Magic, you are right. After:
Run time 3 days 5 hours 48 min 40 sec
CPU time 6 days 2 hours 0 min
the task finished successfully.
I don't understand why a second CPU was used in a VM with one CPU defined.
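
As a quick worked check on the two figures above: dividing CPU time by run time gives the average number of busy cores. The sketch below is plain arithmetic in Python with made-up helper names; it says nothing about how the second core came to be used, only how much it was.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a days/hours/minutes/seconds duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

def average_busy_cores(cpu_seconds, wall_seconds):
    """CPU time divided by run time: the mean number of cores kept busy."""
    return cpu_seconds / wall_seconds

# Figures reported for the finished task in this post
run_time = to_seconds(days=3, hours=5, minutes=48, seconds=40)
cpu_time = to_seconds(days=6, hours=2)
print("finished task: %.2f cores busy on average"
      % average_busy_cores(cpu_time, run_time))

# Figures quoted earlier in the thread for the killed task:
# 50 minutes of CPU in 20 hours
print("killed task: %.2f cores busy on average"
      % average_busy_cores(to_seconds(minutes=50), to_seconds(hours=20)))
```

A ratio close to 2 is consistent with the "set nb_core 2" entry mentioned earlier, while a ratio near zero is the "holding cores but idle" case.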
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1134
Credit: 49,918,951
RAC: 3,274
Message 43278 - Posted: 25 Aug 2020, 18:44:08 UTC - in response to Message 43277.  

Thanks maeax
Henry Nebrensky
Joined: 13 Jul 05
Posts: 169
Credit: 14,965,342
RAC: 1,411
Message 50267 - Posted: 28 May 2024, 14:16:16 UTC - in response to Message 43271.  
Last modified: 28 May 2024, 15:07:00 UTC

It also has significant stretches of not actually using CPU at all.
E.g. I recently killed task 281349801 precisely because it was holding two cores but idle - it's reported as using just 50 mins in 20 hours :(
Just aborted task 411417290 as runRivet.log shows
===> [runRivet] Tue May 28 13:26:22 UTC 2024 [boinc pp zinclusive 13000 - - madgraph5amc 2.6.0.atlas nlo 100000 160]
and then
INFO: Result for check_poles:
INFO: Poles successfully cancel for 20 points over 20 (tolerance=1.0e-05)
INFO: Starting run
INFO: Using 2 cores
INFO: Cleaning previous results
INFO: Generating events without running the shower.
INFO: Setting up grids
WARNING: program /shared/tmp/tmp.CDpOJU3qAk/MG5RUN/SubProcesses/P0_uux_epem/ajob1 1 F 0 0 launch ends with non zero status: 1. Stop all computation
/shared/tmp/tmp.CDpOJU3qAk/MG5RUN/SubProcesses/P0_uux_epem/ajob1: line 34: 10644 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
INFO: Idle: 6, Running: 0, Completed: 2 [ current time: 13h38 ]

I'd have left it running to see what happens, but the machine will get powered off soon. Looks like internal fault detection isn't being passed up the chain and stopping the task, or did I just get too impatient?
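
One way to spot this state without reading the whole log by eye is sketched below in Python. It is a heuristic only, not an existing BOINC or MadGraph check: the function name is made up, and the warning text and status-line format are taken solely from the excerpt above, so other MadGraph versions may word them differently.

```python
import re

# Both patterns are assumptions taken from the log excerpt in this post.
FAILURE_MARKER = "launch ends with non zero status"
STATUS_RE = re.compile(r"Idle:\s*(\d+),\s*Running:?\s*(\d+),\s*Completed:\s*(\d+)")

def looks_stalled(log_lines):
    """True if a failure warning was logged and the last status line
    shows jobs still queued but none running."""
    saw_failure = False
    last_status = None
    for line in log_lines:
        if FAILURE_MARKER in line:
            saw_failure = True
        m = STATUS_RE.search(line)
        if m:
            last_status = tuple(int(g) for g in m.groups())
    if not saw_failure or last_status is None:
        return False
    idle, running, _completed = last_status
    return idle > 0 and running == 0

excerpt = [
    "INFO: Setting up grids",
    "WARNING: program /shared/tmp/tmp.CDpOJU3qAk/MG5RUN/SubProcesses/P0_uux_epem/ajob1 1 F 0 0 launch ends with non zero status: 1. Stop all computation",
    "INFO: Idle: 6, Running: 0, Completed: 2 [ current time: 13h38 ]",
]
print("stalled:", looks_stalled(excerpt))
```

Requiring both conditions is deliberate: "Running: 0" alone could in principle appear briefly between jobs, and a warning alone might not stop the run.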

Edit: upshot was that the task was taking up a slot but using zero CPU - hence my revisiting an old thread about madgraph doing exactly that.
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2443
Credit: 230,880,354
RAC: 123,016
Message 50268 - Posted: 28 May 2024, 14:38:44 UTC - in response to Message 50267.  

... or did I just get too impatient?

No.
It was OK to cancel that task.
Henry Nebrensky

Send message
Joined: 13 Jul 05
Posts: 169
Credit: 14,965,342
RAC: 1,411
Message 50269 - Posted: 28 May 2024, 15:05:22 UTC - in response to Message 50268.  

Thanks... I wasn't sure if there was some clean-up check that would catch it, apart from the 10-day mop-up.

I should have said earlier: upshot was that the task was taking up a slot but using zero CPU - hence my dredging up an old thread about madgraph doing exactly that.

©2024 CERN