Message boards : Theory Application : MadGraph5

maeax
Joined: 2 May 07
Posts: 1301
Credit: 39,583,517
RAC: 11,358
Message 43261 - Posted: 24 Aug 2020, 3:21:40 UTC

I have a MadGraph5 task that has been running for more than 40 hours so far.
madgraph5amc 2.6.5.atlas nlo2jet - zinclusive 7000 -,-,50,130
MC Production matrix: 0+4/18

Is there a chance of it finishing? runRivet.log is still growing; the last line so far is:
INFO: Idle:185, Running: 2, Completed: 413 [ 35h 51 min]

https://launchpad.net/mg5amcnlo
ID: 43261

Henry Nebrensky
Joined: 13 Jul 05
Posts: 147
Credit: 14,665,277
RAC: 0
Message 43268 - Posted: 24 Aug 2020, 12:26:35 UTC - in response to Message 43261.  

My guess is that the "Idle" number will slowly fall until "Completed" reaches 600, at which point the task either completes or starts a whole new phase...
Can you leave it for ~20 hrs and see what happens? As long as the log file is growing, there are some grounds for optimism that it'll finish OK.
Stepping back through the log file to the start of the current phase should tell you what it's trying to do in this phase.
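
A minimal sketch of how one might track progress from that status line, assuming the log always uses the format quoted in this thread (other formats are not handled). The names `parse_status` and `estimate_remaining_min` are made up for illustration, and the estimate is a naive linear extrapolation that assumes all jobs take about the same time, which may well not hold.

```python
import re

# Matches status lines of the form quoted in this thread, e.g.
#   INFO: Idle:185, Running: 2, Completed: 413 [ 35h 51 min]
# Assumption: the elapsed time is always "[ <H>h <M> min]" or "[ <M> min]".
STATUS_RE = re.compile(
    r"Idle:\s*(\d+),\s*Running:?\s*(\d+),\s*Completed:\s*(\d+)"
    r"\s*\[\s*(?:(\d+)h\s*)?(\d+)\s*min\]"
)


def parse_status(line):
    """Return the job counts and elapsed minutes, or None if the line does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    hours = int(m.group(4) or 0)
    return {
        "idle": int(m.group(1)),
        "running": int(m.group(2)),
        "completed": int(m.group(3)),
        "elapsed_min": hours * 60 + int(m.group(5)),
    }


def estimate_remaining_min(status):
    """Naive linear extrapolation: minutes per completed job times jobs left."""
    if status["completed"] == 0:
        return None  # no rate can be computed yet
    jobs_left = status["idle"] + status["running"]
    return status["elapsed_min"] / status["completed"] * jobs_left


if __name__ == "__main__":
    s = parse_status("INFO: Idle:185, Running: 2, Completed: 413 [ 35h 51 min]")
    print("total jobs:", s["idle"] + s["running"] + s["completed"])
    print("estimated minutes left:", round(estimate_remaining_min(s)))
```

For the quoted line, Idle + Running + Completed is 185 + 2 + 413 = 600, which fits the guess of 600, and the extrapolation comes out at roughly 974 minutes (about 16 hours) for the rest of this phase.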

My experience with madgraph hasn't been good: run natively, it will use 2 cores, forcing other tasks off the machine. It also has significant stretches of not actually using CPU at all. We did have a thread about it some months back.
ID: 43268

maeax
Joined: 2 May 07
Posts: 1301
Credit: 39,583,517
RAC: 11,358
Message 43269 - Posted: 24 Aug 2020, 12:55:55 UTC
Last modified: 24 Aug 2020, 13:03:48 UTC

I have found the thread you wrote, "Extreme overload":
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5323#41736
I have native Linux with ONE CPU, but the log has an entry to use two CPUs (set nb_core 2).
How can this second CPU be used?
Running is 2. It is now at Completed: 548 and Idle: 50 (600 seems to be the maximum).
If the Theory task reaches the 10-day limit, it would be nice to get some points for this pain ;-).
Edit: There are quite a lot of "Fontconfig error: Cannot load default config file" messages.
ID: 43269

Henry Nebrensky
Joined: 13 Jul 05
Posts: 147
Credit: 14,665,277
RAC: 0
Message 43270 - Posted: 24 Aug 2020, 13:35:39 UTC - in response to Message 43269.  
Last modified: 24 Aug 2020, 13:43:01 UTC

I have found the thread you wrote, "Extreme overload":
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5323#41736
I have native Linux with ONE CPU, but the log has an entry to use two CPUs (set nb_core 2).
How can this second CPU be used?

Same way as it used all 232 cores on computezrmle's machine! :(
It'll just chuck processes at the OS and see what happens - isn't there a rivetvm.exe as well, or is that idle while madgraph does its multiprocessing thing?
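
To illustrate the point in general terms: nothing stops a program from starting more worker processes than there are cores, because the OS scheduler simply time-slices them. The sketch below is not MadGraph code. `launch_workers` is a hypothetical helper, and it assumes that `set nb_core 2` just sets the number of worker processes, which is an assumption rather than something confirmed in this thread.

```python
import os
import subprocess
import sys


def launch_workers(n_workers):
    """Start n_workers short CPU-bound child processes, whatever the core count."""
    # Each child burns a little CPU; a stand-in for a real subprocess job.
    cmd = [sys.executable, "-c", "sum(i * i for i in range(200000))"]
    procs = [subprocess.Popen(cmd) for _ in range(n_workers)]
    # wait() blocks until the child exits and returns its exit code (0 on success).
    return [p.wait() for p in procs]


if __name__ == "__main__":
    print("cores visible to this process:", os.cpu_count())
    print("exit codes:", launch_workers(2))
```

On a one-core machine both children still run to completion; they just share the core, which is why other tasks get squeezed.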

Running is 2. It is now at Completed: 548 and Idle: 50 (600 seems to be the maximum).

Looking back at that thread you might want to do a
grep subprocess /var/lib/boinc/slots/?/cernvm/shared/runRivet.log
to check that 600 is indeed the correct number (edit: just in case "idle" doesn't mean what I think it does).
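
For anyone who prefers a script, here is a rough Python equivalent of that grep, as a sketch only. `grep_lines` is a hypothetical helper; the path pattern is the one given above and may differ on other installations, and what the matching lines actually contain is not shown in this thread.

```python
import glob


def grep_lines(pattern, needle="subprocess"):
    """Return (path, line) pairs for lines containing `needle` in files matching `pattern`."""
    hits = []
    # glob treats "?" as "exactly one character", like the shell does.
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as fh:
            for line in fh:
                if needle in line:
                    hits.append((path, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Path pattern taken from the grep command above.
    for path, line in grep_lines("/var/lib/boinc/slots/?/cernvm/shared/runRivet.log"):
        print(f"{path}: {line}")
```

Printing the path alongside each hit makes it easier to tell the slots apart when several Theory tasks are running at once.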
ID: 43270

Henry Nebrensky
Joined: 13 Jul 05
Posts: 147
Credit: 14,665,277
RAC: 0
Message 43271 - Posted: 24 Aug 2020, 16:06:32 UTC - in response to Message 43268.  
Last modified: 24 Aug 2020, 16:07:27 UTC

It also has significant stretches of not actually using CPU at all.
E.g. I recently killed task 281349801 precisely because it was holding two cores while idle - it's reported as using just 50 mins of CPU time in 20 hours :(

We did have a thread (which maeax has kindly tracked down) about it some months back.
This does remind me that I was going to complain there that even hard-wiring the coreness to two isn't really good enough - it should either be one, or else the WUs should be submitted to BOINC with a consistent #cores requirement.
ID: 43271

MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 1000
Credit: 45,919,581
RAC: 4,765
Message 43272 - Posted: 24 Aug 2020, 16:51:31 UTC

I have never had any problems with MadGraph5 event generator tasks, but it would take all day to go through all of the Valids to find out just how long any of them ran, and most of mine are probably from the other version of Theory 300.06.

But I do watch ALL of mine start running and check the finished tasks, and I have saved examples of all the different event generator versions (I have some epos, herwig7 and herwig++ running), plus a few sherpa, but mostly the pythia versions.

I will get on one of my desktops and see what I have saved there later today if I get a chance.
ID: 43272

Henry Nebrensky
Joined: 13 Jul 05
Posts: 147
Credit: 14,665,277
RAC: 0
Message 43273 - Posted: 24 Aug 2020, 17:13:42 UTC - in response to Message 43272.  
Last modified: 24 Aug 2020, 17:15:39 UTC

True - there's a sampling feature in that I only check in rarely and follow up on tasks that look to be misbehaving. I also wonder if madgraph behaves better within a VM where it can't see any other cores.
ID: 43273

maeax
Joined: 2 May 07
Posts: 1301
Credit: 39,583,517
RAC: 11,358
Message 43274 - Posted: 24 Aug 2020, 17:31:15 UTC

The first round ended after 600 events (49h 2m).
The second is running at the moment. Its first line is: Computing upper envelope
INFO: Idle:598, Running 2, Completed: 0 current time 16h17 (I think this is the time to finish the second round).
I'll look at the next status tomorrow morning - good night.
ID: 43274

maeax
Joined: 2 May 07
Posts: 1301
Credit: 39,583,517
RAC: 11,358
Message 43277 - Posted: 25 Aug 2020, 16:48:21 UTC - in response to Message 43274.  
Last modified: 25 Aug 2020, 16:48:45 UTC

Magic, you are right. After:
Run time 3 days 5 hours 48 min. 40 sec.
CPU time 6 days 2 hours 0 min.
the task finished successfully.
I don't understand why a second CPU was used in a VM with one CPU defined.
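
A quick arithmetic check on those two figures, as a sketch: dividing CPU time by run time gives the average number of cores kept busy. The helper names `to_seconds` and `average_cores` are made up for illustration.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a days/hours/minutes/seconds duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def average_cores(cpu_seconds, wall_seconds):
    """Average number of cores kept busy over the run."""
    return cpu_seconds / wall_seconds


if __name__ == "__main__":
    wall = to_seconds(days=3, hours=5, minutes=48, seconds=40)  # reported run time
    cpu = to_seconds(days=6, hours=2)                           # reported CPU time
    print("average cores busy:", round(average_cores(cpu, wall), 2))
```

With the numbers above this gives about 1.88, so the task kept close to two cores busy on average for the whole run. The same calculation for the task mentioned earlier in the thread (50 mins of CPU in 20 hours) gives about 0.04.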
ID: 43277

MAGIC Quantum Mechanic
Joined: 24 Oct 04
Posts: 1000
Credit: 45,919,581
RAC: 4,765
Message 43278 - Posted: 25 Aug 2020, 18:44:08 UTC - in response to Message 43277.  

Thanks maeax
ID: 43278



©2021 CERN