Message boards : Theory Application : New Version 263.70

bronco

Joined: 13 Apr 18
Posts: 443
Credit: 8,438,885
RAC: 0
Message 36660 - Posted: 7 Sep 2018, 19:22:59 UTC - in response to Message 36657.  

Yup. 2 explanations:

1) You have hyper-threading (HT) enabled on the Intel CPU. HT is fine for workloads that are not math-intensive, but when the CPU is doing intensive math, especially floating-point calculations as we do in many BOINC projects, HT is a bad idea. Each physical core has only one floating point unit (FPU), shared by its two hyperthreads. If you are crunching 4 tasks, the extra threads compete for the FPUs and caches, which causes many cache misses and lower overall throughput. (A quick way to check whether HT is enabled is sketched after point 2.)

2) Credits are completely fubar. Don't believe credits or any conclusion/theory based on credits.
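
A rough sketch of that HT check, assuming a Windows host with the WMIC tool available:

@echo off
rem If NumberOfLogicalProcessors is twice NumberOfCores, hyper-threading is enabled
rem and the FPU/cache contention described in point 1 can occur.
wmic cpu get NumberOfCores,NumberOfLogicalProcessors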
ID: 36660
Erich56

Joined: 18 Dec 15
Posts: 1691
Credit: 104,369,629
RAC: 123,331
Message 36661 - Posted: 7 Sep 2018, 20:11:41 UTC - in response to Message 36660.  

Yup. 2 explanations:
1) You have hyper-threading (HT) enabled on the Intel. ... If you are crunching 4 tasks ...
I am crunching only 1 task, and no other projects or apps are running (hence Windows Task Manager shows 25% CPU load).
ID: 36661
Profile Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 36711 - Posted: 14 Sep 2018, 19:01:02 UTC

My machines seem to work best with single-core Theorys, and I'm happy with that, but recently I've noticed a few tasks sitting idle after finishing a job and not requesting another. After 12hrs that would be expected and the VM would shut down, but some of these show this behaviour after only a few hours and then sulk until the 18hr autokill. I have even caught a couple that were stuck part-way through a job, cursor flashing in the console as if it were working on the next event but with no CPU activity. If I intervene and manually reset the VM (a command-line sketch is at the end of this post), it will reboot, fetch work as normal and continue happily until the 18hr autokill.
I'll let the current tasks complete, then dump the .vdi to force a new download in case it has somehow become corrupted, but this has happened on 2 of 3 hosts so it might not solve the issue. As is often the case, it is random and intermittent, so not repeatable or easy to identify.

From StartLog, showing a normal job finish, a new start, and the final line:

09/14/18 12:41:38 Called deactivate_claim_forcibly()
09/14/18 12:41:38 Starter pid 26901 exited with status 0
09/14/18 12:41:38 State change: starter exited
09/14/18 12:41:38 Changing activity: Busy -> Idle
09/14/18 12:41:40 Got activate_claim request from shadow (188.184.94.254)
09/14/18 12:41:40 Remote job ID is 469815.391
09/14/18 12:41:40 Got universe "VANILLA" (5) from request classad
09/14/18 12:41:40 State change: claim-activation protocol successful
09/14/18 12:41:40 Changing activity: Idle -> Busy
09/14/18 12:57:32 Called deactivate_claim_forcibly()
09/14/18 12:57:32 Starter pid 30282 exited with status 0
09/14/18 12:57:32 State change: starter exited
09/14/18 12:57:32 Changing activity: Busy -> Idle
09/14/18 12:57:32 Got activate_claim request from shadow (188.184.94.254)
09/14/18 12:57:32 Remote job ID is 469815.652
09/14/18 12:57:32 Got universe "VANILLA" (5) from request classad
09/14/18 12:57:32 State change: claim-activation protocol successful
09/14/18 12:57:32 Changing activity: Idle -> Busy
09/14/18 19:12:19 CronJob: Job 'multicore' is still running!
(its own exclamation mark, not mine)
And stdout.log
12:57:37 +0200 2018-09-14 [INFO] New Job Starting in slot1
12:57:37 +0200 2018-09-14 [INFO] Condor JobID: 469815.652 in slot1
12:57:42 +0200 2018-09-14 [INFO] MCPlots JobID: 46329784 in slot1
19:11:33 +0200 2018-09-14 [INFO] Job finished in slot1 with 0.
but no request for new work even at only 11hrs elapsed.
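
For reference, the manual reset can also be done from the command line; a rough sketch, assuming VBoxManage.exe is on the PATH (it lives in the VirtualBox installation folder) and that you substitute the VM name reported by the first command:

@echo off
rem Show which VirtualBox VMs are currently running; the stuck Theory VM appears in this list.
VBoxManage list runningvms
rem Replace BOINC_VM_NAME with the name shown above, then send the VM a hard reset;
rem it should reboot, reconnect and fetch work again.
VBoxManage controlvm "BOINC_VM_NAME" reset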
ID: 36711
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2420
Credit: 226,976,109
RAC: 131,821
Message 36712 - Posted: 14 Sep 2018, 19:47:03 UTC - in response to Message 36711.  

... recently I've noticed a few tasks that have been sitting idle after finishing a job and not requesting another.

This can also be caused by a busy network or server at CERN.
It occasionally affects Theory, but at the moment LHCb seems to suffer a lot from failing uploads of intermediate job results.
In that case port 9148 at lbboinc01.cern.ch responds only after a huge delay of several minutes (!), and because of that delay the response is rejected by the firewall.
Things like that are very hard to identify.
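
For anyone who wants to see from their own connection whether that endpoint answers slowly, a rough sketch, assuming a Windows host where PowerShell's Test-NetConnection is available (Windows 8 or later); a very slow or failed result is consistent with the delay described above:

@echo off
rem Time a TCP connection attempt to the LHCb Condor endpoint mentioned above.
powershell -Command "Measure-Command { Test-NetConnection lbboinc01.cern.ch -Port 9148 } | Select-Object TotalSeconds"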
ID: 36712
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1281
Credit: 8,505,126
RAC: 2,417
Message 36714 - Posted: 14 Sep 2018, 20:37:32 UTC - in response to Message 36712.  

I've seen this idling very regularly. Mostly it happens after a longer suspend of the job followed by a resume.
Usually the events are processed again for about another 40 minutes, and then the VM suddenly starts idling.
I've reported it here and on the Dev project's forum.

Another issue: when a job inside the VM errors out, the job is not killed, no new job is requested and the VM idles until the 18-hour limit.

E.g.:

===> [runRivet] Fri Sep 14 16:25:01 CEST 2018 [boinc pp jets 7000 10 - pythia6 6.428 z2 100000 316]
.
.
.
data:  REF_ATLAS_2011_S9126244_d29-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/njets-vs-pt-lj/atlas2011-dy4-5/7000/ATLAS_2011_S9126244.dat
data:  REF_ATLAS_2011_S9126244_d29-x01-y02.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/njets-vs-pt-fb/atlas2011-dy4-5/7000/ATLAS_2011_S9126244.dat
data:  REF_CMS_2011_S9086218_d01-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y0.5/7000/CMS_2011_S9086218.dat
data:  REF_CMS_2011_S9086218_d02-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y1.0/7000/CMS_2011_S9086218.dat
data:  REF_CMS_2011_S9086218_d03-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y1.5/7000/CMS_2011_S9086218.dat
data:  REF_CMS_2011_S9086218_d04-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y2.0/7000/CMS_2011_S9086218.dat
data:  REF_CMS_2011_S9086218_d05-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y2.5/7000/CMS_2011_S9086218.dat
data:  REF_CMS_2011_S9086218_d06-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y3.0/7000/CMS_2011_S9086218.dat
data:  REF_CMS_2012_I1087342_d01-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2012-fjets-inclforward/7000/CMS_2012_I1087342.dat
data:  REF_CMS_2012_I1087342_d02-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2012-fjets-forward/7000/CMS_2012_I1087342.dat
data:  REF_CMS_2012_I1087342_d03-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2012-fjets-central/7000/CMS_2012_I1087342.dat
ERROR: following histograms should be produced according to run parameters,
       but missing from Rivet output:
         CMS_2012_I1102908_d01-x01-y01
         CMS_2012_I1102908_d02-x01-y01

       check mapping of above histograms in configuration file:
         configuration/rivet-histograms.map
ID: 36714
Profile Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 36715 - Posted: 14 Sep 2018, 20:41:30 UTC - in response to Message 36712.  

Thanks,
I thought I had confused it somehow by only allowing the multicore app to use a single core.
Still ... after not getting a new job, should it not self-terminate after 12hr rather than the full 18?
ID: 36715
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1281
Credit: 8,505,126
RAC: 2,417
Message 36716 - Posted: 14 Sep 2018, 21:00:42 UTC - in response to Message 36715.  

Still ... after not getting a new job, should it not self-terminate after 12hr rather than the full 18?

Normally yes, but even when you think a job ended normally, it could be a resumed job that finishes within the 40 minutes I mentioned earlier.
In that case Condor is not aware of it and will not close the VM, so BOINC has to do that after the 18-hour limit (or you can do it yourself with the script below, where you have to change boincpath to your own BOINC data directory):

@echo off
rem Creates an empty "shutdown" trigger file in the task's shared folder,
rem which tells the Theory VM to shut down. Adjust boincpath to your BOINC data directory.
set "slotdir="
set /p "slotdir=In which slot directory is the endless Theory task that you want to kill? "
set "boincpath=D:\Boinc1\slots\%slotdir%\shared"
copy /y NUL "%boincpath%\shutdown" >NUL
exit
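
If you are not sure which slot number to type in, a small helper along the same lines (again assuming the D:\Boinc1 data directory used above; adjust the path to your installation) lists each slot and the contents of its shared folder, which VirtualBox tasks such as Theory have:

@echo off
rem List every slot directory and what is in its "shared" folder,
rem to help pick the slot that holds the stuck Theory task.
for /d %%S in (D:\Boinc1\slots\*) do (
    echo ---- %%S
    dir /b "%%S\shared" 2>NUL
)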
ID: 36716
Profile Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,859,285
RAC: 0
Message 36761 - Posted: 18 Sep 2018, 21:14:19 UTC

2 more Tasks sat idle for 12hrs today. I've set "Switch between apps" to 1000mins so it's not a suspend/resume issue. Maybe the server that we're now connecting to with this latest app isn't up to all the extra traffic it's getting and, as computezrmle suggested, the connection is timing out? In which case, should the VM not be shut down at that point?
A manual VM reset forces a new connection and the tasks will run, doing useful work again, until the 18hr cutoff. I'm getting full credit for these tasks that have had idle time (I actually think I've been getting too much credit recently, compared to previous amounts), but I'd rather those credits were earned fairly by doing actual MCPlots events.
ID: 36761
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1129
Credit: 49,762,040
RAC: 7,480
Message 36764 - Posted: 19 Sep 2018, 3:51:23 UTC
Last modified: 19 Sep 2018, 3:51:48 UTC

Well, things have been running fine here, but something made me check just now and I see I got about 25 of these, all in the same hour.

[ERROR] Condor exited after 808s without running a job.

2018-09-18 19:47:46 (9884): Guest Log: [INFO] Shutting Down.

2018-09-18 19:47:46 (9884): VM Completion File Detected.
2018-09-18 19:47:46 (9884): VM Completion Message: Condor exited after 808s without running a job.

So I will watch for a while before I suspend these Theory tasks.

The ones I just watched starting up did get past "HTCondor Ping - OK (0)".
ID: 36764
maeax

Joined: 2 May 07
Posts: 2114
Credit: 159,867,104
RAC: 92,239
Message 36765 - Posted: 19 Sep 2018, 4:04:08 UTC

MC-Production has been stopped since yesterday afternoon:
http://mcplots-dev.cern.ch/production.php?view=status&plots=hourly#plots
ID: 36765
Profile Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1129
Credit: 49,762,040
RAC: 7,480
Message 36767 - Posted: 19 Sep 2018, 4:23:30 UTC - in response to Message 36765.  

MC-Production has been stopped since yesterday afternoon:
http://mcplots-dev.cern.ch/production.php?view=status&plots=hourly#plots

They stopped working here about 2 hours ago, so I am letting the ones that have been running for 5+ hours continue and have suspended all the rest.

I just had a few tasks finish and come back Valid in the last 15 minutes.

I will take another look later tonight (9:20 pm right now).

My old x86 can't even get new work that would turn into errors right now.
ID: 36767
Erich56

Joined: 18 Dec 15
Posts: 1691
Credit: 104,369,629
RAC: 123,331
Message 36769 - Posted: 19 Sep 2018, 5:14:40 UTC

Indeed, since last night I received the following error notice for all my Theory tasks:

207 (0x000000CF) EXIT_NO_SUB_TASKS

stderr says:

2018-09-19 05:15:12 (13752): Guest Log: [ERROR] No jobs were available to run.

Any idea when production of jobs will be resumed?
ID: 36769
Profile Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 378
Credit: 238,712
RAC: 0
Message 36773 - Posted: 19 Sep 2018, 8:13:01 UTC - in response to Message 36764.  

Yes, we are out of jobs; none are being provided by the mcplots server. The Theory tasks were automatically stopped. This was probably caused by the ongoing intervention in which we have to reboot all the hypervisors in our data centre.
ID: 36773