Message boards : Theory Application : New Version 263.70
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
Yup. 2 explanations:

1) You have hyper-threading (HT) enabled on the Intel. HT is fine for workloads that are not maths-intensive, but when the CPU is doing intensive maths, especially floating-point calculations, as we do in many BOINC projects, HT is a bad idea. Each physical core has only one set of floating-point units (FPU), shared by its two hyper-threads. If you are crunching 4 tasks on 2 cores, you have pairs of threads competing for the same FPU, which causes many cache misses and lower overall throughput.

2) Credits are completely fubar. Don't believe credits or any conclusion/theory based on credits.
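For anyone who wants to check whether HT is actually on, a minimal sketch for a Linux host: compare logical CPUs with physical cores (single-socket assumption; /proc/cpuinfo may lack "core id" lines inside some VMs, so treat the result as a hint, not gospel).

```shell
#!/bin/sh
# Rough sketch: if logical CPUs outnumber physical cores,
# SMT/hyper-threading is enabled. Single-socket assumption;
# some VMs expose no "core id" lines at all.
logical=$(grep -c '^processor' /proc/cpuinfo)
physical=$(awk -F': ' '/^core id/ {print $2}' /proc/cpuinfo | sort -u | wc -l)
echo "logical CPUs: $logical, physical cores: $physical"
if [ "$physical" -gt 0 ] && [ "$logical" -gt "$physical" ]; then
    echo "hyper-threading appears to be enabled"
else
    echo "hyper-threading appears to be disabled (or undetectable here)"
fi
```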
Joined: 18 Dec 15 Posts: 1691 Credit: 104,407,616 RAC: 122,906
> Yup. 2 explanations:

I am crunching 1 task only, and no other projects or apps are running (hence the Windows Task Manager shows 25% CPU load).
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0
My machines seem to work best with single-core Theorys, and I'm happy with that, but recently I've noticed a few tasks that have been sitting idle after finishing a job and not requesting another. After 12hrs that would be expected and the VM would shut down, but some of these can show this behaviour after only a few hours and will sulk until the 18hr autokill. I have even caught a couple that have stuck part-way through a job, cursor flashing in the console as if it is working on the next event but with no CPU activity. If I intervene and manually reset the VM, it will reboot, fetch work as normal and continue happily until the 18hr autokill. I'll let the current tasks complete, then dump the .vdi to force a new download just in case it has somehow become corrupted, but this has happened on 2 of 3 hosts so it might not solve the issue. As is often the case, it is random and intermittent, so not repeatable or easy to identify.

From StartLog, showing a normal job finish, a new start, and the final line:

09/14/18 12:41:38 Called deactivate_claim_forcibly()
09/14/18 12:41:38 Starter pid 26901 exited with status 0
09/14/18 12:41:38 State change: starter exited
09/14/18 12:41:38 Changing activity: Busy -> Idle
09/14/18 12:41:40 Got activate_claim request from shadow (188.184.94.254)
09/14/18 12:41:40 Remote job ID is 469815.391
09/14/18 12:41:40 Got universe "VANILLA" (5) from request classad
09/14/18 12:41:40 State change: claim-activation protocol successful
09/14/18 12:41:40 Changing activity: Idle -> Busy
09/14/18 12:57:32 Called deactivate_claim_forcibly()
09/14/18 12:57:32 Starter pid 30282 exited with status 0
09/14/18 12:57:32 State change: starter exited
09/14/18 12:57:32 Changing activity: Busy -> Idle
09/14/18 12:57:32 Got activate_claim request from shadow (188.184.94.254)
09/14/18 12:57:32 Remote job ID is 469815.652
09/14/18 12:57:32 Got universe "VANILLA" (5) from request classad
09/14/18 12:57:32 State change: claim-activation protocol successful
09/14/18 12:57:32 Changing activity: Idle -> Busy
09/14/18 19:12:19 CronJob: Job 'multicore' is still running!

(its own exclamation mark, not mine)

And stdout.log:

12:57:37 +0200 2018-09-14 [INFO] New Job Starting in slot1
12:57:37 +0200 2018-09-14 [INFO] Condor JobID: 469815.652 in slot1
12:57:42 +0200 2018-09-14 [INFO] MCPlots JobID: 46329784 in slot1
19:11:33 +0200 2018-09-14 [INFO] Job finished in slot1 with 0.

but no request for new work, even at only 11hrs elapsed.
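For anyone chasing the same symptom, the gaps between a "Busy -> Idle" line and the next "Idle -> Busy" line can be pulled out of a StartLog automatically. A rough sketch, not anything official: it assumes GNU date, and the scan_idle name and 10-minute threshold are mine.

```shell
#!/bin/sh
# Sketch: report idle periods longer than 10 minutes between one job
# finishing ("Busy -> Idle") and the next starting ("Idle -> Busy").
# Assumes GNU date and the timestamp format seen in the StartLog above.
scan_idle() {
    grep 'Changing activity' "$1" | {
        idle_start=""
        while read -r d t rest; do
            now=$(date -d "$d $t" +%s)
            case "$rest" in
                *'Busy -> Idle'*) idle_start=$now ;;
                *'Idle -> Busy'*)
                    if [ -n "$idle_start" ]; then
                        gap=$((now - idle_start))
                        [ "$gap" -gt 600 ] && echo "idle ${gap}s before $d $t"
                        idle_start=""
                    fi ;;
            esac
        done
    }
}
# Example: scan_idle /path/to/slot/shared/StartLog
```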
Joined: 15 Jun 08 Posts: 2420 Credit: 227,015,758 RAC: 131,395
> ... recently I've noticed a few tasks that have been sitting idle after finishing a job and not requesting another.

This can also be caused by a busy network or server at CERN. It occasionally affects Theory, but at the moment it looks like LHCb suffers a lot from failing uploads of intermediate job results. In that case port 9148 at lbboinc01.cern.ch responds only after a huge delay of several minutes (!), and as a result of this delay the response is rejected by the firewall. Things like that are very hard to identify.
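A delay like that can be spotted from the client side by timing a plain TCP connect. A rough sketch, not an official diagnostic: it assumes bash (for /dev/tcp) and the coreutils timeout command, and probe is just an illustrative name.

```shell
#!/bin/sh
# Sketch: time how long a TCP connect to a host:port takes, to spot
# the multi-minute delays described above. Assumes bash and timeout.
probe() {
    host=$1; port=$2
    start=$(date +%s)
    if timeout 300 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "connected to $host:$port after $(( $(date +%s) - start ))s"
    else
        echo "no answer from $host:$port (gave up after $(( $(date +%s) - start ))s)"
    fi
}
# Example: probe lbboinc01.cern.ch 9148
```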
Joined: 14 Jan 10 Posts: 1281 Credit: 8,508,235 RAC: 2,678
I've seen this idling very regularly. Mostly it happens after a longer suspend and subsequent resume of the job. Usually events are processed again for another 40 minutes or so, and then the idling of the VM suddenly starts. I've reported it here and at the Dev-project's forum. Another issue: when a job inside the VM errors out, the job is not killed, no new job is requested, and the VM idles until the 18 hours are up. E.g.:

===> [runRivet] Fri Sep 14 16:25:01 CEST 2018 [boinc pp jets 7000 10 - pythia6 6.428 z2 100000 316]
. . .
data: REF_ATLAS_2011_S9126244_d29-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/njets-vs-pt-lj/atlas2011-dy4-5/7000/ATLAS_2011_S9126244.dat
data: REF_ATLAS_2011_S9126244_d29-x01-y02.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/njets-vs-pt-fb/atlas2011-dy4-5/7000/ATLAS_2011_S9126244.dat
data: REF_CMS_2011_S9086218_d01-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y0.5/7000/CMS_2011_S9086218.dat
data: REF_CMS_2011_S9086218_d02-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y1.0/7000/CMS_2011_S9086218.dat
data: REF_CMS_2011_S9086218_d03-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y1.5/7000/CMS_2011_S9086218.dat
data: REF_CMS_2011_S9086218_d04-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y2.0/7000/CMS_2011_S9086218.dat
data: REF_CMS_2011_S9086218_d05-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y2.5/7000/CMS_2011_S9086218.dat
data: REF_CMS_2011_S9086218_d06-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y3.0/7000/CMS_2011_S9086218.dat
data: REF_CMS_2012_I1087342_d01-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2012-fjets-inclforward/7000/CMS_2012_I1087342.dat
data: REF_CMS_2012_I1087342_d02-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2012-fjets-forward/7000/CMS_2012_I1087342.dat
data: REF_CMS_2012_I1087342_d03-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2012-fjets-central/7000/CMS_2012_I1087342.dat
ERROR: following histograms should be produced according to run parameters, but missing from Rivet output:
CMS_2012_I1102908_d01-x01-y01
CMS_2012_I1102908_d02-x01-y01
check mapping of above histograms in configuration file: configuration/rivet-histograms.map
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0
Thanks, I thought I had confused it somehow by only allowing the multicore app to use a single core. Still ... after not getting a new job, should it not self-terminate after 12hrs rather than the full 18?
Joined: 14 Jan 10 Posts: 1281 Credit: 8,508,235 RAC: 2,678
> Still ... after not getting a new job, should it not self-terminate after 12hr rather than the full 18?

Normally yes, but even when you think a job ended normally, it could be a resumed one that ends within the 40 minutes I mentioned earlier. In that case Condor is not aware of it and will not close the VM, so BOINC has to do that at the 18-hour limit. Or do it yourself with this script (you'll have to change boincpath):

@echo off
set "slotdir="
set /p "slotdir=In which slot-directory is the endless Theory task running you want to kill? "
set boincpath="D:\Boinc1\slots\%slotdir%\shared"
copy /y NUL %boincpath%\shutdown >NUL
exit
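For Linux hosts the same trick works: an empty shutdown file in the slot's shared directory tells the wrapper to stop that VM. A sketch under that assumption; kill_theory is an illustrative name, and the slot path must be adjusted to your BOINC data directory.

```shell
#!/bin/sh
# Sketch of the batch script above for Linux hosts. Creating an empty
# "shutdown" file in a slot's shared directory signals the wrapper to
# shut the VM down. The slot path is an assumption; adjust as needed.
kill_theory() {
    shared="$1/shared"
    [ -d "$shared" ] || { echo "no such slot: $1" >&2; return 1; }
    touch "$shared/shutdown"
    echo "shutdown file placed in $shared"
}
# Example: kill_theory /var/lib/boinc-client/slots/3
```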
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0
2 more Tasks sat idle for 12hrs today. I've set "Switch between apps" to 1000mins, so it's not a suspend/resume issue. Maybe the server we're now connecting to with this latest app isn't up to all the extra traffic it's getting and, as computezrmle suggested, the connection is timing out? In which case, should the VM not be shut down at that point? A manual VM reset forces a new connection and the tasks will run, doing useful work again, until the 18hr cutoff. I'm getting full credit for the ones that have had idle time (I actually think I've been getting too much credit recently, compared to previous amounts), but I'd rather those credits were earned fairly by doing actual MCPlots events.
Joined: 24 Oct 04 Posts: 1129 Credit: 49,762,040 RAC: 7,480
Well, things have been running fine here, but something made me check just now and I see I got about 25 of these, all in the same hour:

[ERROR] Condor exited after 808s without running a job.
2018-09-18 19:47:46 (9884): Guest Log: [INFO] Shutting Down.
2018-09-18 19:47:46 (9884): VM Completion File Detected.
2018-09-18 19:47:46 (9884): VM Completion Message: Condor exited after 808s without running a job.

So I will watch for a while before I suspend these Theory tasks. The ones I just watched starting up did get past the "HTCondor Ping - OK (0)".
Joined: 2 May 07 Posts: 2114 Credit: 159,883,105 RAC: 90,504
MC-Production was stopped yesterday afternoon: http://mcplots-dev.cern.ch/production.php?view=status&plots=hourly#plots
Joined: 24 Oct 04 Posts: 1129 Credit: 49,762,040 RAC: 7,480
> MC-Production is stopped yesterday afternoon:

They stopped working here about 2 hours ago, so I am letting the ones that have been running for 5+ hours continue and have suspended all the rest. A few tasks finished and were reported Valid in the last 15 minutes. I will take another look later tonight (9:20pm right now). My old X86 can't even get new work to turn into errors right now.
Joined: 18 Dec 15 Posts: 1691 Credit: 104,407,616 RAC: 122,906
Indeed, since last night I have received the following error notice for all my Theory tasks:

207 (0x000000CF) EXIT_NO_SUB_TASKS

stderr says:

2018-09-19 05:15:12 (13752): Guest Log: [ERROR] No jobs were available to run.

Any idea when production of jobs will be resumed?
Joined: 20 Jun 14 Posts: 378 Credit: 238,712 RAC: 0
Yes, we are out of jobs. None are being provided by the mcplots server, so the Theory tasks were automatically stopped. This was probably caused by the ongoing intervention to reboot all hypervisors in our data centre.
©2024 CERN