Message boards : Theory Application : New Version 263.70
Joined: 13 Apr 18 Posts: 443 Credit: 8,438,885 RAC: 0
Yup. 2 explanations:

1) You have hyper-threading (HT) enabled on the Intel. HT is fine for workloads that are not maths-intensive, but when the CPU is doing intensive maths, especially floating-point calculations, as we do in many BOINC projects, HT is a bad idea. Each physical core has only one set of floating-point units (FPU), shared by its two hyper-threads. If you are crunching 4 tasks on 2 cores, you have pairs of threads competing for the same FPU, which causes many cache misses and lower overall throughput.

2) Credits are completely fubar. Don't believe credits or any conclusion/theory based on credits.
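For anyone who wants to check whether HT is actually on, a minimal sketch for a Linux host: compare logical CPUs with physical cores (single-socket assumption; /proc/cpuinfo may lack "core id" lines inside some VMs, so treat the result as a hint, not gospel).

```shell
#!/bin/sh
# Rough sketch: if logical CPUs outnumber physical cores,
# SMT/hyper-threading is enabled. Single-socket assumption;
# some VMs expose no "core id" lines at all.
logical=$(grep -c '^processor' /proc/cpuinfo)
physical=$(awk -F': ' '/^core id/ {print $2}' /proc/cpuinfo | sort -u | wc -l)
echo "logical CPUs: $logical, physical cores: $physical"
if [ "$physical" -gt 0 ] && [ "$logical" -gt "$physical" ]; then
    echo "hyper-threading appears to be enabled"
else
    echo "hyper-threading appears to be disabled (or undetectable here)"
fi
```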
Joined: 18 Dec 15 Posts: 1691 Credit: 104,407,616 RAC: 122,906
> Yup. 2 explanations:

I am crunching 1 task only, and no other projects or apps are running (hence the Windows Task Manager shows 25% CPU load).
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0
My machines seem to work best with single-core Theorys, and I'm happy with that, but recently I've noticed a few tasks that have been sitting idle after finishing a job and not requesting another. After 12hrs that would be expected and the VM would shut down, but some of these can show this behaviour after only a few hours and will sulk until the 18hr autokill. I have even caught a couple that have stuck part-way through a job, cursor flashing in the console as if it is working on the next event but with no CPU activity. If I intervene and manually reset the VM, it will reboot, fetch work as normal and continue happily until the 18hr autokill. I'll let the current tasks complete, then dump the .vdi to force a new download just in case it has somehow become corrupted, but this has happened on 2 of 3 hosts so it might not solve the issue. As is often the case, it is random and intermittent, so not repeatable or easy to identify.

From StartLog, showing a normal job finish, a new start, and the final line:

09/14/18 12:41:38 Called deactivate_claim_forcibly()
09/14/18 12:41:38 Starter pid 26901 exited with status 0
09/14/18 12:41:38 State change: starter exited
09/14/18 12:41:38 Changing activity: Busy -> Idle
09/14/18 12:41:40 Got activate_claim request from shadow (188.184.94.254)
09/14/18 12:41:40 Remote job ID is 469815.391
09/14/18 12:41:40 Got universe "VANILLA" (5) from request classad
09/14/18 12:41:40 State change: claim-activation protocol successful
09/14/18 12:41:40 Changing activity: Idle -> Busy
09/14/18 12:57:32 Called deactivate_claim_forcibly()
09/14/18 12:57:32 Starter pid 30282 exited with status 0
09/14/18 12:57:32 State change: starter exited
09/14/18 12:57:32 Changing activity: Busy -> Idle
09/14/18 12:57:32 Got activate_claim request from shadow (188.184.94.254)
09/14/18 12:57:32 Remote job ID is 469815.652
09/14/18 12:57:32 Got universe "VANILLA" (5) from request classad
09/14/18 12:57:32 State change: claim-activation protocol successful
09/14/18 12:57:32 Changing activity: Idle -> Busy
09/14/18 19:12:19 CronJob: Job 'multicore' is still running!

(its own exclamation mark, not mine)

And stdout.log:

12:57:37 +0200 2018-09-14 [INFO] New Job Starting in slot1
12:57:37 +0200 2018-09-14 [INFO] Condor JobID: 469815.652 in slot1
12:57:42 +0200 2018-09-14 [INFO] MCPlots JobID: 46329784 in slot1
19:11:33 +0200 2018-09-14 [INFO] Job finished in slot1 with 0.

but no request for new work, even at only 11hrs elapsed.
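For anyone chasing the same symptom, the gaps between a "Busy -> Idle" line and the next "Idle -> Busy" line can be pulled out of a StartLog automatically. A rough sketch, not anything official: it assumes GNU date, and the scan_idle name and 10-minute threshold are mine.

```shell
#!/bin/sh
# Sketch: report idle periods longer than 10 minutes between one job
# finishing ("Busy -> Idle") and the next starting ("Idle -> Busy").
# Assumes GNU date and the timestamp format seen in the StartLog above.
scan_idle() {
    grep 'Changing activity' "$1" | {
        idle_start=""
        while read -r d t rest; do
            now=$(date -d "$d $t" +%s)
            case "$rest" in
                *'Busy -> Idle'*) idle_start=$now ;;
                *'Idle -> Busy'*)
                    if [ -n "$idle_start" ]; then
                        gap=$((now - idle_start))
                        [ "$gap" -gt 600 ] && echo "idle ${gap}s before $d $t"
                        idle_start=""
                    fi ;;
            esac
        done
    }
}
# Example: scan_idle /path/to/slot/shared/StartLog
```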
Joined: 15 Jun 08 Posts: 2420 Credit: 227,015,758 RAC: 131,395
> ... recently I've noticed a few tasks that have been sitting idle after finishing a job and not requesting another.

This can also be caused by a busy network or server at CERN. It occasionally affects Theory, but at the moment it looks like LHCb suffers a lot from failing uploads of intermediate job results. In that case port 9148 at lbboinc01.cern.ch responds only after a huge delay of several minutes (!), and as a result of this delay the response is rejected by the firewall. Things like that are very hard to identify.
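A delay like that can be spotted from the client side by timing a plain TCP connect. A rough sketch, not an official diagnostic: it assumes bash (for /dev/tcp) and the coreutils timeout command, and probe is just an illustrative name.

```shell
#!/bin/sh
# Sketch: time how long a TCP connect to a host:port takes, to spot
# the multi-minute delays described above. Assumes bash and timeout.
probe() {
    host=$1; port=$2
    start=$(date +%s)
    if timeout 300 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "connected to $host:$port after $(( $(date +%s) - start ))s"
    else
        echo "no answer from $host:$port (gave up after $(( $(date +%s) - start ))s)"
    fi
}
# Example: probe lbboinc01.cern.ch 9148
```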
Joined: 14 Jan 10 Posts: 1281 Credit: 8,508,235 RAC: 2,678
I've seen this idling very regularly. Mostly it happens after a longer suspend and subsequent resume of the job. Usually events are processed again for another 40 minutes or so, and then the idling of the VM suddenly starts. I've reported it here and at the Dev-project's forum. Another issue: when a job inside the VM errors out, the job is not killed, no new job is requested, and the VM idles until the 18 hours are up. E.g.:

===> [runRivet] Fri Sep 14 16:25:01 CEST 2018 [boinc pp jets 7000 10 - pythia6 6.428 z2 100000 316]
. . .
data: REF_ATLAS_2011_S9126244_d29-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/njets-vs-pt-lj/atlas2011-dy4-5/7000/ATLAS_2011_S9126244.dat
data: REF_ATLAS_2011_S9126244_d29-x01-y02.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/njets-vs-pt-fb/atlas2011-dy4-5/7000/ATLAS_2011_S9126244.dat
data: REF_CMS_2011_S9086218_d01-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y0.5/7000/CMS_2011_S9086218.dat
data: REF_CMS_2011_S9086218_d02-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y1.0/7000/CMS_2011_S9086218.dat
data: REF_CMS_2011_S9086218_d03-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y1.5/7000/CMS_2011_S9086218.dat
data: REF_CMS_2011_S9086218_d04-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y2.0/7000/CMS_2011_S9086218.dat
data: REF_CMS_2011_S9086218_d05-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y2.5/7000/CMS_2011_S9086218.dat
data: REF_CMS_2011_S9086218_d06-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2011-y3.0/7000/CMS_2011_S9086218.dat
data: REF_CMS_2012_I1087342_d01-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2012-fjets-inclforward/7000/CMS_2012_I1087342.dat
data: REF_CMS_2012_I1087342_d02-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2012-fjets-forward/7000/CMS_2012_I1087342.dat
data: REF_CMS_2012_I1087342_d03-x01-y01.dat -> /var/lib/condor/execute/dir_14330/dat/pp/jets/pt/cms2012-fjets-central/7000/CMS_2012_I1087342.dat
ERROR: following histograms should be produced according to run parameters, but missing from Rivet output:
CMS_2012_I1102908_d01-x01-y01
CMS_2012_I1102908_d02-x01-y01
check mapping of above histograms in configuration file: configuration/rivet-histograms.map
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0
Thanks, I thought I had confused it somehow by only allowing the multicore app to use a single core. Still ... after not getting a new job, should it not self-terminate after 12hrs rather than the full 18?
Joined: 14 Jan 10 Posts: 1281 Credit: 8,508,235 RAC: 2,678
> Still ... after not getting a new job, should it not self-terminate after 12hr rather than the full 18?

Normally yes, but even when you think a job ended normally, it could be a resumed one that ends within the 40 minutes I mentioned earlier. In that case Condor is not aware of it and will not close the VM, so BOINC has to do that at the 18-hour limit. Or do it yourself with this script (you'll have to change boincpath):

@echo off
set "slotdir="
set /p "slotdir=In which slot-directory is the endless Theory task running you want to kill? "
set boincpath="D:\Boinc1\slots\%slotdir%\shared"
copy /y NUL %boincpath%\shutdown >NUL
exit
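For Linux hosts the same trick works: an empty shutdown file in the slot's shared directory tells the wrapper to stop that VM. A sketch under that assumption; kill_theory is an illustrative name, and the slot path must be adjusted to your BOINC data directory.

```shell
#!/bin/sh
# Sketch of the batch script above for Linux hosts. Creating an empty
# "shutdown" file in a slot's shared directory signals the wrapper to
# shut the VM down. The slot path is an assumption; adjust as needed.
kill_theory() {
    shared="$1/shared"
    [ -d "$shared" ] || { echo "no such slot: $1" >&2; return 1; }
    touch "$shared/shutdown"
    echo "shutdown file placed in $shared"
}
# Example: kill_theory /var/lib/boinc-client/slots/3
```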
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0
2 more Tasks sat idle for 12hrs today. I've set "Switch between apps" to 1000mins, so it's not a suspend/resume issue. Maybe the server we're now connecting to with this latest app isn't up to all the extra traffic it's getting and, as computezrmle suggested, the connection is timing out? In which case, should the VM not be shut down at that point? A manual VM reset forces a new connection and the tasks will run, doing useful work again, until the 18hr cutoff. I'm getting full credit for the ones that have had idle time (I actually think I've been getting too much credit recently, compared to previous amounts), but I'd rather those credits were earned fairly by doing actual MCPlots events.
Joined: 24 Oct 04 Posts: 1129 Credit: 49,762,040 RAC: 7,480
Well, things have been running fine here, but something made me check just now and I see I got about 25 of these, all in the same hour:

[ERROR] Condor exited after 808s without running a job.
2018-09-18 19:47:46 (9884): Guest Log: [INFO] Shutting Down.
2018-09-18 19:47:46 (9884): VM Completion File Detected.
2018-09-18 19:47:46 (9884): VM Completion Message: Condor exited after 808s without running a job.

So I will watch for a while before I suspend these Theory tasks. The ones I just watched starting up did get past the "HTCondor Ping - OK (0)".
Joined: 2 May 07 Posts: 2114 Credit: 159,883,105 RAC: 90,504
MC-Production was stopped yesterday afternoon: http://mcplots-dev.cern.ch/production.php?view=status&plots=hourly#plots
Joined: 24 Oct 04 Posts: 1129 Credit: 49,762,040 RAC: 7,480
> MC-Production is stopped yesterday afternoon:

They stopped working here about 2 hours ago, so I am letting the ones that have been running for 5+ hours continue and have suspended all the rest. A few tasks finished and were reported Valid in the last 15 minutes. I will take another look later tonight (9:20pm right now). My old X86 can't even get new work to turn into errors right now.
Joined: 18 Dec 15 Posts: 1691 Credit: 104,407,616 RAC: 122,906
Indeed, since last night I have received the following error notice for all my Theory tasks:

207 (0x000000CF) EXIT_NO_SUB_TASKS

stderr says:

2018-09-19 05:15:12 (13752): Guest Log: [ERROR] No jobs were available to run.

Any idea when production of jobs will be resumed?
Joined: 20 Jun 14 Posts: 378 Credit: 238,712 RAC: 0
Yes, we are out of jobs. None are being provided by the mcplots server, so the Theory tasks were automatically stopped. This was probably caused by the ongoing intervention to reboot all hypervisors in our data centre.
©2024 CERN