Thread 'w-b3_0_job. killing everything'

Author	Message
Ananas Joined: 17 Jul 05 Posts: 102 Credit: 542,016 RAC: 0	Message 26922 - Posted: 22 Oct 2014, 21:33:53 UTC Last modified: 22 Oct 2014, 21:37:25 UTC At 100%, when the WU writes massive amounts of those "fort" files, the I/O is so heavy that the BOINC core client no longer gets control and "forgets" to update the heartbeat timestamp for several minutes (while the CPU efficiency drops to ~10%). Before they reach 100%, they behave quite normally (for ~45 minutes on my C2Q Windows box), but once they reach 100%, they block everything for several minutes: every HDD access on the PC becomes extremely sluggish and all BOINC tasks (including the LHC job itself) throw heartbeat errors, while the CPU time on w-b3_0_job barely increases at all. It might work on SCSI HDDs with a huge cache and on SSDs, but for running on a "normal" PC this needs a redesign. It might become even worse if a virus scanner tried to grab and check those files, but a single(!) w-b3_0_job was already too much for my box. I guess if 4 of those had been running at 100%, I wouldn't even have had a chance to abort them. P.S.: A Linux wingman seems to have survived the workunit that caused the trouble for me, so it might be Windows related. *ix systems tend not to block everything when I/O activity is pending. Maybe you should mark them to be sent solely to Linux.

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 26933 - Posted: 31 Oct 2014, 14:11:12 UTC - in response to Message 26922. Sorry; just saw this. I replied to another thread. Didn't think it was so bad... maybe your "Linux only" might be for the best. Eric.

yo2013 Joined: 16 Oct 13 Posts: 59 Credit: 342,408 RAC: 0	Message 26940 - Posted: 1 Nov 2014, 7:54:01 UTC - in response to Message 26933. Ananas: for the time being, you can try using a ramdisk, if you have enough RAM.

Ray Murray Volunteer moderator Joined: 29 Sep 04 Posts: 281 Credit: 11,888,115 RAC: 0	Message 26941 - Posted: 1 Nov 2014, 18:10:38 UTC Last modified: 1 Nov 2014, 18:19:49 UTC vLHC VMs don't like these ones either, throwing I/O errors within the VM. I've suspended it while these w-b3_s go through and will try a manual reset of the VM just in case it is salvageable, but I suspect I will have to give it a graceful end. I think it's the disk activity rather than memory usage that is the issue. Ordinary job_tracking, job_corr etc. are fine with vLHC; it's just the HL ones that interfere. Both my machines do very little other than the various CERN projects, so I can micro-manage them when I spot any of these bullies coming in.