Message boards : Number crunching : w-b3_0_job. killing everything
Ananas
Joined: 17 Jul 05
Posts: 102
Credit: 542,016
RAC: 0
Message 26922 - Posted: 22 Oct 2014, 21:33:53 UTC
Last modified: 22 Oct 2014, 21:37:25 UTC

At 100%, when the WU writes massive amounts of those "fort" files, the I/O is so heavy that the BOINC core client no longer gets control and "forgets" to update the heartbeat timestamp for several minutes (while CPU efficiency drops to ~10%).

Before they reach 100%, they behave quite normally (for ~45 minutes on my C2Q Windows box), but at 100% they block everything for several minutes: every HDD access on the PC becomes extremely sluggish and all BOINC tasks (including the LHC job itself) throw heartbeat errors for several minutes, while the CPU time on w-b3_0_job hardly increases at all.

It is possible that it would work on SCSI HDDs with a huge cache or on SSDs, but to run on a "normal" PC this needs a redesign. It might get even worse if a virus scanner tried to grab and check those files, but the single(!) w-b3_0_job was already too much for my box. I guess if four of those had been running at 100%, I wouldn't even have had a chance to abort them.
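The failure mode described above, a burst of many small files all hitting the disk at once, can be sketched in a few lines of shell (illustrative only; the fort.* names mirror the post, not the actual SixTrack output logic, and the file count is an arbitrary example):

```shell
#!/bin/sh
# Illustrative sketch: write many small "fort"-style files in a tight loop,
# then force them all to disk at once. On a loaded spinning disk, the sync
# is where the stall described above would show up.
mkdir -p fortdemo && cd fortdemo
i=1
while [ "$i" -le 200 ]; do
    printf 'record %d\n' "$i" > "fort.$i"
    i=$((i + 1))
done
sync   # flush the whole burst to disk in one go
echo "wrote $(ls fort.* | wc -l | tr -d ' ') files"
```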

p.s.: A Linux wingman seems to have survived the workunit that caused the trouble for me, so it might be Windows-related. *ix systems tend not to block everything while I/O activity is pending. Maybe you should flag these to be sent solely to Linux hosts.
ID: 26922
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 26933 - Posted: 31 Oct 2014, 14:11:12 UTC - in response to Message 26922.  

Sorry; just saw this. I replied to another thread.

Didn't think it was so bad... maybe your "Linux only"
suggestion might be for the best. Eric.
ID: 26933
yo2013
Joined: 16 Oct 13
Posts: 59
Credit: 342,408
RAC: 0
Message 26940 - Posted: 1 Nov 2014, 7:54:01 UTC - in response to Message 26933.  

Ananas: for the time being, you can try using a ramdisk if you have enough RAM.
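On a Linux host, the ramdisk suggestion could look something like the sketch below (the mount point, size, and BOINC slots path are assumptions for illustration, not the only valid layout; stop the client before rebinding, and note that tmpfs contents vanish on reboot, so checkpoints would be lost):

```shell
# Assumption: Linux host with a standard boinc-client package layout.
sudo mkdir -p /mnt/boinc-ram
sudo mount -t tmpfs -o size=512M tmpfs /mnt/boinc-ram   # RAM-backed filesystem
# Then, with the BOINC client stopped, bind the slots directory
# (where running tasks write their fort.* files) onto the ramdisk:
# sudo mount --bind /mnt/boinc-ram /var/lib/boinc-client/slots
```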
ID: 26940
Ray Murray
Volunteer moderator
Joined: 29 Sep 04
Posts: 281
Credit: 11,869,905
RAC: 65
Message 26941 - Posted: 1 Nov 2014, 18:10:38 UTC
Last modified: 1 Nov 2014, 18:19:49 UTC

vLHC VMs don't like these ones either, throwing I/O errors within the VM. I've suspended it while these w-b3_s go through and will try a manual reset of the VM in case it is salvageable, but I suspect I will have to give it a graceful end.
I think it's the disk activity rather than memory usage that is the issue.

Ordinary job_tracking, job_corr etc. are fine with vLHC; it's just the HL ones that interfere. Both my machines do very little other than the various CERN projects, so I can micro-manage them when I spot any of these bullies coming in.
ID: 26941