w-b3_0_job. killing everything
Author | Message |
---|---|
Joined: 17 Jul 05 Posts: 102 Credit: 542,016 RAC: 0 |
At 100%, when the WU writes massive amounts of those "fort" files, the I/O is so heavy that the BOINC core client no longer gets any control and "forgets" to update the heartbeat timestamp for several minutes (while the CPU efficiency drops to ~10%). Before they reach 100%, they behave quite normally (for ~45 minutes on my C2Q Windows box), but once they reach 100% they block everything for several minutes: every HDD access on the PC becomes extremely sluggish and all BOINC tasks (including the LHC job itself) throw heartbeat errors, while the CPU time on w-b3_0_job barely increases at all. It is possible that it would work on SCSI HDDs with a huge cache and on SSDs, but for running on a "normal" PC this needs a redesign. It might become even worse if a virus scanner tried to grab and check those files, but the single(!) w-b3_0_job was already too much for my box - I guess if 4 of those had been running at 100%, I wouldn't even have had a chance to abort them. P.S.: A Linux wingman seems to have survived the workunit that caused the trouble for me, so it might be Windows related. *ix systems tend not to block everything when I/O activity is pending. Maybe you should flag them to be sent solely to Linux hosts. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Sorry; just saw this. I replied to another thread. Didn't think it was so bad... maybe your "Linux only" suggestion might be for the best. Eric. |
Joined: 16 Oct 13 Posts: 59 Credit: 342,408 RAC: 0 |
Ananas: for the time being, you can try using a ramdisk, if you have enough RAM. |
Joined: 29 Sep 04 Posts: 281 Credit: 11,859,285 RAC: 0 |
vLHC VMs don't like these ones either; they throw I/O errors within the VM. I've suspended mine while these w-b3_s go through and will try a manual reset of the VM in case it is salvageable, but I suspect I will have to give it a graceful end. I think it's the disk activity rather than memory usage that is the issue. Ordinary job_tracking, job_corr etc. are fine with vLHC; it's just the HL ones that interfere. Both my machines do very little other than the various CERN projects, so I can micro-manage them when I spot any of these bullies coming in. |