Message boards : Number crunching : A sudden huge increase in computation errors

Spatzthecat

Joined: 2 Apr 10
Posts: 15
Credit: 8,604,036
RAC: 0
Message 27698 - Posted: 14 Mar 2016, 17:29:40 UTC

I normally have very few computation errors but today 14/03/16 there has been a sudden massive increase.
ID: 27698
Stevan Radovic

Joined: 20 Apr 11
Posts: 3
Credit: 267,858
RAC: 0
Message 27699 - Posted: 14 Mar 2016, 19:58:01 UTC

Same here. I also noticed dips in CPU performance from time to time, like it stops working for a couple of seconds and then resumes. I think it corresponds to broken WUs but I can't be sure.
ID: 27699
Stevan Radovic

Joined: 20 Apr 11
Posts: 3
Credit: 267,858
RAC: 0
Message 27700 - Posted: 14 Mar 2016, 20:00:09 UTC

30 WUs with errors just today...
ID: 27700
Spatzthecat

Joined: 2 Apr 10
Posts: 15
Credit: 8,604,036
RAC: 0
Message 27701 - Posted: 14 Mar 2016, 20:13:45 UTC - in response to Message 27700.  

50 and counting.
ID: 27701
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27702 - Posted: 14 Mar 2016, 21:41:49 UTC

Are these errors all from the same workunit sequence as this one?

wjt-15-L1-trc_jt-hl1TR-bb-L1__3__s__62.31_60.32__4_6__6__58.5_1_sixvf_boinc260

14/03/2016 21:33:29 | LHC@home 1.0 | Aborting task wjt-15-L1-trc_jt-hl1TR-bb-L1__3__s__62.31_60.32__4_6__6__58.5_1_sixvf_boinc260_3: exceeded disk limit: 651.56MB > 572.20MB
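
A small sketch of how the two figures in a message like this could be pulled out and compared. The regular expression and the function name `parse_disk_limit` are illustrative assumptions based only on the log line quoted above; they are not part of BOINC, and other client versions may word the message differently.

```python
import re

# Pattern modelled on the single log line quoted above; other BOINC
# versions may word the message differently (assumption).
DISK_LIMIT_RE = re.compile(
    r"Aborting task (?P<task>\S+):\s*"
    r"exceeded disk limit: (?P<used>[\d.]+)MB > (?P<limit>[\d.]+)MB"
)


def parse_disk_limit(line):
    """Return (task, used_mb, limit_mb), or None if the line does not match."""
    match = DISK_LIMIT_RE.search(line)
    if match is None:
        return None
    return match["task"], float(match["used"]), float(match["limit"])


line = (
    "14/03/2016 21:33:29 | LHC@home 1.0 | Aborting task "
    "wjt-15-L1-trc_jt-hl1TR-bb-L1__3__s__62.31_60.32__4_6__6__58.5_1_sixvf_boinc260_3: "
    "exceeded disk limit: 651.56MB > 572.20MB"
)
task, used_mb, limit_mb = parse_disk_limit(line)
print(f"{task} used {used_mb - limit_mb:.2f} MB more than its limit")
```

Lines that do not contain the message simply return `None`, so the same function could be run over a whole saved event log.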
ID: 27702
USTL-FIL (Lille Fr)

Joined: 11 Dec 09
Posts: 27
Credit: 236,744,737
RAC: 70
Message 27703 - Posted: 14 Mar 2016, 22:03:56 UTC

900 today for disk limit exceeded!
ID: 27703
Spatzthecat

Joined: 2 Apr 10
Posts: 15
Credit: 8,604,036
RAC: 0
Message 27704 - Posted: 14 Mar 2016, 22:18:55 UTC

The last 2 units to fail were

wjt-15-L1-trc_jt-hl1TR-bb-L1__23__s__62.31_60.32__4_6__6__18_1_sixvf_boinc2531_0
wjt-15-L1-trc_jt-hl1TR-bb-L1__23__s__62.31_60.32__4_6__6__18_1_sixvf_boinc2531_2
ID: 27704
Spatzthecat

Joined: 2 Apr 10
Posts: 15
Credit: 8,604,036
RAC: 0
Message 27705 - Posted: 14 Mar 2016, 22:21:30 UTC

wjt-15-L1-trc_jt-hl1TR-bb-L1__23__s__62.31_60.32__4_6__6__18_1_sixvf_boinc2531_2

196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED
ID: 27705
Stevan Radovic

Joined: 20 Apr 11
Posts: 3
Credit: 267,858
RAC: 0
Message 27706 - Posted: 14 Mar 2016, 22:53:34 UTC

196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED on most of mine. I also have up to 10 with a 'Canceled by server' message. I'm assuming they withdrew those.
ID: 27706
Spatzthecat

Joined: 2 Apr 10
Posts: 15
Credit: 8,604,036
RAC: 0
Message 27707 - Posted: 15 Mar 2016, 0:17:37 UTC - in response to Message 27706.  

I have only had a couple of those
ID: 27707
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27708 - Posted: 15 Mar 2016, 3:28:53 UTC - in response to Message 27698.  

Hmmmm, that's bad. CERN is having a lot of problems with
infrastructure today :-( and maybe yesterday. I am actually
travelling so can't do much right now. I'll try and pass the message
and I shall look again this afternoon. Thanks for your message.
Eric.
ID: 27708
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 242
Credit: 5,800,306
RAC: 0
Message 27709 - Posted: 15 Mar 2016, 7:51:28 UTC
Last modified: 15 Mar 2016, 7:51:50 UTC

Your BOINC client might have run out of available disk space. If your PC has space on disk, you can allocate more to BOINC as shown here:

https://boinc.berkeley.edu/wiki/Local_preferences
ID: 27709
Spatzthecat

Joined: 2 Apr 10
Posts: 15
Credit: 8,604,036
RAC: 0
Message 27710 - Posted: 15 Mar 2016, 11:16:50 UTC - in response to Message 27709.  
Last modified: 15 Mar 2016, 11:26:34 UTC

Double-checked preferences and I don't think it's that, with 100GB allowed, but I have increased this to 150GB.
ID: 27710
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27711 - Posted: 15 Mar 2016, 13:53:48 UTC

The user is looking at the study; he may be exceeding the estimate. Eric.
ID: 27711
Richard Haselgrove

Joined: 27 Oct 07
Posts: 186
Credit: 3,297,640
RAC: 0
Message 27712 - Posted: 15 Mar 2016, 14:01:34 UTC - in response to Message 27709.  

No, I don't think that's it.

exceeded disk limit: 651.56MB > 572.20MB

That particular machine is nearly brand new, with a 1 TB data disk - it's reporting 928.88 GB free, and BOINC is allowed to use 100.00 GB of that.

Other tasks from both LHC and other projects continued running normally. It's more likely that the wjt-15 tasks were given a workunit <rsc_disk_bound> of 600,000,000 - and exceeded even that. I hope it wasn't a result file the experimenter wanted uploading...
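
As a rough check of that guess: assuming the client reports sizes in units of 1024 × 1024 bytes (an assumption, not confirmed in this thread), a bound of 600,000,000 bytes works out to the 572.20MB shown in the log line. The helper name below is made up for illustration.

```python
BYTES_PER_MB = 1024 * 1024  # assumed unit behind the "MB" in the log line


def bytes_to_mb(n_bytes):
    """Convert a byte count to megabytes of 1024 * 1024 bytes."""
    return n_bytes / BYTES_PER_MB


disk_bound_bytes = 600_000_000  # guessed <rsc_disk_bound> value from the post
limit_mb = bytes_to_mb(disk_bound_bytes)
used_mb = 651.56  # usage reported in the log line

print(f"limit: {limit_mb:.2f}MB")  # limit: 572.20MB
print(f"over the limit: {used_mb > limit_mb}")
```

The match with the logged limit supports the idea that the limit came from the workunit itself rather than from the BOINC disk preferences.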
ID: 27712
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27713 - Posted: 15 Mar 2016, 21:45:10 UTC

OK, these studies were writing too much data, causing the disk
limit to be exceeded. I still need to look at this, as I make an
estimate of the disk space required... still, it is
only an estimate. But there is definitely something to fix
on my side. The studies have been modified to write less
data. Thanks for all the feedback. Eric.
ID: 27713
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 27714 - Posted: 16 Mar 2016, 6:12:26 UTC

The user is writing too much data; I am supposed to calculate/estimate
disk usage but that doesn't seem to be working. I am checking all
this today. In the meantime the user has fixed the studies.

You can delete any WUs with names like
wjt-18-L1-trc......
wjt-15-L1-trc.......

More news on this when checked. Eric.
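
For anyone with a long task list, here is a small sketch of picking out the affected names. The two prefixes come from the post above; the function name and the sample list are illustrative assumptions (the first two names appear earlier in this thread, the third is made up). It only selects names and does not abort anything itself.

```python
# Prefixes of the faulty studies, as given in the post above.
BAD_PREFIXES = ("wjt-15-L1-trc", "wjt-18-L1-trc")


def affected_tasks(task_names):
    """Return the task names that start with one of the faulty prefixes."""
    return [name for name in task_names if name.startswith(BAD_PREFIXES)]


tasks = [
    "wjt-15-L1-trc_jt-hl1TR-bb-L1__3__s__62.31_60.32__4_6__6__58.5_1_sixvf_boinc260_3",
    "wjt-15-L1-trc_jt-hl1TR-bb-L1__23__s__62.31_60.32__4_6__6__18_1_sixvf_boinc2531_2",
    "some-other-study_1_sixvf_boinc1_0",  # made-up name for contrast
]
for name in affected_tasks(tasks):
    print(name)
```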
ID: 27714
BelgianEnthousiast

Joined: 5 Apr 15
Posts: 18
Credit: 5,910,849
RAC: 0
Message 27716 - Posted: 16 Mar 2016, 18:30:59 UTC - in response to Message 27713.  

Is it possible it could also hang my system?

I have only LHC running on the CPU, in parallel with GPUGrid on the GPU.
The GPU seems to work fine, but I have major issues with the LHC WUs.
At least 10 now. Never had any issues with LHC before: only one failed WU out of 15683.

I'll increase the disk size allocated to BOINC to see if that can
remedy the problem.

Please dig further to see if this could crash a system as well, apart from
failed units.

I'm running Win 7 x64 SP 1, all latest updates installed. 32 GB of RAM.
Core I7-5930 @ 3.5 GHz CPU.
I also have Atlas & Oracle VirtualBox installed.

Thanks !

BE.
ID: 27716
Spatzthecat

Joined: 2 Apr 10
Posts: 15
Credit: 8,604,036
RAC: 0
Message 27718 - Posted: 18 Mar 2016, 0:18:58 UTC - in response to Message 27716.  

Is everything else up to date, i.e. are the versions of BOINC and VirtualBox suitable for your system?
"Atlas installed" doesn't really say a lot.
ID: 27718
BelgianEnthousiast

Joined: 5 Apr 15
Posts: 18
Credit: 5,910,849
RAC: 0
Message 27720 - Posted: 20 Mar 2016, 19:02:51 UTC - in response to Message 27718.  

Hi,

Thanks for the reply.

The BOINC mgr is at version 7.6.22 (x64) - widgets 3.0.1
The VirtualBox is at version 4.03.12 r 93733
Checking the version of VB, I noticed there's a new version. I'll install
that one and check whether it runs properly now.

In the meantime:
All was running happily for about a year now.
LHC successfully completed 16,245 WUs so far with no errors at all until last week.
Atlas successfully completed 8,455 WUs, with sporadic units failing due to problems with VirtualBox mid last year and an occasional need for a hard reboot (my wrongdoing...).

But what happened last week was really drastic. I had no way of letting BOINC
run 24/7 again; it hung repeatedly after a couple of minutes running LHC and Atlas.

I expect LHC to be the culprit as it takes full precedence over Atlas. Whenever
a batch of LHC work comes in, BOINC automatically processes LHC first.

Anyway, any ideas would be most welcome. I'll try the new VB in the meantime.

Kind Regards,

BE.
ID: 27720


©2024 CERN