Thread 'A sudden huge increase in computation errors'

Author	Message
Spatzthecat Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0	Message 27698 - Posted: 14 Mar 2016, 17:29:40 UTC I normally have very few computation errors, but today (14/03/16) there has been a sudden massive increase.

Stevan Radovic Joined: 20 Apr 11 Posts: 3 Credit: 267,858 RAC: 0	Message 27699 - Posted: 14 Mar 2016, 19:58:01 UTC Same here. I have also noticed dips in CPU performance from time to time: it stops working for a couple of seconds and then resumes. I think this corresponds to broken WUs, but I can't be sure.

Stevan Radovic Joined: 20 Apr 11 Posts: 3 Credit: 267,858 RAC: 0	Message 27700 - Posted: 14 Mar 2016, 20:00:09 UTC 30 WUs with errors just today...

Spatzthecat Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0	Message 27701 - Posted: 14 Mar 2016, 20:13:45 UTC - in response to Message 27700. 50 and counting.

Richard Haselgrove Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0	Message 27702 - Posted: 14 Mar 2016, 21:41:49 UTC Are these errors all from the same workunit sequence as wjt-15-L1-trc_jt-hl1TR-bb-L1__3__s__62.31_60.32__4_6__6__58.5_1_sixvf_boinc260? Event log: 14/03/2016 21:33:29 | LHC@home 1.0 | Aborting task wjt-15-L1-trc_jt-hl1TR-bb-L1__3__s__62.31_60.32__4_6__6__58.5_1_sixvf_boinc260_3: exceeded disk limit: 651.56MB > 572.20MB
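For anyone wanting to check their own event log for the same pattern, here is a minimal sketch that pulls the task name and the two sizes out of an "exceeded disk limit" line like the one Richard quotes. The function name `parse_disk_abort` and the regular expression are illustrative assumptions based only on the single log line in this thread; they are not part of BOINC, and other client versions may word the message differently.

```python
import re

# Hypothetical pattern, derived only from the one log line quoted in this thread.
ABORT_RE = re.compile(
    r"Aborting task (?P<task>\S+): exceeded disk limit: "
    r"(?P<used>[\d.]+)MB > (?P<limit>[\d.]+)MB"
)


def parse_disk_abort(line):
    """Return task name, usage and limit (in MB) for a disk-limit abort line, else None."""
    match = ABORT_RE.search(line)
    if match is None:
        return None
    used = float(match.group("used"))
    limit = float(match.group("limit"))
    return {
        "task": match.group("task"),
        "used_mb": used,
        "limit_mb": limit,
        "over_mb": round(used - limit, 2),  # how far past the limit the task went
    }


line = (
    "14/03/2016 21:33:29 | LHC@home 1.0 | Aborting task "
    "wjt-15-L1-trc_jt-hl1TR-bb-L1__3__s__62.31_60.32__4_6__6__58.5_1_sixvf_boinc260_3: "
    "exceeded disk limit: 651.56MB > 572.20MB"
)
info = parse_disk_abort(line)
print(info)
```

Lines that do not match the pattern return `None`, so the helper can be run over a whole log file and the non-matching entries simply skipped.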

USTL-FIL (Lille Fr) Joined: 11 Dec 09 Posts: 27 Credit: 236,763,011 RAC: 0	Message 27703 - Posted: 14 Mar 2016, 22:03:56 UTC 900 today for disk limit exceeded!

Spatzthecat Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0	Message 27704 - Posted: 14 Mar 2016, 22:18:55 UTC The last 2 units to fail were wjt-15-L1-trc_jt-hl1TR-bb-L1__23__s__62.31_60.32__4_6__6__18_1_sixvf_boinc2531_0 and wjt-15-L1-trc_jt-hl1TR-bb-L1__23__s__62.31_60.32__4_6__6__18_1_sixvf_boinc2531_2

Spatzthecat Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0	Message 27705 - Posted: 14 Mar 2016, 22:21:30 UTC wjt-15-L1-trc_jt-hl1TR-bb-L1__23__s__62.31_60.32__4_6__6__18_1_sixvf_boinc2531_2: 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED

Stevan Radovic Joined: 20 Apr 11 Posts: 3 Credit: 267,858 RAC: 0	Message 27706 - Posted: 14 Mar 2016, 22:53:34 UTC 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED on most of mine. I also have up to 10 with a 'Canceled by server' message. I'm assuming they withdrew those.

Spatzthecat Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0	Message 27707 - Posted: 15 Mar 2016, 0:17:37 UTC - in response to Message 27706. I have only had a couple of those.

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27708 - Posted: 15 Mar 2016, 3:28:53 UTC - in response to Message 27698. Hmmmm, that's bad. CERN is having a lot of problems with infrastructure today :-( and maybe yesterday. I am actually travelling so can't do much right now. I'll try to pass the message on and I shall look again this afternoon. Thanks for your message. Eric.

Nils Volunteer moderator Project administrator Project developer Project tester Joined: 15 Jul 05 Posts: 254 Credit: 6,001,083 RAC: 0	Message 27709 - Posted: 15 Mar 2016, 7:51:28 UTC Last modified: 15 Mar 2016, 7:51:50 UTC Your BOINC client might have run out of available disk space. If your PC has space on disk, you can allocate more to BOINC as shown here: https://boinc.berkeley.edu/wiki/Local_preferences

Spatzthecat Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0	Message 27710 - Posted: 15 Mar 2016, 11:16:50 UTC - in response to Message 27709. Last modified: 15 Mar 2016, 11:26:34 UTC Double-checked my preferences and I don't think it's that, with 100GB allowed, but I have increased this to 150GB.

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27711 - Posted: 15 Mar 2016, 13:53:48 UTC The user is looking at the study; he may be exceeding the estimate. Eric.

Richard Haselgrove Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0	Message 27712 - Posted: 15 Mar 2016, 14:01:34 UTC - in response to Message 27709. Nils wrote: "Your BOINC client might have run out of available disk space. If your PC has space on disk, you can allocate more to BOINC as shown here: https://boinc.berkeley.edu/wiki/Local_preferences" No, I don't think that's it. The log says "exceeded disk limit: 651.56MB > 572.20MB". That particular machine is nearly brand new, with a 1 TB data disk - it's reporting 928.88 GB free, and BOINC is allowed to use 100.00 GB of that. Other tasks from both LHC and other projects continued running normally. It's more likely that the wjt-15 tasks were given a workunit <rsc_disk_bound> of 600,000,000 - and exceeded even that. I hope it wasn't a result file the experimenter wanted uploading...
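Richard's figure can be checked with a line of arithmetic: 600,000,000 bytes divided by 1024 × 1024 is about 572.20, which matches the limit in the log message. The sketch below shows that conversion; it assumes the client's "MB" means 2**20 bytes (an inference from the numbers matching, not something stated in the thread), and the variable and function names are illustrative only.

```python
# Assumption: the "MB" in the log message means 2**20 bytes (binary megabytes).
BYTES_PER_MB = 1024 * 1024


def bytes_to_mb(n_bytes):
    """Convert a byte count to binary megabytes."""
    return n_bytes / BYTES_PER_MB


# Richard's suggested <rsc_disk_bound> value, in bytes.
bound_bytes = 600_000_000
limit_mb = bytes_to_mb(bound_bytes)

# Usage reported in the abort message, in MB.
used_mb = 651.56

print(f"limit: {limit_mb:.2f}MB")          # limit: 572.20MB
print(f"over by: {used_mb - limit_mb:.2f}MB")
```

So the 572.20MB limit is consistent with a per-workunit bound of 600,000,000 bytes rather than with the 100 GB BOINC-wide disk preference, which is why raising the preference would not be expected to help.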

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27713 - Posted: 15 Mar 2016, 21:45:10 UTC OK, these studies were writing too much data, causing the disk limit to be exceeded. I still need to look at this, as I make an estimate of the disk space required... still, it is only an estimate. But there is definitely something to fix on my side. The studies have been modified to write less data. Thanks for all the feedback. Eric.

Eric Mcintosh Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0	Message 27714 - Posted: 16 Mar 2016, 6:12:26 UTC The user is writing too much data; I am supposed to calculate/estimate disk usage but that doesn't seem to be working. I am checking all this today. In the meantime the user has fixed the studies. You can delete any WUs with names like wjt-18-L1-trc...... or wjt-15-L1-trc....... More news on this when checked. Eric.

BelgianEnthousiast Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0	Message 27716 - Posted: 16 Mar 2016, 18:30:59 UTC - in response to Message 27713. Is it possible it could also hang my system? I have only LHC running on the CPU, in parallel with GPUGrid on the GPU. The GPU seems to work fine, but I have major issues with the LHC WUs. At least 10 now. Never had any issues with LHC before: only one failed WU out of 15683. I'll increase the disk size allocated to BOINC to see if that can remedy the problem. Please dig further to see if this could crash a system as well, apart from failed units. I'm running Win 7 x64 SP 1, all latest updates installed. 32 GB of RAM. Core I7-5930 @ 3.5 GHz CPU. I also have Atlas & Oracle Virtualbox installed. Thanks! BE.

Spatzthecat Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0	Message 27718 - Posted: 18 Mar 2016, 0:18:58 UTC - in response to Message 27716. BelgianEnthousiast wrote: "Is it possible it could also hang my system? I have only LHC running on the CPU, in parallel with GPUGrid on the GPU. The GPU seems to work fine, but I have major issues with the LHC WUs. At least 10 now. Never had any issues with LHC before: only one failed WU out of 15683. I'll increase the disk size allocated to BOINC to see if that can remedy the problem. Please dig further to see if this could crash a system as well, apart from failed units. I'm running Win 7 x64 SP 1, all latest updates installed. 32 GB of RAM. Core I7-5930 @ 3.5 GHz CPU. I also have Atlas & Oracle Virtualbox installed. Thanks! BE." Is everything else up to date, i.e. are your versions of BOINC and Virtualbox suitable for your system? "Atlas installed" doesn't really say a lot.

BelgianEnthousiast Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0	Message 27720 - Posted: 20 Mar 2016, 19:02:51 UTC - in response to Message 27718. Hi, Thanks for the reply. The BOINC mgr is at version 7.6.22 (x64) - widgets 3.0.1. The VirtualBox is at version 4.03.12 r 93733. Checking the version of VB, I noticed there's a new version. I'll install that one and check if it runs properly now. In the meantime: all was running happily for about a year. LHC successfully completed 16,245 WUs with no errors at all until last week. Atlas successfully completed 8,455 WUs, with sporadic units failing due to problems with the VirtualBox mid last year and an occasional need for a hard reboot (my wrongdoing...). But what happened last week was really drastic. I had no way of letting BOINC run 24/7 again; it hung repeatedly after a couple of minutes running LHC and Atlas. I expect LHC to be the culprit as it takes full precedence over Atlas. Whenever a batch of LHC work comes in, BOINC automatically processes LHC first. Anyway, any ideas would be most welcome. I'll try the new VB in the meantime. Kind Regards, BE.