A sudden huge increase in computation errors
Author | Message |
---|---|
Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0 |
I normally have very few computation errors but today 14/03/16 there has been a sudden massive increase. |
Joined: 20 Apr 11 Posts: 3 Credit: 267,858 RAC: 0 |
Same here. I also noticed dips in CPU performance from time to time, as if it stops working for a couple of seconds and then resumes. I think it corresponds to broken WUs, but I can't be sure. |
Joined: 20 Apr 11 Posts: 3 Credit: 267,858 RAC: 0 |
30 WUs with errors just today... |
Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0 |
50 and counting. |
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
Are these errors all from the same workunit sequence as wjt-15-L1-trc_jt-hl1TR-bb-L1__3__s__62.31_60.32__4_6__6__58.5_1_sixvf_boinc260? 14/03/2016 21:33:29 | LHC@home 1.0 | Aborting task wjt-15-L1-trc_jt-hl1TR-bb-L1__3__s__62.31_60.32__4_6__6__58.5_1_sixvf_boinc260_3: exceeded disk limit: 651.56MB > 572.20MB |
Joined: 11 Dec 09 Posts: 27 Credit: 236,763,011 RAC: 0 |
900 today for disk limit exceeded! |
Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0 |
The last 2 units to fail were wjt-15-L1-trc_jt-hl1TR-bb-L1__23__s__62.31_60.32__4_6__6__18_1_sixvf_boinc2531_0 wjt-15-L1-trc_jt-hl1TR-bb-L1__23__s__62.31_60.32__4_6__6__18_1_sixvf_boinc2531_2 |
Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0 |
wjt-15-L1-trc_jt-hl1TR-bb-L1__23__s__62.31_60.32__4_6__6__18_1_sixvf_boinc2531_2 196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED |
Joined: 20 Apr 11 Posts: 3 Credit: 267,858 RAC: 0 |
196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED on most of mine. I also have up to 10 with a 'Canceled by server' message. I'm assuming they withdrew those. |
Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0 |
I have only had a couple of those |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
Hmmmm, that's bad. CERN is having a lot of problems with infrastructure today :-( and maybe yesterday. I am actually travelling so can't do much right now. I'll try and pass the message and I shall look again this afternoon. Thanks for your message. Eric. |
Joined: 15 Jul 05 Posts: 250 Credit: 5,974,599 RAC: 0 |
Your BOINC client might have run out of available disk space. If your PC has space on disk, you can allocate more to BOINC as shown here: https://boinc.berkeley.edu/wiki/Local_preferences |
Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0 |
Double-checked preferences and I don't think it's that: 100GB was already allowed, but I have increased this to 150GB. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
The user is looking at the study; he may be exceeding the estimate. Eric. |
Joined: 27 Oct 07 Posts: 186 Credit: 3,297,640 RAC: 0 |
Quoting: "Your BOINC client might have run out of available disk space. If your PC has space on disk, you can allocate more to BOINC as shown here:" No, I don't think that's it. The message was "exceeded disk limit: 651.56MB > 572.20MB". That particular machine is nearly brand new, with a 1 TB data disk - it's reporting 928.88 GB free, and BOINC is allowed to use 100.00 GB of that. Other tasks from both LHC and other projects continued running normally. It's more likely that the wjt-15 tasks were given a workunit <rsc_disk_bound> of 600,000,000 - and exceeded even that. I hope it wasn't a result file the experimenter wanted uploading... |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
OK, these studies were writing too much data, which caused the disk limit to be exceeded. I still need to look at this, as I make an estimate of the disk space required... still, it is only an estimate. But there is definitely something to fix on my side. The studies have been modified to write less data. Thanks for all the feedback. Eric. |
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0 |
The user is writing too much data; I am supposed to calculate/estimate disk usage but that doesn't seem to be working. I am checking all this today. In the meantime the user has fixed the studies. You can delete any WUs with names like wjt-18-L1-trc...... wjt-15-L1-trc....... More news on this when checked. Eric. |
Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0 |
Is it possible it could also hang my system? I have only LHC running on the CPU, in parallel with GPUGrid on the GPU. The GPU seems to work fine, but I have major issues with the LHC WUs: at least 10 now. I never had any issues with LHC before, with only one failed WU out of 15683. I'll increase the disk size allocated to BOINC to see if that remedies the problem. Please dig further to see whether this could crash a system as well, apart from failing units. I'm running Win 7 x64 SP 1, all latest updates installed. 32 GB of RAM. Core I7-5930 @ 3.5 GHz CPU. I also have Atlas & Oracle Virtualbox installed. Thanks! BE. |
Joined: 2 Apr 10 Posts: 15 Credit: 8,604,036 RAC: 0 |
Quoting: "Is it possible it could also hang my system ?" Is everything else up to date, i.e. are the versions of BOINC and VirtualBox suitable for your system? "Atlas installed" doesn't really say a lot. |
Joined: 5 Apr 15 Posts: 18 Credit: 5,910,849 RAC: 0 |
Hi, Thanks for the reply. The BOINC manager is at version 7.6.22 (x64), widgets 3.0.1. VirtualBox is at version 4.03.12 r 93733. Checking the version of VB, I noticed there's a new version. I'll install that one and check whether it runs properly now. In the meantime: all was running happily for about a year. LHC successfully completed 16,245 WUs so far with no errors at all until last week. Atlas successfully completed 8,455 WUs, with sporadic units failing due to problems with VirtualBox mid last year and an occasional need for a hard reboot (my wrongdoing...). But what happened last week was really drastic. I had no way of letting BOINC run again 24/7; it hung repeatedly after a couple of minutes running LHC and Atlas. I expect LHC to be the culprit as it takes full precedence over Atlas. Whenever a batch of LHC comes in, BOINC automatically processes LHC first. Anyway, any ideas would be most welcome. I'll try the new VB in the meantime. Kind Regards, BE. |
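
For anyone wanting to check the numbers quoted in this thread, here is a minimal Python sketch. It parses the "exceeded disk limit" line posted above and checks whether the guessed <rsc_disk_bound> of 600,000,000 bytes matches the 572.20MB limit shown in the log. The regular expression, the function names and the assumption that the client's "MB" means 2**20 bytes are illustrative guesses, not taken from the BOINC source; the 600,000,000 figure is the estimate given earlier in the thread, not a confirmed setting.

```python
import re

# Matches the client message quoted in the thread, e.g.
# "... Aborting task <name>: exceeded disk limit: 651.56MB > 572.20MB"
ABORT_RE = re.compile(
    r"Aborting task (?P<task>\S+): exceeded disk limit: "
    r"(?P<used>[\d.]+)MB > (?P<limit>[\d.]+)MB"
)

MIB = 1024 * 1024  # assumption: "MB" in the log means 2**20 bytes


def parse_abort(line):
    """Return (task, used_mb, limit_mb), or None if the line does not match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    return m.group("task"), float(m.group("used")), float(m.group("limit"))


def bound_in_mb(rsc_disk_bound_bytes):
    """Convert a disk bound given in bytes to the MB unit assumed above."""
    return rsc_disk_bound_bytes / MIB


# Log line copied from the post above.
line = (
    "14/03/2016 21:33:29 | LHC@home 1.0 | Aborting task "
    "wjt-15-L1-trc_jt-hl1TR-bb-L1__3__s__62.31_60.32__4_6__6__58.5_1_sixvf_boinc260_3: "
    "exceeded disk limit: 651.56MB > 572.20MB"
)

task, used_mb, limit_mb = parse_abort(line)
print(task)
print(f"guessed bound: {bound_in_mb(600_000_000):.2f} MB, logged limit: {limit_mb:.2f} MB")
print(f"over the limit by {used_mb - limit_mb:.2f} MB")
```

Under that unit assumption, 600,000,000 bytes comes out at 572.20 MB, which agrees with the limit in the log line, so the abort looks like a per-workunit bound rather than the client-wide disk allowance.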
©2025 CERN