Message boards : Number crunching : Write to disk NOT honored ! !
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6268 - Posted: 2 Mar 2005, 16:15:33 UTC
Last modified: 2 Mar 2005, 16:40:58 UTC

I have been watching the "slots" folder while the application writes to disk. My settings are:

- Leave applications in memory while preempted? (suspended applications will consume swap space if 'yes'): yes
- Switch between applications every (recommended: 60 minutes): 60 minutes
- On multiprocessors, use at most 2 processors

Disk and memory usage:
- Write to disk at most every 60 seconds
- Use no more than 75% of total virtual memory

Using Boinc v4.19, sixtrack v4.64.

During the whole processing time, data was written to the "slots" folder only when the WU started being crunched. After that, no data was written anymore.

Looking at E@H while a WU is crunched, files are updated every minute.

Is this the reason why we are having trouble with the 00:00:00 processing time?

And could this explain why, when restarting BOINC, the WU's restart crunching with a 00:00 CPU time even though they have already been processed for a while?

Best greetings from Belgium
Thierry
ID: 6268
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6280 - Posted: 2 Mar 2005, 19:42:52 UTC
Last modified: 2 Mar 2005, 19:43:23 UTC

It is becoming stranger and stranger.

I had 2 WU’s crunching:

First WU (was left in memory):
LHC@home - 2005-03-02 19:01:33 - Restarting result v64lhc90-47s10_12560_1_sixvf_5785_0 using sixtrack version 4.64

Files in the "slots" folder were overwritten every minute for this WU. Then the crunching finished:

LHC@home - 2005-03-02 20:10:28 - Computation for result v64lhc90-47s10_12560_1_sixvf_5785 finished
LHC@home - 2005-03-02 20:10:28 - Started upload of v64lhc90-47s10_12560_1_sixvf_5785_0_0
LHC@home - 2005-03-02 20:10:35 - Finished upload of v64lhc90-47s10_12560_1_sixvf_5785_0_0
LHC@home - 2005-03-02 20:10:35 - Throughput 9830 bytes/sec

But the final CPU time is 00:00:00.

Second WU:
LHC@home - 2005-03-02 19:18:04 - Starting result v64lhc90-44s8_10530_1_sixvf_5769_0 using sixtrack version 4.64

Then this WU was paused:

LHC@home - 2005-03-02 20:10:28 - Pausing result v64lhc90-44s8_10530_1_sixvf_5769_0 (left in memory)

All the files for this WU in the "slots" had a time stamp of 19:18; only the fort.91 file was overwritten at 19:25. No, I'm not dreaming; I checked it 3 times, using refresh in Windows Explorer. The content of this file is "70800 100000".

Actual status of that WU: CPU time 00:47:20, 70.89% done, to completion 00:19:26.

How can the fort.91 file with a time stamp of 19:25 contain 70800 (representing the %, I would guess) although the WU was paused only at 20:10:28 at 70.89%???

To be continued.
ID: 6280
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6285 - Posted: 2 Mar 2005, 20:33:40 UTC

The WU left in memory started crunching again.

LHC@home - 2005-03-02 20:59:37 - Resuming result v64lhc90-44s8_10530_1_sixvf_5769_0 using sixtrack version 4.64

The WU resumed crunching with the correct CPU time and % done. In the "slots" folder no files had a new time stamp (checked at 21:13).

Then it finished :

LHC@home - 2005-03-02 21:20:32 - Computation for result v64lhc90-44s8_10530_1_sixvf_5769 finished

A lot of files in the "slots" were overwritten at 21:20 and then, as normal, these files were deleted.

And guess what: the CPU time was reported as 00:00:00.
ID: 6285
John McLeod VII
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 6295 - Posted: 3 Mar 2005, 1:15:42 UTC
Last modified: 3 Mar 2005, 1:18:06 UTC

It is "Write to disk at most every xx seconds". This is the most often you will allow the project to write to disk. Each science application writes to disk at whatever checkpoints the science code has. I believe that LHC can write continuously, so it does update every xx seconds if there is any data that needs writing. But apparently not every file is updated every time. The number 70800 probably represents the turn number. I don't know what the other number represents.

[EDIT] After reading a couple of other posts, it is apparent that the work is not saved very often. Therefore, the checkpoints (if they are working at all) must be very far apart.
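The gating behavior described here can be sketched as a small simulation. This is an illustrative model only; the class and variable names are mine, not the actual BOINC source:

```python
class CheckpointGate:
    """Illustrative model of the "write to disk at most every X seconds"
    preference: the science app may request a checkpoint at any time, but
    a request is only granted once the minimum interval has elapsed since
    the last granted checkpoint."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.last_checkpoint = float("-inf")

    def time_to_checkpoint(self, now):
        # The app asking: "may I write my state to disk now?"
        return now - self.last_checkpoint >= self.min_interval_s

    def checkpoint_completed(self, now):
        # The app reporting a finished state write.
        self.last_checkpoint = now


gate = CheckpointGate(min_interval_s=60)
granted = []
for t in range(0, 300, 10):        # the app requests every 10 seconds
    if gate.time_to_checkpoint(t):
        gate.checkpoint_completed(t)
        granted.append(t)
print(granted)  # [0, 60, 120, 180, 240]
```

In such a model, files in the slot directory would only change at the granted times, which would match seeing updates at most once a minute with the 60-second setting.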


BOINC WIKI
ID: 6295
Profile Ben Segal
Volunteer moderator
Project administrator

Joined: 1 Sep 04
Posts: 141
Credit: 2,579
RAC: 0
Message 6311 - Posted: 3 Mar 2005, 14:45:07 UTC

Thanks a lot, Thierry and John, for your hard work and careful observations of this "zero CPU" problem. The Sixtrack developers are very grateful to you (and other users in other threads) for pointing them to the likely cause.

Ben Segal / LHC@home
ID: 6311
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6315 - Posted: 3 Mar 2005, 16:03:15 UTC - in response to Message 6311.  
Last modified: 3 Mar 2005, 16:05:47 UTC

> Thanks a lot, Thierry and John, for your hard work and careful observations of
> this "zero CPU" problem. The Sixtrack developers are very grateful to you (and
> other users in other threads) for pointing them to the likely cause.
>
> Ben Segal / LHC@home

Dear Ben,

I hope that with this information the developers will be able to track down the exact reason why this "zero problem" occurs.

Good luck there, and keep all of us informed!

Best regards,

Thierry
ID: 6315
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6320 - Posted: 3 Mar 2005, 19:00:26 UTC
Last modified: 3 Mar 2005, 19:42:50 UTC

Well, it took me longer than I had expected. It turns out that the dates saved by BOINC View changed format in midstream ... so I had to fix that ...

BUT, I do have more interesting data for what it is worth.

Looking at the data, and correlating it with WU type (using the first 6 letters of the WU name as a discriminator), we get these results:

  SELECT SubString(`ResultName`, 1, 6) ResName,
         Count(`AppName`) MyCount,
         Floor(`FinalCPUTime`) FinalCPU,
         `AppName`,
         `ExitStatus`
    FROM `BOINCViewLog`
   WHERE `AppName` = 'sixtrack'
         AND `FinalCPUTime` > 0
  GROUP BY ResName
UNION
  SELECT SubString(`ResultName`, 1, 6) ResName,
         Count(`AppName`) MyCount,
         Floor(`FinalCPUTime`) FinalCPU,
         `AppName`,
         `ExitStatus`
    FROM `BOINCViewLog`
   WHERE `AppName` = 'sixtrack'
         AND `FinalCPUTime` = 0
  GROUP BY ResName
ORDER BY ResName

ResName MyCount FinalCPU App            ExitStatus
v64bbe  1  	1641  	 sixtrack  	0
v64boi 	16 	2844 	 sixtrack 	0
v64boi 	3 	0 	 sixtrack 	0
v64lhc 	3056 	3037 	 sixtrack 	0
v64lhc 	385 	0 	 sixtrack 	0
v64tun 	68 	8867 	 sixtrack 	0
v64tun 	8 	0 	 sixtrack 	0


If we change it to "ORDER BY appversion, ResName" (with the first 8 letters of the WU name as the discriminator and the application version added as a last column), we get:

ResName 	MyCount 	FinalCPU 	App 	ExitStatus 	AppVersion
v64lhc10  	2991  	3037  	sixtrack  	0  	4.45
v64lhc10 	352 	0 	sixtrack 	0 	4.45
v64bbe6i 	1 	1641 	sixtrack 	0 	4.46
v64boinc 	16 	2844 	sixtrack 	0 	4.46
v64tunes 	68 	8867 	sixtrack 	0 	4.46
v64tunes 	8 	0 	sixtrack 	0 	4.46
v64lhc94 	2 	4462 	sixtrack 	0 	4.63
v64lhc95 	3 	4015 	sixtrack 	0 	4.63
v64lhc95 	1 	0 	sixtrack 	0 	4.63
v64lhc96 	23 	3656 	sixtrack 	0 	4.63
v64lhc96 	6 	0 	sixtrack 	0 	4.63
v64lhc97 	7 	2595 	sixtrack 	0 	4.63
v64boinc 	3 	0 	sixtrack 	0 	4.64
v64lhc86 	1 	4058 	sixtrack 	0 	4.64
v64lhc86 	2 	0 	sixtrack 	0 	4.64
v64lhc88 	2 	4190 	sixtrack 	0 	4.64
v64lhc88 	2 	0 	sixtrack 	0 	4.64
v64lhc89 	5 	298 	sixtrack 	0 	4.64
v64lhc89 	5 	0 	sixtrack 	0 	4.64
v64lhc90 	5 	7 	sixtrack 	0 	4.64
v64lhc90 	2 	0 	sixtrack 	0 	4.64
v64lhc91 	1 	3983 	sixtrack 	0 	4.64
v64lhc91 	1 	0 	sixtrack 	0 	4.64
v64lhc92 	14 	24 	sixtrack 	0 	4.64
v64lhc92 	13 	0 	sixtrack 	0 	4.64
v64lhc93 	2 	2175 	sixtrack 	0 	4.64
v64lhc93 	1 	0 	sixtrack 	0 	4.64



Now, what was most interesting to me: as I was gathering the data, I watched one LHC WU complete. It came up to 15 hours 4 minutes and some odd seconds. But then it hit 100% and the reported CPU time was 14 hours 41 minutes and change ... (Result: 316488).

So, the problem may have some other attributes. In this specific case it lopped off 9 plus minutes (closer to 14 minutes because I think it was 15 hours 4 min and some odd seconds).

In any case, this would make sense in the context that the SHORTER work units are most affected. That is the sense I got out of the numbers that I ran today. So, if it is a checkpointing problem, that would show up both in the restarts and the completions.

If the completed time is handled as the last checkpointed time, and the WU is not checkpointed at completion, that would show as a truncated time. Short WUs that ran longer than 0 but less than one checkpoint interval will show up as zero ...
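The truncation follows from simple arithmetic. A rough model, assuming the reported time is whatever was last checkpointed (the numbers are purely illustrative):

```python
def reported_cpu_time(actual_runtime_s, checkpoint_interval_s):
    """If the final CPU time is read from the last checkpoint rather than
    measured at completion, the reported value is the actual runtime
    rounded down to the last checkpoint boundary."""
    return (actual_runtime_s // checkpoint_interval_s) * checkpoint_interval_s

# A short WU that ran 14 minutes against a 20-minute checkpoint interval
# never reached its first checkpoint, so it reports zero:
print(reported_cpu_time(14 * 60, 20 * 60))             # 0

# A long WU loses at most one interval's worth of time:
print(reported_cpu_time(15 * 3600 + 4 * 60, 20 * 60))  # 54000 (exactly 15 h)
```

This reproduces both symptoms in the thread: zero CPU times on short results, and modest truncation on long ones.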

[edit]
I just had another 0 time, and it was a v64boinc WU, but the interesting thing is that I watched it download, start, and run for a bit. I was doing other things, so I don't know how far it got before it died, but it does look like it did not run all that long, certainly less than 30 minutes ... NOW I am wondering if the LHC@home checkpoint interval is 15 minutes or so (20?) ... that again would explain the "truncation" of the run time ...


ID: 6320
Profile Ben Segal
Volunteer moderator
Project administrator

Joined: 1 Sep 04
Posts: 141
Credit: 2,579
RAC: 0
Message 6323 - Posted: 3 Mar 2005, 22:33:01 UTC - in response to Message 6320.  

> ...
> In any case, this would make sense in the context that the SHORTER work units
> are most affected. That is the sense I got out of the numbers that I ran
> today. So, if it is a checkpointing problem, that would show up both in the
> restarts and the completions.
>

Agreed.

> If the completed time is handled as the last checkpointed time, and the WU is
> not checkpointed at completion, that would show as a truncated time. Short WUs
> that ran longer than 0 but less than one checkpoint interval will show up as
> zero ...
>
> [edit]
> I just had another 0 time, and it was a v64boinc WU, but the interesting thing
> is that I watched it download, start, and run for a bit. I was doing other
> things, so I don't know how far it got before it died, but it does look like
> it did not run all that long, certainly less than 30 minutes ... NOW I am
> wondering if the LHC@home checkpoint interval is 15 minutes or so (20?) ...
> that again would explain the "truncation" of the run time ...
>

Well, Paul, though it's not so simple you could be right: Sixtrack _tries_ to checkpoint regularly and frequently - every 100 turns (for a 100K turn job) and every 1000 turns (for a million turn job). When Sixtrack runs in a non-BOINC environment we see no problems at all with checkpointing. But under BOINC, Sixtrack can only _request_ checkpointing and the BOINC interface decides if a checkpoint is actually done or not. So under BOINC, checkpoint intervals are variable and could well be as long as 15-20 minutes. But some cases we see would imply that no checkpoint had been done at all, or (more likely?) that the checkpoint file is deleted too quickly at job termination, before the final CPU time is read out.
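The request/grant split described here matches the standard BOINC application pattern, where the app calls `boinc_time_to_checkpoint()` at each candidate point and only writes state when the client agrees. A sketch of the turn loop, stubbed in Python since the real Sixtrack code is a Fortran/C application linked against the BOINC API; the turn counts come from this post, while the set of granted requests is an arbitrary example:

```python
def run_job(total_turns, turns_per_request, granted_requests):
    """Request a checkpoint every `turns_per_request` turns; the client
    (modeled here as a preset `granted_requests` set) decides which of
    those requests actually result in a state file being written."""
    checkpoints_written = []
    for turn in range(1, total_turns + 1):
        if turn % turns_per_request == 0:          # app-side request point
            if turn in granted_requests:           # client-side decision
                checkpoints_written.append(turn)   # state hits the slot dir
    return checkpoints_written

# A 100K-turn job requests a checkpoint every 100 turns (1000 requests),
# but the client may grant only a handful of them:
print(run_job(100_000, 100, granted_requests={5_000, 40_000, 90_000}))
# -> [5000, 40000, 90000]
```

Under this model, if the last granted checkpoint lags far behind the final turn, or if the checkpoint file is deleted before the final CPU time is read out, the reported time would be truncated exactly as observed.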

We suspect we have problems with the (recently changed) BOINC API used to build Sixtrack for LHC@home, and are taking a close look at this area.

Thanks again for your interest and constructive help!

Ben
ID: 6323
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6327 - Posted: 4 Mar 2005, 2:17:22 UTC - in response to Message 6323.  
Last modified: 4 Mar 2005, 2:20:17 UTC

> Well, Paul, though it's not so simple you could be right: Sixtrack _tries_ to
> checkpoint regularly and frequently - every 100 turns (for a 100K turn job)
> and every 1000 turns (for a million turn job). When Sixtrack runs in a
> non-BOINC environment we see no problems at all with checkpointing. But under
> BOINC, Sixtrack can only _request_ checkpointing and the BOINC interface
> decides if a checkpoint is actually done or not. So under BOINC, checkpoint
> intervals are variable and could well be as long as 15-20 minutes. But some
> cases we see would imply that no checkpoint had been done at all, or (more
> likely?) that the checkpoint file is deleted too quickly at job termination,
> before the final CPU time is read out.
>
> We suspect we have problems with the (recently changed) BOINC API used to
> build Sixtrack for LHC@home, and are taking a close look at this area.
>
> Thanks again for your interest and constructive help!
>
Ben,

I did play with the checkpoint interval. My original setting was 1800 seconds (I chose that because I was assuming that, normally, my computers do not restart), and with that, if I killed and restarted BOINC, I would see the time reset to 0, as expected.

Changing it to 60 seconds and doing stops and restarts at one-minute intervals seemed to checkpoint correctly as well. At least on that one machine, and so far.

Since I seem to be coming out of my "funk", at least as far as this goes, I re-attached all my machines, and I will load the logs later tonight or tomorrow and take another look to see if I see anything of interest.

Things seem to be working better today as far as this goes. And with the 60-second allowance it seems to be working, which is a smilie or a frown depending on what you really wish for.

Ben, one more thought: you might try to make a special build of your program that writes to a text file independently of the BOINC interface to capture the run times and checkpoint request times ... some of us, I am sure, would be more than happy to run this on one or more systems to see if we can nail down what is going on with the application ...
ID: 6327
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6360 - Posted: 4 Mar 2005, 18:47:28 UTC
Last modified: 4 Mar 2005, 19:00:01 UTC

I made a new discovery:

2 WU's from the type v64boince6ib1-xxx were crunching. For one of these WU's, files in the "slots" were overwritten at a regular interval, every minute, but NOT for the other WU.

I shut down BOINC. At that moment, most of the files in the slots of both WU's were overwritten.

Then I started it up again. Both WU's resumed crunching, restarting at the right CPU time, BUT now for BOTH WU's the files were overwritten every minute.

So how come, after restarting BOINC, the files in both slots are now overwritten when they were not before?

Best greetings from Belgium
Thierry
ID: 6360
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6361 - Posted: 4 Mar 2005, 19:04:05 UTC - in response to Message 6360.  

> I made a new discovery:
>
> 2 WU's from the type v64boince6ibxxx were crunching. For one of these WU's,
> files in the "slots" were overwritten at a regular interval, every minute,
> but NOT for the other WU.
>
> I shut down BOINC. At that moment, most of the files in the slots of both WU's
> were overwritten.
>
> Then I started it up again. Both WU's resumed crunching, restarting at the
> right CPU time, BUT now for BOTH WU's the files were overwritten
> every minute.
>
> So how come, after restarting BOINC, the files in both slots are now
> overwritten when they were not before?

Are you running dual-processor or HT machines? I wonder if that is a common point? Hmm, hard for me to check well ... nope, just as bad though: my one non-HT machine has a fairly high number of failures ... so, not just dual processors ... rats ...

It could still be bad file handle management ... possibly on the part of BOINC?
ID: 6361
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6363 - Posted: 4 Mar 2005, 19:26:01 UTC - in response to Message 6361.  

> Are you running dual-processor or HT machines? I wonder if that is a common
> point? Hmm, hard for me to check well ... nope, just as bad though: my one
> non-HT machine has a fairly high number of failures ... so, not just dual
> processors ... rats ...
>
> It could still be bad file handle management ... possibly on the part of BOINC?

Hi Paul,

I'm running a HT CPU.

Best regards,

Thierry
ID: 6363
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6379 - Posted: 4 Mar 2005, 23:38:58 UTC - in response to Message 6363.  

> Hi Paul,

Hi Thierry!

> I'm running a HT CPU.

Me too, on 4 machines. But I have one single-threaded processor among my 5, and it has 0-time results ... so, well, if it were simple, anyone could play ...

> Best regards,

And my best to you ...
ID: 6379
