Message boards : Number crunching : Write to disk NOT honored ! !
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6268 - Posted: 2 Mar 2005, 16:15:33 UTC
Last modified: 2 Mar 2005, 16:40:58 UTC

I have been watching the "slots" folder while the application writes to disk. My settings are:

- Leave applications in memory while preempted? (suspended applications will consume swap space if 'yes'): yes
- Switch between applications every (recommended: 60 minutes): 60 minutes
- On multiprocessors, use at most 2 processors

Disk and memory usage:
- Write to disk at most every 60 seconds
- Use no more than 75% of total virtual memory

Using Boinc v4.19, sixtrack v4.64.

During the whole processing time, data was written to the "slots" folder only when the WU started being crunched. After that, no data was written anymore.

Looking at E@H while a WU is crunched, files are updated every minute.

Is this the reason why we are having trouble with the 00:00:00 processing time?

And could this explain why, when restarting BOINC, the WU's restart crunching with a 00:00 CPU time even though they have already been processed for a while?

Best greetings from Belgium
Thierry
ID: 6268
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6280 - Posted: 2 Mar 2005, 19:42:52 UTC
Last modified: 2 Mar 2005, 19:43:23 UTC

It is becoming stranger and stranger.

I had 2 WU’s crunching:

First WU (was left in memory):
LHC@home - 2005-03-02 19:01:33 - Restarting result v64lhc90-47s10_12560_1_sixvf_5785_0 using sixtrack version 4.64

Files in the "slots" folder were overwritten every minute for this WU. Then the crunching finished:

LHC@home - 2005-03-02 20:10:28 - Computation for result v64lhc90-47s10_12560_1_sixvf_5785 finished
LHC@home - 2005-03-02 20:10:28 - Started upload of v64lhc90-47s10_12560_1_sixvf_5785_0_0
LHC@home - 2005-03-02 20:10:35 - Finished upload of v64lhc90-47s10_12560_1_sixvf_5785_0_0
LHC@home - 2005-03-02 20:10:35 - Throughput 9830 bytes/sec

But the final CPU time is 00:00:00.

Second WU:
LHC@home - 2005-03-02 19:18:04 - Starting result v64lhc90-44s8_10530_1_sixvf_5769_0 using sixtrack version 4.64

Then this WU was paused:

LHC@home - 2005-03-02 20:10:28 - Pausing result v64lhc90-44s8_10530_1_sixvf_5769_0 (left in memory)

All the files for this WU in the "slots" had a time stamp of 19:18; only the fort.91 file was overwritten at 19:25. No, I'm not dreaming; I checked it 3 times, using refresh in Windows Explorer. The content of this file is "70800 100000".

Actual status of that WU: CPU time 00:47:20, 70.89% done, to completion 00:19:26.

How can the fort.91 file with a time stamp of 19:25 contain 70800 (representing the %, I would guess) although the WU was paused only at 20:10:28 at 70.89%???

To be continued.
ID: 6280
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6285 - Posted: 2 Mar 2005, 20:33:40 UTC

The WU left in memory started crunching again.

LHC@home - 2005-03-02 20:59:37 - Resuming result v64lhc90-44s8_10530_1_sixvf_5769_0 using sixtrack version 4.64

The WU resumed crunching with the correct CPU time and % done. In the "slots" folder no files had a new time stamp (checked at 21:13).

Then it finished :

LHC@home - 2005-03-02 21:20:32 - Computation for result v64lhc90-44s8_10530_1_sixvf_5769 finished

A lot of files in the "slots" were overwritten at 21:20 and then, as normal, these files were deleted.

And guess what: the CPU time was reported as 00:00:00.
ID: 6285
John McLeod VII
Joined: 2 Sep 04
Posts: 165
Credit: 146,925
RAC: 0
Message 6295 - Posted: 3 Mar 2005, 1:15:42 UTC
Last modified: 3 Mar 2005, 1:18:06 UTC

It is "Write to disk at most every xx seconds". This is the most often you will allow the project to write to disk. Each science application writes to disk at whatever checkpoints the science code has. I believe that LHC can write continuously, so it does update every xx seconds if there is any data that needs writing. But apparently not every file is updated every time. The number 70800 probably represents the turn number. I don't know what the other number represents.

[EDIT] After reading a couple of other posts, it is apparent that the work is not saved very often. Therefore, the checkpoints (if they are working at all) must be very far apart.
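The gating behavior described here can be sketched as a small simulation. This is an illustrative model only; the class and variable names are mine, not the actual BOINC source:

```python
class CheckpointGate:
    """Illustrative model of the "write to disk at most every X seconds"
    preference: the science app may request a checkpoint at any time, but
    a request is only granted once the minimum interval has elapsed since
    the last granted checkpoint."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.last_checkpoint = float("-inf")

    def time_to_checkpoint(self, now):
        # The app asking: "may I write my state to disk now?"
        return now - self.last_checkpoint >= self.min_interval_s

    def checkpoint_completed(self, now):
        # The app reporting a finished state write.
        self.last_checkpoint = now


gate = CheckpointGate(min_interval_s=60)
granted = []
for t in range(0, 300, 10):        # the app requests every 10 seconds
    if gate.time_to_checkpoint(t):
        gate.checkpoint_completed(t)
        granted.append(t)
print(granted)  # [0, 60, 120, 180, 240]
```

In such a model, files in the slot directory would only change at the granted times, which would match seeing updates at most once a minute with the 60-second setting.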


BOINC WIKI
ID: 6295
Profile Ben Segal
Volunteer moderator
Project administrator

Joined: 1 Sep 04
Posts: 141
Credit: 2,579
RAC: 0
Message 6311 - Posted: 3 Mar 2005, 14:45:07 UTC

Thanks a lot, Thierry and John, for your hard work and careful observations of this "zero CPU" problem. The Sixtrack developers are very grateful to you (and other users in other threads) for pointing them to the likely cause.

Ben Segal / LHC@home
ID: 6311
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6315 - Posted: 3 Mar 2005, 16:03:15 UTC - in response to Message 6311.  
Last modified: 3 Mar 2005, 16:05:47 UTC

> Thanks a lot, Thierry and John, for your hard work and careful observations of
> this "zero CPU" problem. The Sixtrack developers are very grateful to you (and
> other users in other threads) for pointing them to the likely cause.
>
> Ben Segal / LHC@home

Dear Ben,

I hope that with this information the developers will be able to track down the exact reason why this "zero problem" occurs.

Good luck there, and keep all of us informed!

Best regards,

Thierry
ID: 6315
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6320 - Posted: 3 Mar 2005, 19:00:26 UTC
Last modified: 3 Mar 2005, 19:42:50 UTC

Well, it took me longer than I had expected. It turns out that the dates saved by BOINC View changed format in midstream ... so I had to fix that ...

BUT, I do have more interesting data for what it is worth.

Looking at the data, and correlating it with WU type (using the first 6 letters of the WU name as a discriminator), we get these results:

  SELECT SubString(`ResultName`, 1, 6) ResName,
         Count(`AppName`) MyCount,
         Floor(`FinalCPUTime`) FinalCPU,
         `AppName`,
         `ExitStatus`
    FROM `BOINCViewLog`
   WHERE `AppName` = 'sixtrack'
         AND `FinalCPUTime` > 0
  GROUP BY ResName
UNION
  SELECT SubString(`ResultName`, 1, 6) ResName,
         Count(`AppName`) MyCount,
         Floor(`FinalCPUTime`) FinalCPU,
         `AppName`,
         `ExitStatus`
    FROM `BOINCViewLog`
   WHERE `AppName` = 'sixtrack'
         AND `FinalCPUTime` = 0
  GROUP BY ResName
ORDER BY ResName

ResName MyCount FinalCPU App            ExitStatus
v64bbe  1  	1641  	 sixtrack  	0
v64boi 	16 	2844 	 sixtrack 	0
v64boi 	3 	0 	 sixtrack 	0
v64lhc 	3056 	3037 	 sixtrack 	0
v64lhc 	385 	0 	 sixtrack 	0
v64tun 	68 	8867 	 sixtrack 	0
v64tun 	8 	0 	 sixtrack 	0


If we change it to "ORDER BY appversion, ResName" (with the first 8 letters of the WU name as the discriminator and the application version added as a last column), we get:

ResName 	MyCount 	FinalCPU 	App 	ExitStatus 	AppVersion
v64lhc10  	2991  	3037  	sixtrack  	0  	4.45
v64lhc10 	352 	0 	sixtrack 	0 	4.45
v64bbe6i 	1 	1641 	sixtrack 	0 	4.46
v64boinc 	16 	2844 	sixtrack 	0 	4.46
v64tunes 	68 	8867 	sixtrack 	0 	4.46
v64tunes 	8 	0 	sixtrack 	0 	4.46
v64lhc94 	2 	4462 	sixtrack 	0 	4.63
v64lhc95 	3 	4015 	sixtrack 	0 	4.63
v64lhc95 	1 	0 	sixtrack 	0 	4.63
v64lhc96 	23 	3656 	sixtrack 	0 	4.63
v64lhc96 	6 	0 	sixtrack 	0 	4.63
v64lhc97 	7 	2595 	sixtrack 	0 	4.63
v64boinc 	3 	0 	sixtrack 	0 	4.64
v64lhc86 	1 	4058 	sixtrack 	0 	4.64
v64lhc86 	2 	0 	sixtrack 	0 	4.64
v64lhc88 	2 	4190 	sixtrack 	0 	4.64
v64lhc88 	2 	0 	sixtrack 	0 	4.64
v64lhc89 	5 	298 	sixtrack 	0 	4.64
v64lhc89 	5 	0 	sixtrack 	0 	4.64
v64lhc90 	5 	7 	sixtrack 	0 	4.64
v64lhc90 	2 	0 	sixtrack 	0 	4.64
v64lhc91 	1 	3983 	sixtrack 	0 	4.64
v64lhc91 	1 	0 	sixtrack 	0 	4.64
v64lhc92 	14 	24 	sixtrack 	0 	4.64
v64lhc92 	13 	0 	sixtrack 	0 	4.64
v64lhc93 	2 	2175 	sixtrack 	0 	4.64
v64lhc93 	1 	0 	sixtrack 	0 	4.64



Now, what was most interesting to me: as I was gathering the data, I watched one LHC WU complete. It came up to 15 hours 4 minutes and some odd seconds. But then it hit 100% and the reported CPU time was 14 hours 41 minutes and change ... (Result: 316488).

So, the problem may have some other attributes. In this specific case it lopped off 9 plus minutes (closer to 14 minutes because I think it was 15 hours 4 min and some odd seconds).

In any case, this would make sense in the context that the SHORTER work units are most affected. That is the sense I got out of the numbers that I ran today. So, if it is a checkpointing problem, that would show up both in the restarts and the completions.

If the completed time is handled as the last checkpointed time, and the WU is not checkpointed at completion, that would show as a truncated time. Short WUs that ran longer than 0 but less than one checkpoint interval will show up as zero ...
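The truncation follows from simple arithmetic. A rough model, assuming the reported time is whatever was last checkpointed (the numbers are purely illustrative):

```python
def reported_cpu_time(actual_runtime_s, checkpoint_interval_s):
    """If the final CPU time is read from the last checkpoint rather than
    measured at completion, the reported value is the actual runtime
    rounded down to the last checkpoint boundary."""
    return (actual_runtime_s // checkpoint_interval_s) * checkpoint_interval_s

# A short WU that ran 14 minutes against a 20-minute checkpoint interval
# never reached its first checkpoint, so it reports zero:
print(reported_cpu_time(14 * 60, 20 * 60))             # 0

# A long WU loses at most one interval's worth of time:
print(reported_cpu_time(15 * 3600 + 4 * 60, 20 * 60))  # 54000 (exactly 15 h)
```

This reproduces both symptoms in the thread: zero CPU times on short results, and modest truncation on long ones.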

[edit]
I just had another 0 time, and it was a v64boinc WU, but the interesting thing is that I watched it download, start, and run for a bit. I was doing other things, so I don't know how far it got before it died, but it does look like it did not run all that long, certainly less than 30 minutes ... NOW I am wondering if the LHC@home checkpoint interval is 15 minutes or so (20?) ... that again would explain the "truncation" of the run time ...


ID: 6320
Profile Ben Segal
Volunteer moderator
Project administrator

Joined: 1 Sep 04
Posts: 141
Credit: 2,579
RAC: 0
Message 6323 - Posted: 3 Mar 2005, 22:33:01 UTC - in response to Message 6320.  

> ...
> In any case, this would make sense in the context that the SHORTER work units
> are most affected. That is the sense I got out of the numbers that I ran
> today. So, if it is a checkpointing problem, that would show up both in the
> restarts and the completions.
>

Agreed.

> If the completed time is handled as the last checkpointed time, and the WU is
> not checkpointed at completion, that would show as a truncated time. Short WUs
> that ran longer than 0 but less than one checkpoint interval will show up as
> zero ...
>
> [edit]
> I just had another 0 time, and it was a v64boinc WU, but the interesting thing
> is that I watched it download, start, and run for a bit. I was doing other
> things, so I don't know how far it got before it died, but it does look like
> it did not run all that long, certainly less than 30 minutes ... NOW I am
> wondering if the LHC@home checkpoint interval is 15 minutes or so (20?) ...
> that again would explain the "truncation" of the run time ...
>

Well, Paul, though it's not so simple you could be right: Sixtrack _tries_ to checkpoint regularly and frequently - every 100 turns (for a 100K turn job) and every 1000 turns (for a million turn job). When Sixtrack runs in a non-BOINC environment we see no problems at all with checkpointing. But under BOINC, Sixtrack can only _request_ checkpointing and the BOINC interface decides if a checkpoint is actually done or not. So under BOINC, checkpoint intervals are variable and could well be as long as 15-20 minutes. But some cases we see would imply that no checkpoint had been done at all, or (more likely?) that the checkpoint file is deleted too quickly at job termination, before the final CPU time is read out.
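The request/grant split described here matches the standard BOINC application pattern, where the app calls `boinc_time_to_checkpoint()` at each candidate point and only writes state when the client agrees. A sketch of the turn loop, stubbed in Python since the real Sixtrack code is a Fortran/C application linked against the BOINC API; the turn counts come from this post, while the set of granted requests is an arbitrary example:

```python
def run_job(total_turns, turns_per_request, granted_requests):
    """Request a checkpoint every `turns_per_request` turns; the client
    (modeled here as a preset `granted_requests` set) decides which of
    those requests actually result in a state file being written."""
    checkpoints_written = []
    for turn in range(1, total_turns + 1):
        if turn % turns_per_request == 0:          # app-side request point
            if turn in granted_requests:           # client-side decision
                checkpoints_written.append(turn)   # state hits the slot dir
    return checkpoints_written

# A 100K-turn job requests a checkpoint every 100 turns (1000 requests),
# but the client may grant only a handful of them:
print(run_job(100_000, 100, granted_requests={5_000, 40_000, 90_000}))
# -> [5000, 40000, 90000]
```

Under this model, if the last granted checkpoint lags far behind the final turn, or if the checkpoint file is deleted before the final CPU time is read out, the reported time would be truncated exactly as observed.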

We suspect we have problems with the (recently changed) BOINC API used to build Sixtrack for LHC@home, and are taking a close look at this area.

Thanks again for your interest and constructive help!

Ben
ID: 6323
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6327 - Posted: 4 Mar 2005, 2:17:22 UTC - in response to Message 6323.  
Last modified: 4 Mar 2005, 2:20:17 UTC

> Well, Paul, though it's not so simple you could be right: Sixtrack _tries_ to
> checkpoint regularly and frequently - every 100 turns (for a 100K turn job)
> and every 1000 turns (for a million turn job). When Sixtrack runs in a
> non-BOINC environment we see no problems at all with checkpointing. But under
> BOINC, Sixtrack can only _request_ checkpointing and the BOINC interface
> decides if a checkpoint is actually done or not. So under BOINC, checkpoint
> intervals are variable and could well be as long as 15-20 minutes. But some
> cases we see would imply that no checkpoint had been done at all, or (more
> likely?) that the checkpoint file is deleted too quickly at job termination,
> before the final CPU time is read out.
>
> We suspect we have problems with the (recently changed) BOINC API used to
> build Sixtrack for LHC@home, and are taking a close look at this area.
>
> Thanks again for your interest and constructive help!
>
Ben,

I did play with the checkpoint interval. My original setting was 1800 seconds (I chose that because I was assuming that, normally, my computers do not restart), and with that, if I killed and restarted BOINC, I would see the time reset to 0, as expected.

Changing it to 60 seconds and doing stops and restarts at one-minute intervals seemed to checkpoint correctly as well. At least on that one machine, and so far.

Since I seem to be coming out of my "funk", at least as far as this goes, I re-attached all my machines, and I will load the logs later tonight or tomorrow and take another look to see if I see anything of interest.

Things seem to be working better today as far as this goes. And with the 60-second allowance it seems to be working, which is a smilie or a frown depending on what you really wish for.

Ben, one more thought: you might try to make a special build of your program that writes to a text file independently of the BOINC interface to capture the run times and checkpoint request times ... some of us, I am sure, would be more than happy to run this on one or more systems to see if we can nail down what is going on with the application ...
ID: 6327
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6360 - Posted: 4 Mar 2005, 18:47:28 UTC
Last modified: 4 Mar 2005, 19:00:01 UTC

I made a new discovery:

2 WU's from the type v64boince6ib1-xxx were crunching. For one of these WU's, files in the "slots" were overwritten at a regular interval, every minute, but NOT for the other WU.

I shut down BOINC. At that moment, most of the files in the slots of both WU's were overwritten.

Then I started it up again. Both WU's resumed crunching, restarting at the right CPU time, BUT now for BOTH WU's the files were overwritten every minute.

So how come, after restarting BOINC, the files in both slots are now overwritten when they were not before?

Best greetings from Belgium
Thierry
ID: 6360
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6361 - Posted: 4 Mar 2005, 19:04:05 UTC - in response to Message 6360.  

> I made a new discovery:
>
> 2 WU's from the type v64boince6ibxxx were crunching. For one of these WU's,
> files in the "slots" were overwritten at a regular interval, every minute,
> but NOT for the other WU.
>
> I shut down BOINC. At that moment, most of the files in the slots of both WU's
> were overwritten.
>
> Then I started it up again. Both WU's resumed crunching, restarting at the
> right CPU time, BUT now for BOTH WU's the files were overwritten
> every minute.
>
> So how come, after restarting BOINC, the files in both slots are now
> overwritten when they were not before?

Are you running dual-processor or HT machines? I wonder if that is a common point? Hmm, hard for me to check well ... nope, just as bad though: my one non-HT machine has a fairly high number of failures ... so, not just dual processors ... rats ...

It could still be bad file handle management ... possibly on the part of BOINC?
ID: 6361
Profile Thierry Van Driessche
Joined: 1 Sep 04
Posts: 157
Credit: 82,604
RAC: 0
Message 6363 - Posted: 4 Mar 2005, 19:26:01 UTC - in response to Message 6361.  

> Are you running dual-processor or HT machines? I wonder if that is a common
> point? Hmm, hard for me to check well ... nope, just as bad though: my one
> non-HT machine has a fairly high number of failures ... so, not just dual
> processors ... rats ...
>
> It could still be bad file handle management ... possibly on the part of BOINC?

Hi Paul,

I'm running a HT CPU.

Best regards,

Thierry
ID: 6363
Profile Paul D. Buck

Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 6379 - Posted: 4 Mar 2005, 23:38:58 UTC - in response to Message 6363.  

> Hi Paul,

Hi Thierry!

> I'm running a HT CPU.

Me too, on 4 machines. But I have one single-threaded processor among my 5, and it has 0-time results ... so, well, if it were simple, anyone could play ...

> Best regards,

And my best to you ...
ID: 6379
