Message boards : Number crunching : New WU -- no check point.

bass4lhc
Joined: 28 Sep 04
Posts: 43
Credit: 249,962
RAC: 0
Message 6953 - Posted: 10 Apr 2005, 23:31:50 UTC - in response to Message 6934.  

> I just discovered this: If you let units stay in memory when suspended, they
> won't restart after having been suspended. Strangely enough, they also
> continue after the machine has been restarted.
> The setting is found under "Your Account" -> "General Preferences" ->
> "Leave applications in memory while preempted?".
>
> Hope this helps you, it certainly helped me.
>
Thank you, this seems to help.
LHC still has a problem, but the software over here runs as it should.
Again, thanks.

ID: 6953

The Gas Giant
Joined: 2 Sep 04
Posts: 309
Credit: 715,258
RAC: 0
Message 6959 - Posted: 11 Apr 2005, 10:26:28 UTC - in response to Message 6944.  
Last modified: 11 Apr 2005, 10:27:00 UTC

> When we checkpoint depends on the specific study and sixtrack version, but i
> believe it is about every 1000 turns.
>
> BUT!!! We only checkpoint if the boinc_time_to_checkpoint() function returns
> true.
> This function is there to limit how often we write to disk, so that people on
> laptops, for example, can save power.
>
> How often we can checkpoint is therefore something that you can limit in your
> preferences.
>

Chrulle,

If someone is running BOINC on a laptop, the battery will last about 20 to 30 minutes, since the CPU and fan will be running flat out (trust me, I've tried it, and the battery was only about 2 months old), so a few extra writes to disk are really not going to affect anything. I believe it is better to write to disk at least every few minutes while the application is running, so as to avoid redoing the work when you shut down due to low battery. Very few people will be crunching on battery power anyway; most will suspend when running on batteries.

Is there a way to override the boinc_time_to_checkpoint() function if you are running on mains power 24/7?

Live long and crunch.

Paul
(S@H1 8888)
BOINC/SAH BETA
ID: 6959

Chrulle
Joined: 27 Jul 04
Posts: 182
Credit: 1,880
RAC: 0
Message 6960 - Posted: 11 Apr 2005, 10:36:03 UTC

No, there is no way to override the boinc_time_to... call.
We could ignore it, but then we would not be following the specifications from Berkeley.

But, every user can set the time themselves. Under "your account" - general preferences, you can set the "write to disk at most every" value to suit you.
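
As a rough illustration of how this gating plays out, here is a minimal Python sketch (not SixTrack or BOINC code). It assumes a checkpoint opportunity every 1000 turns and models the "write to disk at most every" preference as a minimum interval between writes. The class `CheckpointGate`, the function `run`, and the timing figures are all hypothetical; the real decision is made by boinc_time_to_checkpoint() inside the BOINC API.

```python
class CheckpointGate:
    """Toy stand-in for the decision that boinc_time_to_checkpoint() makes."""

    def __init__(self, min_interval_s):
        # Models the "write to disk at most every N seconds" preference.
        self.min_interval_ms = min_interval_s * 1000
        self.last_checkpoint_ms = 0

    def time_to_checkpoint(self, now_ms):
        # True only if enough time has passed since the last disk write.
        return now_ms - self.last_checkpoint_ms >= self.min_interval_ms

    def checkpoint_completed(self, now_ms):
        self.last_checkpoint_ms = now_ms


def run(total_turns, turns_per_checkpoint, ms_per_turn, min_interval_s):
    """Return the turn numbers at which a checkpoint was actually written."""
    gate = CheckpointGate(min_interval_s)
    written_at_turn = []
    for turn in range(1, total_turns + 1):
        now_ms = turn * ms_per_turn  # assumed constant speed, integer milliseconds
        # The application only *offers* a checkpoint every N turns ...
        if turn % turns_per_checkpoint == 0:
            # ... and only writes it if the gate allows a disk write now.
            if gate.time_to_checkpoint(now_ms):
                written_at_turn.append(turn)
                gate.checkpoint_completed(now_ms)
    return written_at_turn
```

Under these assumptions, at 100 ms per turn a 300-second preference gives `run(5000, 1000, 100, 300) == [3000]`, while a 60-second preference lets every 1000-turn opportunity through.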

Chrulle
Research Assistant & Ex-LHC@home developer
Niels Bohr Institute
ID: 6960

Mark Rush
Joined: 1 Oct 04
Posts: 5
Credit: 1,692,856
RAC: 1
Message 6962 - Posted: 11 Apr 2005, 19:07:38 UTC - in response to Message 6960.  

Chrulle (et al.)

What do you suggest is a reasonable time to set for "write to disk" to avoid this problem? I know that on one of my machines it was a MAJOR problem that caused me to eventually detach it from LHC. (And that machine was a 3.6 Pentium... so you probably want it attached! :) )

Mark

> No, there is no way to override the boinc_time_to... call.
> We could ignore it, but then we would not be following the specifications from
> Berkeley.
>
> But, every user can set the time themselves. Under "your account" - general
> preferences, you can set the "write to disk at most every" value to suit you.
>
>
ID: 6962

Chrulle
Joined: 27 Jul 04
Posts: 182
Credit: 1,880
RAC: 0
Message 6965 - Posted: 12 Apr 2005, 7:36:19 UTC
Last modified: 12 Apr 2005, 9:18:52 UTC

Well, I have it set at about 1 minute, but I'd suggest something on the order of 5 minutes.

Can someone who is having the problem send us all the output files?

When you have seen an LHC workunit do a reset, let it run for a while, until it is close to where it normally resets or until another application is switched in. Then go to your BOINC directory and find the "slots" subdirectory. It contains a number of subdirectories named with numbers from 0 upward. Find the one that contains the sixtrack.exe file. In that directory there should also be a boatload of fort.(number) files. Pack all these files into a zip file and send them to us.
Then we will take a look at the problem.
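
For anyone who would rather script the steps above, here is a minimal Python sketch. It assumes the layout described in this post: a "slots" subdirectory with numbered folders, one of which holds a file literally named sixtrack.exe next to the fort.(number) files. The function names and the example paths are made up for illustration, and the executable may be named differently on your machine.

```python
import zipfile
from pathlib import Path


def find_sixtrack_slot(boinc_dir):
    """Return the first slot directory that holds sixtrack.exe, or None."""
    slots = Path(boinc_dir) / "slots"
    if not slots.is_dir():
        return None
    for slot in sorted(slots.iterdir()):
        # Assumption: the executable is literally named sixtrack.exe.
        if slot.is_dir() and (slot / "sixtrack.exe").is_file():
            return slot
    return None


def zip_fort_files(slot_dir, zip_path):
    """Pack every fort.* file from slot_dir into zip_path; return the names."""
    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(slot_dir).glob("fort.*")):
            if path.is_file():
                # Store only the file name, not the full directory path.
                archive.write(path, arcname=path.name)
                names.append(path.name)
    return names
```

A call might look like `slot = find_sixtrack_slot("C:/BOINC")` followed by `zip_fort_files(slot, "lhc_fort_files.zip")`, where both paths are placeholders for your own BOINC directory and output file.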

Chrulle
Research Assistant & Ex-LHC@home developer
Niels Bohr Institute
ID: 6965

littleBouncer
Joined: 23 Oct 04
Posts: 358
Credit: 1,439,205
RAC: 0
Message 6966 - Posted: 12 Apr 2005, 7:51:36 UTC - in response to Message 6965.  
Last modified: 12 Apr 2005, 7:52:02 UTC

@ Chrulle
> Pack all these files into a zip file and
> send them to us.
-----

That sounds easy, but where do we send the zip files?
Is there an FTP server?
If so, what is the URL?

> Then we will take a look at the problem.
>
>
That would be better than trying to explain the problem in English (for non-English-speaking people).

Thanks for the offer
littleBouncer

ID: 6966

FalconFly
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 6967 - Posted: 12 Apr 2005, 9:01:15 UTC - in response to Message 6966.  
Last modified: 12 Apr 2005, 10:36:19 UTC

I seem to have the same Problem.

I have a whole Bunch left running, but I noted none of them actually finished within the last ~3 days.

I didn't have time to take a close look, but I would guess they are permanently restarting at 0 (Checkpoint resets to 0 or none at all), thus they never make it within 60 Minutes.

V4.63, V4.64 and V4.66 ones are running, which I don't quite understand (at least the old ones should finish IMHO, didn't have that Problem before)

They get their CPU shares, but every day I look at them, they're still in the normal cycle with CPU times mostly somewhere below 1 hour, only one shows 1h20m right now.



For now I have increased the "Switch between Projects" from 60 to 90 Minutes.
From the looks, increasing that time should be a suitable workaround at least for the LHC Units that can finish in that timeframe.

--- edit ---
Hm, I've now set it to 720 min (12 hours), so as to force the problematic WorkUnits to finish without switching between projects while they're running.

I do hope the next batches of LHC Units do not have this Problem, 12 hours is "somewhat" pushing the limits ;)
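
To see why a longer switch interval can act as a workaround, here is a toy Python model of the guess above: a task that loses all unsaved progress each time it is switched out. It is only a sketch under that assumption, not a description of what SixTrack 4.64 actually does; the function name and all parameters are hypothetical, and times are whole minutes.

```python
def cpu_minutes_to_finish(work_min, switch_min, checkpoint_min=None,
                          keep_in_memory=False, max_slices=1000):
    """Return total CPU minutes until the task finishes, or None if it never does.

    work_min        CPU minutes the task needs in one uninterrupted run
    switch_min      "switch between projects" interval
    checkpoint_min  spacing of checkpoints in minutes of progress (None = never)
    keep_in_memory  if True, a preempted task keeps all of its progress
    """
    saved = 0  # progress that survives a preemption
    spent = 0  # CPU minutes used so far, including redone work
    for _ in range(max_slices):
        remaining = work_min - saved
        if remaining <= switch_min:
            return spent + remaining  # finishes within this time slice
        spent += switch_min
        reached = saved + switch_min
        if keep_in_memory:
            saved = reached
        elif checkpoint_min:
            # Fall back to the last checkpoint written before preemption.
            saved = (reached // checkpoint_min) * checkpoint_min
        # Otherwise nothing was saved and the task restarts from its old state.
    return None  # gave up: the task keeps redoing the same work
```

In this model a 90-minute unit with no checkpoints never finishes on a 60-minute switch interval (`cpu_minutes_to_finish(90, 60)` returns `None`), but finishes in 90 minutes once the interval is raised to 90 or the task is kept in memory.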
Scientific Network : 45000 MHz - 77824 MB - 1970 GB
ID: 6967

FalconFly
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 6976 - Posted: 12 Apr 2005, 18:06:14 UTC - in response to Message 6967.  
Last modified: 12 Apr 2005, 18:06:34 UTC

[yoda]...dirty workaround it is... dirty indeed[/yoda]

...but it does the Trick, got a few of the showstoppers through already by setting the 12hrs cycle :)
Scientific Network : 45000 MHz - 77824 MB - 1970 GB
ID: 6976

adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 6993 - Posted: 13 Apr 2005, 15:55:53 UTC
Last modified: 13 Apr 2005, 16:22:03 UTC

It would seem that I also have a "looper". This unit keeps resetting, though not every time: I just watched it switch from LHC to P@H, and the LHC unit still shows 00:59:35 CPU time and 11.34% complete. This morning, however, there was ~01:58:00 CPU time used.

It is crunching with SixTrack 4.64 on a 4.19 client. I have the option to "Keep in Memory" enabled, as it seemed to help the zero credit problem. I will keep the unit, and just set LHC's CPU share to a very low level in case anyone there is interested in me trying anything. If I don't hear anything after a while, I'll dump it.

I have zipped all the Fortran files - where should I send them?

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 6993

Gaspode the UnDressed
Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 6994 - Posted: 13 Apr 2005, 16:33:21 UTC
Last modified: 13 Apr 2005, 16:34:07 UTC

We seem to have two threads running on this topic of checkpointing. See Ben Segal's notes here on the issue.

My best suggestion is to reset the project, dumping any 4.64 WUs and freeing them up for a 4.67 WU in a few days.

Gaspode the UnDressed
http://www.littlevale.co.uk
ID: 6994

adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 6995 - Posted: 13 Apr 2005, 18:23:30 UTC

Fine, so in one thread we have a CERN guy saying dump them, and in this thread we have an LHC@Home person asking us to supply information.

Can anyone say "farce"?

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 6995

ric
Joined: 17 Sep 04
Posts: 190
Credit: 649,637
RAC: 0
Message 6999 - Posted: 13 Apr 2005, 20:16:57 UTC - in response to Message 6995.  
Last modified: 13 Apr 2005, 21:18:59 UTC

> My best suggestion is to reset the project, dumping any 4.64 WUs and freeing them up for a 4.67 WU in a few days

That's your point of view; accepted, but I don't share it.

I think that's a *bad* way.

If my understanding is correct, not ALL WUs based on application version 4.64 are *bad* and will be "broken".

The problem might occur only in a pause/restart situation:

for example, when the client(s) are attached to more than one project AND there is work for them, or when the client has to be restarted for whatever reason.

If none of those circumstances apply, just let the client crunch the 4.64 WUs down and you will earn the credit for them. (Some will run longer, some shorter; see LB notes.)

Detaching from a project should only be the last resort for managing the problem.

You are lucky if you run a BOINC client that can suspend all projects other than LHC, at least until the 4.64-based WUs are crunched and gone.

(This function is supported by BOINC client 4.2x and most alpha versions.)
(speaking m$)

On the other hand, and that's amazing, I got several LHC WUs from the 4.64 generation that run fine, preempting/pausing while other projects (predictor/einstein/LHC Alpha/pirates) ARE attached and have work.

So it can't be said in general that every 4.64 WU has *bad moments*.
It looks more like it depends on the circumstances/environment in which it is executed.

I do understand people's frustration, but for me there is no valid reason to detach from LHC while the client still has work to complete.

I may be wrong (always learning), but in my eyes detaching from a project slows down validation considerably: the server side of BOINC has to wait until the deadline passes before it knows that this work will never be returned. Only after that is the work put back in the download queue.

If two fellow crunchers did the job well (or had more luck...), they have to wait
days or weeks until their effort is granted credit, for example.

When you reset/detach, most of the time there is NO WORK to download, and the painful wait restarts...

Please bear this in mind when making your decision: you are here to help the science.
They need you. They need your work. Detaching is not an effective way to help. Basically, it's the user's time and effort that are wasted when projects are reset.

In closing, I would like to invite the "friends of detaching" to reflect, and to try everything else possible before resorting to a reset/detach.


Happy and successful crunching!

_______________________________________________________________
are you the slave of your PC or is the PC your slave?
ID: 6999

Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 7002 - Posted: 13 Apr 2005, 21:16:01 UTC

The reason for the seeming "randomness" is that it boiled down to a timing problem.

When you have loosely coupled, message-passing, asynchronous systems, these are the interesting types of problems you see.
ID: 7002

Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 139
Credit: 2,579
RAC: 0
Message 7003 - Posted: 13 Apr 2005, 21:45:09 UTC - in response to Message 6993.  

adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 7011 - Posted: 14 Apr 2005, 8:43:42 UTC

>>> to try and learn something from a 4.64

I know, I'm a software developer (Fortran for 15+ years, incidentally, although mostly C/C++ now!).

The farce I was referring to was the fact that we seemed to be getting different advice from different CERN people in different threads.

I had seen Chrulle, quite reasonably, ask people in this thread to keep the Fortran output. I had done so, and had offered to keep the WU at low priority so that any debugging could be tried (where, indeed, it still is).

Someone else came along and pointed out that in the other thread, we were being advised by CERN to get rid of 4.64. Now, it is my belief, and experience with the other BOINC projects, that I don't have to do anything to update the project client, it seems to "just arrive".

I am quite happy to post the zipped fortran output, and as I said before, will keep the wu until such times as someone tells me to get rid of it. It is not causing me any problems since there is no new work from LHC anyway.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 7011

Markku Degerholm
Joined: 3 Sep 04
Posts: 212
Credit: 4,545
RAC: 0
Message 7066 - Posted: 18 Apr 2005, 15:09:03 UTC - in response to Message 7011.  

> Someone else came along and pointed out that in the other thread, we were
> being advised by CERN to get rid of 4.64. Now, it is my belief, and experience
> with the other BOINC projects, that I don't have to do anything to update the
> project client, it seems to "just arrive".

Was it me? If so, by 'update' I meant waiting until the client gets updated along with the new workunits. It really should happen automatically... But of course it could be taken as advice to do a manual update in the BOINC manager, which isn't a catastrophe either. If a few results get lost, we can resubmit them or compute them using clusters at CERN.

Anyway, sorry about possible confusion...

Markku Degerholm
LHC@home admin
ID: 7066


©2024 CERN