Message boards : Number crunching : New WU -- no check point.
bass4lhc
Joined: 28 Sep 04
Posts: 43
Credit: 249,962
RAC: 0
Message 6953 - Posted: 10 Apr 2005, 23:31:50 UTC - in response to Message 6934.  

> I just discovered this: if you let units stay in memory when suspended, they
> won't restart from the beginning after having been suspended. Strangely enough,
> they also continue after the machine has been restarted.
> The setting is found under "Your Account" -> "General Preferences" ->
> "Leave applications in memory while preempted?".
>
> Hope this helps you, it certainly helped me.
>
Thank you, this seems to help.
LHC still has a problem, but the software over here now runs as it should.
Again, thanks.

ID: 6953
The Gas Giant
Joined: 2 Sep 04
Posts: 309
Credit: 715,258
RAC: 0
Message 6959 - Posted: 11 Apr 2005, 10:26:28 UTC - in response to Message 6944.  
Last modified: 11 Apr 2005, 10:27:00 UTC

> When we checkpoint depends on the specific study and sixtrack version, but I
> believe it is about every 1000 turns.
>
> BUT!!! We only checkpoint if the boinc_time_to_checkpoint() function returns
> true.
> This function is there to limit how often we write to disk, so that people on
> laptops, for example, can save power.
>
> How often we can checkpoint is therefore something that you can limit in your
> preferences.
>

Chrulle,

If someone is running BOINC on a laptop, the battery will last about 20 to 30 minutes, since the CPU and fan will be running flat out (trust me, I've tried it, and the battery was only about two months old), so a few extra writes to disk are really not going to affect anything. I believe it is better to write to disk at most every few minutes while the application is running, so as to avoid redoing the work when you shut down due to low battery. Very few people will be running on batteries anyway; most will suspend BOINC when on battery power.

Is there a way to override the boinc_time_to_checkpoint() function if you are running on mains power 24/7?

Live long and crunch.

Paul
(S@H1 8888)
BOINC/SAH BETA
ID: 6959
Chrulle
Joined: 27 Jul 04
Posts: 182
Credit: 1,880
RAC: 0
Message 6960 - Posted: 11 Apr 2005, 10:36:03 UTC

No, there is no way to override the boinc_time_to_checkpoint() call.
We could ignore it, but then we would not be following the specifications from Berkeley.

But every user can set the time themselves: under "Your Account" -> "General Preferences" you can set the "write to disk at most every" value to suit you.
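For anyone curious what that pattern looks like in code, here is a minimal sketch using the standard BOINC API (boinc_api.h). It is not the actual sixtrack source; the total turn count, the per-check interval and the state file name are purely illustrative assumptions.

// Minimal sketch (not the real sixtrack code) of the checkpoint pattern
// described above, using the standard BOINC API from boinc_api.h.
#include <cstdio>
#include "boinc_api.h"

static const int TOTAL_TURNS     = 100000;  // assumed length of one study
static const int TURNS_PER_CHECK = 1000;    // "about every 1000 turns"

static void write_state(int turn) {
    // Hypothetical state file; the real application writes its fort.* files.
    if (FILE* f = std::fopen("checkpoint_state.txt", "w")) {
        std::fprintf(f, "%d\n", turn);
        std::fclose(f);
    }
}

int main() {
    boinc_init();
    for (int turn = 1; turn <= TOTAL_TURNS; ++turn) {
        // ... track one turn of the particles here ...

        if (turn % TURNS_PER_CHECK == 0) {
            // Only write state when the core client says it is time, i.e.
            // when the user's "write to disk at most every N seconds"
            // preference interval has elapsed since the last checkpoint.
            if (boinc_time_to_checkpoint()) {
                write_state(turn);
                boinc_checkpoint_completed();
            }
            boinc_fraction_done((double)turn / TOTAL_TURNS);
        }
    }
    boinc_finish(0);
}

So even when the application reaches a 1000-turn boundary, nothing is written unless the client-side "write to disk at most every" interval has also elapsed.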

Chrulle
Research Assistant & Ex-LHC@home developer
Niels Bohr Institute
ID: 6960
Mark Rush
Joined: 1 Oct 04
Posts: 5
Credit: 1,690,904
RAC: 12
Message 6962 - Posted: 11 Apr 2005, 19:07:38 UTC - in response to Message 6960.  

Chrulle (et al.),

What do you suggest as a reasonable "write to disk" setting to avoid this problem? I know that on one of my machines it was a MAJOR problem that eventually caused me to detach it from LHC. (And that machine was a 3.6 GHz Pentium... so you probably want it attached! :) )

Mark

> No, there is no way to override the boinc_time_to_checkpoint() call.
> We could ignore it, but then we would not be following the specifications from
> Berkeley.
>
> But every user can set the time themselves: under "Your Account" -> "General
> Preferences" you can set the "write to disk at most every" value to suit you.
>
>
ID: 6962
Chrulle
Joined: 27 Jul 04
Posts: 182
Credit: 1,880
RAC: 0
Message 6965 - Posted: 12 Apr 2005, 7:36:19 UTC
Last modified: 12 Apr 2005, 9:18:52 UTC

Well, I have it set at about 1 minute, but I'll suggest something on the order of 5 minutes.

Can someone who is having the problem send us all the output files?

When you have seen an LHC workunit do a reset, let it run for a while, until it is close to the point where it normally resets or until another application is switched in. Then go to your BOINC directory and find the "slots" subdirectory. There will be a number of subdirectories in there, named with numbers from 0 and up. Find the one that contains the sixtrack.exe file. In that directory there should also be a boatload of fort.(number) files. Pack all these files into a zip file and send them to us.
Then we will take a look at the problem.
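If you prefer to script that last step, here is a small C++17 sketch (purely illustrative, and an assumption rather than anything shipped with BOINC) that locates the slot holding sixtrack.exe and lists the fort.* files; the actual packing is then done with whatever zip tool you normally use. It expects the BOINC data directory as its command-line argument.

// Illustrative helper only: find the slot directory that holds sixtrack.exe
// and list the fort.* files in it, so you know what to pack into the zip.
// Assumes a C++17 compiler; pass your BOINC data directory as the argument.
#include <filesystem>
#include <iostream>
#include <string>

namespace fs = std::filesystem;

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::cerr << "usage: findslot <path-to-BOINC-directory>\n";
        return 1;
    }
    const fs::path slots = fs::path(argv[1]) / "slots";
    if (!fs::is_directory(slots)) {
        std::cerr << "no slots directory under " << argv[1] << "\n";
        return 1;
    }
    for (const auto& slot : fs::directory_iterator(slots)) {
        if (!slot.is_directory()) continue;
        if (!fs::exists(slot.path() / "sixtrack.exe")) continue;  // not the LHC slot
        std::cout << "LHC slot: " << slot.path().string() << "\n";
        for (const auto& f : fs::directory_iterator(slot.path())) {
            const std::string name = f.path().filename().string();
            if (name.rfind("fort.", 0) == 0)                      // starts with "fort."
                std::cout << "  pack: " << name << "\n";
        }
    }
    return 0;
}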

Chrulle
Research Assistant & Ex-LHC@home developer
Niels Bohr Institute
ID: 6965
littleBouncer
Joined: 23 Oct 04
Posts: 358
Credit: 1,439,205
RAC: 0
Message 6966 - Posted: 12 Apr 2005, 7:51:36 UTC - in response to Message 6965.  
Last modified: 12 Apr 2005, 7:52:02 UTC

@ Chrulle
> Pack all these files into a zip file and
> send them to us.
-----

That sounds easy, but where do we send the zip files?
Is there an FTP server?
If so, what is the URL?

> Then we will take a look at the problem.
>
>
That would be better than trying to explain the problem in English (for non-English-speaking people).

Thanks for the offer
littleBouncer

ID: 6966
FalconFly
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 6967 - Posted: 12 Apr 2005, 9:01:15 UTC - in response to Message 6966.  
Last modified: 12 Apr 2005, 10:36:19 UTC

I seem to have the same problem.

I have a whole bunch left running, but I noticed that none of them actually finished within the last ~3 days.

I didn't have time to take a close look, but I would guess they are permanently restarting at 0 (the checkpoint resets to 0, or there is none at all), so they never finish within the 60 minutes before a project switch.

V4.63, V4.64 and V4.66 units are running, which I don't quite understand (at least the old ones should finish, IMHO; I didn't have this problem before).

They get their CPU shares, but every day I look at them they are still in the normal cycle, with CPU times mostly somewhere below 1 hour; only one shows 1h20m right now.

For now I have increased "Switch between projects" from 60 to 90 minutes.
From the looks of it, increasing that time should be a suitable workaround, at least for the LHC units that can finish in that timeframe.

--- edit ---
Hm, I've now set it to 720 minutes (12 hours), so as to force the problematic workunits to finish without switching between projects while they're running.

I do hope the next batches of LHC units do not have this problem; 12 hours is "somewhat" pushing the limits ;)
Scientific Network : 45000 MHz - 77824 MB - 1970 GB
ID: 6967
FalconFly
Joined: 2 Sep 04
Posts: 121
Credit: 592,214
RAC: 0
Message 6976 - Posted: 12 Apr 2005, 18:06:14 UTC - in response to Message 6967.  
Last modified: 12 Apr 2005, 18:06:34 UTC

[yoda]...dirty workaround it is... dirty indeed[/yoda]

...but it does the trick; I've already got a few of the showstoppers through by setting the 12-hour cycle :)
Scientific Network : 45000 MHz - 77824 MB - 1970 GB
ID: 6976
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 6993 - Posted: 13 Apr 2005, 15:55:53 UTC
Last modified: 13 Apr 2005, 16:22:03 UTC

It would seem that I also have a "looper". This unit keeps resetting. It does not reset every time, though: I just watched it switch from LHC to P@H, and the LHC unit still shows 00:59:35 CPU time and 11.34% complete. This morning, however, there was ~01:58:00 of CPU time used.

It is crunching with SixTrack 4.64 on a 4.19 client. I have the "keep in memory" option enabled, as it seemed to help the zero-credit problem. I will keep the unit and just set LHC's CPU share to a very low level, in case anyone there is interested in having me try anything. If I don't hear anything after a while, I'll dump it.

I have zipped all the Fortran files - where should I send them?

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 6993
Gaspode the UnDressed
Joined: 1 Sep 04
Posts: 506
Credit: 118,619
RAC: 0
Message 6994 - Posted: 13 Apr 2005, 16:33:21 UTC
Last modified: 13 Apr 2005, 16:34:07 UTC

We seem to have two threads running on this topic of checkpointing. See Ben Segal's notes here on the issue.

My best suggestion is to reset the project, dumping any 4.64 WUs and freeing them up for a 4.67 WU in a few days.

Gaspode the UnDressed
http://www.littlevale.co.uk
ID: 6994
adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 6995 - Posted: 13 Apr 2005, 18:23:30 UTC

Fine, so in one thread we have a CERN guy saying dump them, and in this thread we have an LHC@Home person asking us to supply information.

Can anyone say "farce"?

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 6995
ric
Joined: 17 Sep 04
Posts: 190
Credit: 649,637
RAC: 0
Message 6999 - Posted: 13 Apr 2005, 20:16:57 UTC - in response to Message 6995.  
Last modified: 13 Apr 2005, 21:18:59 UTC

> My best suggestion is to reset the project, dumping any 4.64 WUs and freeing them up for a 4.67 WU in a few days

That's your point of view, and I accept it, but I don't share it.

I think that's a *bad* way.

If my understanding is correct, not ALL WUs based on application version 4.64 are *bad* and will be "broken".

The problem only tends to occur in a pause/restart situation:

for example, when the client is attached to more than one project AND there is work for them, or when the client has to be restarted for whatever reason.

If none of those circumstances apply, just let the client crunch the 4.64 WUs down and you will earn the credit for them. (Some will run longer, some shorter; see LB's notes.)

Detaching from a project should only be the last resort for managing the problem.

You are lucky if you run a BOINC client that can suspend all projects other than LHC, at least until the 4.64-based WUs are crunched and gone

(this function is supported by BOINC client 4.2x and most alpha versions)
(speaking of Windows here)

On the other hand, and this is what amazes me, I have several LHC WUs from the 4.64 generation that run fine, preempting and pausing, while other projects (Predictor/Einstein/LHC Alpha/Pirates) ARE attached and have work.

So it cannot be said in general that every 4.64 WU has *bad moments*.
It looks like it depends more on the circumstances/environment in which it is executed.

I do understand people's frustration, but for me there is no valid reason to detach from LHC while the client still has work to complete.

Perhaps I'm wrong, I'm always learning, but in my eyes detaching from a project slows validation down considerably: the server side of BOINC has to wait until the deadline arrives before it knows that this work will never return. Only after that is the work put back into the download queue.

If two fellow crunchers did the job well (or had more luck...), they then have to wait
days or weeks until their effort is granted credit, for example.

And when you reset/detach, most of the time there is NO WORK to download, and the painful wait restarts...

Please include in your thinking and decisions that you are here to help the science; they need you, and they need your work. Detaching is not an effective way to help. Basically, it is the user's time and effort that are wasted when projects are reset.

In closing, I would like to invite the "friends of detaching" to reflect, and to try everything else possible before resorting to a reset/detach.


Happy and successful crunching!

_______________________________________________________________
are you the slave of your PC or is the PC your slave?
ID: 6999
Paul D. Buck
Joined: 2 Sep 04
Posts: 545
Credit: 148,912
RAC: 0
Message 7002 - Posted: 13 Apr 2005, 21:16:01 UTC

The reason for the seeming "randomness" is that it boils down to a timing problem.

When you have loosely-coupled, message-passing, asynchronous systems, these are the interesting kinds of problems you see.
ID: 7002
Ben Segal
Volunteer moderator
Project administrator
Joined: 1 Sep 04
Posts: 139
Credit: 2,579
RAC: 0
Message 7003 - Posted: 13 Apr 2005, 21:45:09 UTC - in response to Message 6993.  

adrianxw
Joined: 29 Sep 04
Posts: 187
Credit: 705,487
RAC: 0
Message 7011 - Posted: 14 Apr 2005, 8:43:42 UTC

>>> to try and learn something from a 4.64

I know; I'm a software developer (Fortran for 15+ years, incidentally, although mostly C/C++ now!).

The farce I was referring to was the fact that we seemed to be getting different advice from different CERN people in different threads.

I had seen Chrulle, quite reasonably, ask people in this thread to keep the Fortran output. I had done so, and had offered to keep the WU at low priority so that any debugging could be tried (where indeed it still is), etc.

Someone else came along and pointed out that in the other thread we were being advised by CERN to get rid of 4.64. Now, it is my belief, and my experience with the other BOINC projects, that I don't have to do anything to update the project client; it seems to "just arrive".

I am quite happy to post the zipped Fortran output and, as I said before, will keep the WU until such time as someone tells me to get rid of it. It is not causing me any problems, since there is no new work from LHC anyway.

Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 7011
Markku Degerholm
Joined: 3 Sep 04
Posts: 212
Credit: 4,545
RAC: 0
Message 7066 - Posted: 18 Apr 2005, 15:09:03 UTC - in response to Message 7011.  

> Someone else came along and pointed out that in the other thread, we were
> being advised by CERN to get rid of 4.64. Now, it is my belief, and experience
> with the other BOINC projects, that I don't have to do anything to update the
> project client, it seems to "just arrive".

Was it me? If so, by 'update' I meant waiting until the application gets updated along with the new workunits. It really should happen automatically... But of course it could be read as advice to do a manual update in the BOINC manager, which isn't a catastrophe either. If a few results get lost, we can resubmit them or compute them on clusters at CERN.

Anyway, sorry about any possible confusion...

Markku Degerholm
LHC@home admin
ID: 7066