1) Message boards : News : BOINC Server up (Message 27653)
Posted 10 Dec 2015 by Profile Ananas
Post:
this website is not active

How did you post then?

p.s.: You can find information here too : http://lhcathome.web.cern.ch/projects/sixtrack - don't forget to explore the links on the right side of that page.
2) Message boards : Number crunching : w-b3_0_job. killing everything (Message 26922)
Posted 22 Oct 2014 by Profile Ananas
Post:
At 100%, when the WU writes massive amounts of those "fort" files, the i/o is so heavy that the BOINC core client no longer gets control and "forgets" to update the heartbeat timestamp for several minutes (while the CPU efficiency goes down to ~10%).

Before they reach 100%, they behave quite normally (for ~45 minutes on my C2Q Windows box). Once they reach 100%, they block everything for several minutes : every HDD access on the PC becomes extremely sluggish and all BOINC tasks (including the LHC job itself) throw heartbeat errors, while the CPU time of w-b3_0_job hardly increases at all.

It is possible that it would work on SCSI HDDs with a huge cache and on SSDs, but for running on a "normal" PC this needs a redesign. It might become even worse if a virus scanner tried to grab and check those files, but a single(!) w-b3_0_job was already too much for my box - I guess if 4 of those had been running at 100%, I wouldn't even have had a chance to abort them.

p.s.: A Linux wingman seems to have survived the workunit that caused the trouble for me, so it might be Windows related. *ix systems tend not to block everything when i/o activity is pending. Maybe you should flag them to be sent solely to Linux hosts.
3) Message boards : Number crunching : Sixtrack has heartbeat failure -- hangs (Message 26841)
Posted 4 Oct 2014 by Profile Ananas
Post:
Heartbeat errors are often a result of excessive HDD i/o. Running one workunit with more than 30 files constantly being written seems to be something a system can still survive easily. With four concurrent workunits, each one writing a lot of data to the HDD - especially if it has to jump between 128 or more open files all the time - the load renders the HDD write cache useless.

If BOINC now tries to evaluate the HDD usage (in order to check the limits), or just wants to copy and rewrite client_state.xml, it might end up at the end of quite a long HDD access queue. While it is stuck in that queue, it does nothing but wait and has no chance to refresh the timestamp in shared memory.

I see a similar effect when I run too many concurrent Malaria WUs, which write a lot of data for checkpointing (it seems they are even using the zlib library functions gzopen/gzwrite). For Malaria it helped to reduce the checkpoint frequency *) - but there is a major disadvantage : if it does not fix the heartbeat bug, your results will be thrown back much further on each restart, so watch your host carefully for a while after you have changed the setting.

*) e.g. "Tasks checkpoint to disk at most every 600 seconds"
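
The mechanism described above can be played through with a toy model. This is only a sketch: the 30-second timeout, the one-second refresh interval and the function name are assumptions for illustration, not values taken from the BOINC source. The client refreshes a timestamp whenever it is not stuck in the disk queue, and the task gives up once the timestamp is older than the timeout.

```python
def first_timeout(stalls, timeout=30.0, beat_every=1.0, run_for=600.0):
    """Return the time at which a task first sees a stale heartbeat, or None.

    stalls: list of (start, end) intervals, in seconds, during which the
    client is assumed to be blocked waiting for disk i/o.
    """
    last_beat = 0.0
    t = 0.0
    while t <= run_for:
        blocked = any(start <= t < end for start, end in stalls)
        if not blocked:
            last_beat = t            # client refreshes the shared timestamp
        elif t - last_beat > timeout:
            return t                 # task sees no heartbeat and gives up
        t += beat_every
    return None


# A 20-second stall is harmless, a 200-second stall is not.
print(first_timeout([(100, 120)]))
print(first_timeout([(100, 300)]))
```

In this model a short stall goes unnoticed, while a stall of several minutes (like the ones caused by heavy "fort" file writes) trips the watchdog shortly after the assumed timeout has passed.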
4) Message boards : Number crunching : Tasks exceeding disk limit (Message 26761)
Posted 2 Oct 2014 by Profile Ananas
Post:
Yep, that's why I came here too - and while watching what happens, I understood why the CPU usage is so low in the startup phase : it creates and grows a lot of "fort.##" files simultaneously, probably causing I/O waits.

Aborting task w-b3_-26000_job.HLLHC_b3_-26000.0732__1__s__62.31_60.32__15_17__5__28.2354_1_sixvf_boinc116_1: exceeded disk limit: 272.38MB > 190.73MB

One current result has 72 fort.## files, 32 of which keep growing fast (~5MB each at the moment); the total slot size is currently ~160MB, so it will surely crash soon.

update : 6.3MB each, 200MB total slot (which already violates the limit)
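
A quick back-of-the-envelope check of these numbers, as a sketch. The limit, file count and file sizes are the ones quoted in this post; treating the 32 growing files as the only relevant disk usage is a simplifying assumption.

```python
LIMIT_MB = 190.73      # disk limit from the abort message above
GROWING_FILES = 32     # fort.## files that keep growing


def growing_total(size_each_mb, count=GROWING_FILES):
    """Combined size of the growing files, in MB."""
    return size_each_mb * count


def size_at_limit(limit_mb=LIMIT_MB, count=GROWING_FILES):
    """Per-file size at which the growing files alone reach the limit."""
    return limit_mb / count


print(growing_total(5.0))          # first observation, ~5MB each
print(growing_total(6.3))          # update, 6.3MB each
print(round(size_at_limit(), 2))   # per-file size where the limit is hit
```

Under that assumption the 32 growing files alone already account for the ~160MB and ~200MB slot sizes observed, and the limit is crossed once each file is just under 6MB.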
5) Message boards : Number crunching : Abnormally short WU times (Message 26753)
Posted 29 Sep 2014 by Profile Ananas
Post:
There is one weird thing about those very short workunits : they run with only a small percentage (~7%) of CPU usage most of the time. That means a result that reports 1.5 minutes of CPU time to the core client actually ran for up to 10 minutes.

Longer LHC WUs do that too, but only in the startup phase, so the inefficient part has less effect on the total runtime.

Unlike some other projects, they do not increase the system time, which would still cause the CPU to be fully loaded (without being counted towards the WU's CPU time). The CPU cores are simply nearly idle in the first few minutes of LHC results, as if the WUs were just taking a nap.
6) Message boards : Number crunching : Host messing up tons of results (Message 26692)
Posted 19 Jul 2014 by Profile Ananas
Post:
The problem with host 10137504 lies in BOINC itself : the server side BOINC software does not really reduce the host's daily quota unless the host has fewer than 2%(!!!) * valid results. But host 10137504 does return a valid result now and then.

I have reported this in several projects that had a similar problem, but it does not seem to have been fixed.

* The quota works like this :
Invalid => Quota -= 1
Valid => Quota *= 2

but would better be :

Invalid => Quota /= 2
Valid => Quota += 1

You can exclude a host completely by setting the quota to -1 by hand; any scheduler contact will then be rejected. But in that case it can no longer report even the results it already has.
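
To illustrate the difference between the two quota rules listed above, here is a small simulation. It is only a sketch : the starting quota and cap of 100 and the floor of 1 are assumptions for illustration, not values taken from the BOINC server code.

```python
def current_rule(quota, valid, cap=100):
    """Rule as described above: invalid -> quota - 1, valid -> quota * 2."""
    return min(cap, quota * 2) if valid else max(1, quota - 1)


def proposed_rule(quota, valid, cap=100):
    """Suggested rule: invalid -> quota / 2, valid -> quota + 1."""
    return min(cap, quota + 1) if valid else max(1, quota // 2)


def simulate(rule, outcomes, start=100):
    """Apply a rule to a sequence of results (True = valid)."""
    quota = start
    for valid in outcomes:
        quota = rule(quota, valid)
    return quota


# A host that returns one valid result after every 20 invalid ones.
outcomes = ([False] * 20 + [True]) * 10

print("current rule :", simulate(current_rule, outcomes))
print("proposed rule:", simulate(proposed_rule, outcomes))
```

With these assumed numbers, a single valid result doubles the quota straight back to the cap under the current rule, whereas the proposed rule keeps such a host close to the floor.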
7) Message boards : Number crunching : Windows XP hosts do not receive work! (Message 26640)
Posted 9 Jul 2014 by Profile Ananas
Post:
No trouble with my XP boxes so far, so it is probably not directly dependent on the OS version.
8) Message boards : Number crunching : Host messing up tons of results (Message 26637)
Posted 9 Jul 2014 by Profile Ananas
Post:
10137504 currently has 13152 inconclusive results and 16 valid ones.
9) Message boards : Number crunching : Server status page (Message 26546)
Posted 24 May 2014 by Profile Ananas
Post:
The status page has a delay in most projects, some are even so slow that you get a browser timeout now and then and - even though it should be cached - it goes down when the database is down.

The first step I would take would be to separate the actual Server status and the Computing status or switch off this or that information block and see how the behaviour changes.

Some projects have parts of the server status page integrated into their start page, so I guess those are the parts that work without delay.


A lazy option (but still a good one) would be to set up a cron driven wget job that downloads the dynamic server status page into a static snapshot and link that snapshot to the start page. For the cron job, 15 minutes should be fine.
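
The snapshot idea can be sketched in a few lines of Python instead of wget. The URL, the file name and the function names below are placeholders, not the project's real paths. The script downloads the dynamic page and replaces the static copy atomically, and it keeps the old snapshot if the live page cannot be fetched (e.g. while the database is down).

```python
import os
import tempfile
import urllib.request

STATUS_URL = "http://example.org/server_status.php"  # hypothetical URL
SNAPSHOT = "server_status_snapshot.html"              # hypothetical file name


def fetch(url, timeout=30):
    """Download the dynamic page; raises OSError subclasses on failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def write_snapshot(html, path):
    """Write to a temp file first so readers never see a partial page."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(html)
    os.chmod(tmp, 0o644)      # mkstemp creates 0600; the web server must read it
    os.replace(tmp, path)     # atomic rename within the same directory


def refresh(path=SNAPSHOT, fetcher=fetch, url=STATUS_URL):
    """Update the snapshot; keep the old one if the live page is unavailable."""
    try:
        html = fetcher(url)
    except OSError:
        return False
    write_snapshot(html, path)
    return True

# Call refresh() from a script that cron runs every 15 minutes.
```

The start page would then link to the static file only, so a slow or broken status page never delays it.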
10) Message boards : Number crunching : Stuck validation inconclusive (Message 26539)
Posted 24 May 2014 by Profile Ananas
Post:
I received a few redelivered ones a few hours ago. They can easily be recognized because most have a shorter deadline (download errors seem not to reduce the deadline) and none of them end with _0 or _1.
11) Message boards : Number crunching : Stuck validation inconclusive (Message 26535)
Posted 24 May 2014 by Profile Ananas
Post:
Not a problem on your host, your wingman has already been reported in this posting.

At some point your workunits will be delivered a third time and when those come back, your results should validate.

The server side scheduler seems to place redeliveries at the end of the queue, and with the currently quite long running workunits it might take some time until they have moved to the top of the queue.
12) Message boards : Cafe LHC : Computing Preferences (Message 26529)
Posted 23 May 2014 by Profile Ananas
Post:
"Use at most: 66% of CPU time" only if you have heat issues. You can increase it if you wish.

The HDD has plenty of space, even ClimatePrediction would be happy there.


There is one setting that I usually increase : "Tasks checkpoint to disk at most every (default=60) seconds". A reliable system should not need that many checkpoints; checkpointing interrupts the calculation and slows down the crunching. My current setting is 10 minutes (600 seconds), but I have already run with 3600 seconds without problems.

If it later turns out that your tasks tend to throw heartbeat errors, you can still decrease the checkpoint period. Heartbeat problems depend very much on the projects (and other programs) that you run concurrently with LHC. If you run LHC and Einstein, the heartbeat watchdog should hardly ever interrupt your results.

The only thing the higher setting would interfere with is "Leave tasks in memory while suspended? NO", but as you have "YES" there, this does not apply to you.
13) Message boards : News : Three Problems, 22nd May. (Message 26528)
Posted 23 May 2014 by Profile Ananas
Post:
... Looks OK now ...

I don't think so. Check the pending ones, all the results with a CPU time less than a second seem to be damaged (stderr contains nothing but the core client version), 1500+ damaged ones pending from today (and maybe 10 not damaged ones).

They just don't appear in the "inconclusive" list yet because the wingmen haven't returned their share and the validator hasn't touched them yet.

Maybe a heat issue, as it is a quadcore laptop with hyperthreading. The other hosts of the same user do not show any similar issues, so it is most likely not a virus scanner that interferes with the application.
14) Message boards : News : Three Problems, 22nd May. (Message 26523)
Posted 23 May 2014 by Profile Ananas
Post:
...
I shall have a look at "unsticking". Eric.

An attempt to fix the two database problems I know about :

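-- Delete results whose workunit no longer exists (orphans).
DELETE FROM result
  WHERE NOT EXISTS (
    SELECT workunit.id
      FROM workunit
      WHERE workunit.id = result.workunitid
  );
-- Change outcome 4 to 2 where the workunit already has a canonical result
-- and needs no further validation.
UPDATE result SET outcome = 2
  WHERE outcome = 4
  AND 0 < (
    SELECT workunit.canonical_resultid
      FROM workunit
      WHERE workunit.id = result.workunitid
      AND workunit.need_validate = 0
  );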


I'm not totally sure that the second one does what we want it to do, so better run it as a "SELECT" (and check a few samples) before the "UPDATE".
If MySQL treats NULL > 0 as true, it will not work properly.

It might also be necessary to manipulate result.server_state; unfortunately there is no single status attribute for the result.

If it becomes too complex ... all those results will sooner or later become orphaned so the "DELETE" SQL statement will catch them ;-)
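
The "SELECT before UPDATE" check can be rehearsed on a throwaway database first. The sketch below uses Python's built-in sqlite3 module with a made-up, minimal version of the two tables (only the columns used in the statements above, with invented sample rows). The real BOINC schema has many more columns and MySQL may differ in details, so this shows the idea and is not a tested fix for the project database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE workunit (id INTEGER PRIMARY KEY,
                       canonical_resultid INTEGER,
                       need_validate INTEGER);
CREATE TABLE result (id INTEGER PRIMARY KEY,
                     workunitid INTEGER,
                     outcome INTEGER);
INSERT INTO workunit VALUES (1, 11, 0);    -- has canonical result 11
INSERT INTO workunit VALUES (2, NULL, 0);  -- canonical_resultid is NULL
INSERT INTO workunit VALUES (3, 0, 1);     -- still needs validation
INSERT INTO result VALUES (11, 1, 1);
INSERT INTO result VALUES (12, 1, 4);      -- should be caught by the UPDATE
INSERT INTO result VALUES (21, 2, 4);      -- the NULL case
INSERT INTO result VALUES (31, 3, 4);      -- workunit not finished yet
INSERT INTO result VALUES (99, 7, 4);      -- orphan, workunit 7 is gone
""")

# Same condition as in the UPDATE statement above.
CONDITION = """
  outcome = 4
  AND 0 < (
    SELECT workunit.canonical_resultid
      FROM workunit
      WHERE workunit.id = result.workunitid
      AND workunit.need_validate = 0
  )
"""

# Dry run: list the rows the UPDATE would touch, without changing anything.
to_update = [row[0] for row in db.execute(
    "SELECT id FROM result WHERE" + CONDITION + "ORDER BY id")]

# Dry run for the DELETE: list the orphaned results.
orphans = [row[0] for row in db.execute("""
    SELECT id FROM result
      WHERE NOT EXISTS (
        SELECT workunit.id FROM workunit
          WHERE workunit.id = result.workunitid)
      ORDER BY id""")]

print("would update:", to_update)
print("would delete:", orphans)
```

In SQLite a comparison against NULL is never true, so the NULL row is skipped here. MySQL should follow the same SQL rule, but that is worth confirming on a copy of the real tables before running the UPDATE.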

#################################

p.s.: There are two types of invalid results. Some run only seconds and return an incomplete stdout (host 10137504 returns tons of those)
15) Message boards : News : Three Problems, 22nd May. (Message 26521)
Posted 23 May 2014 by Profile Ananas
Post:
I have inconclusive results on 3 hosts, only one of them is mildly OCed.

Actually several of them are invalid, as the two other results have already been validated. Mine are stuck at "inconclusive" - the transitioner (I think that's the one that is supposed to do it) "forgot" to switch the state to "invalid".

All invalids (without exception) have one thing in common : the results ran less than a minute on my boxes (credit range 0.04 - 0.06). For the valid partners, the runtimes vary.
16) Message boards : Number crunching : Invalid tasks (Message 26469)
Posted 17 May 2014 by Profile Ananas
Post:
Unusual : All boxes ran Windows and for some reason mine always picked SSE3, where the others picked PNI ... but otoh, I patched my clients to report SSE3 (5.10.28 didn't know that extension yet)

Usual. 'SSE3' and 'Prescott New Instructions' are synonyms, and the applications are identical.

Yes, I already learned that here - it was just surprising that mine always picked the sse3, whereas others picked pni ... until I remembered my core client patch
17) Message boards : Number crunching : Invalid tasks (Message 26466)
Posted 17 May 2014 by Profile Ananas
Post:
2 inconclusive ones :

with SixTrack v451.07 wtest_newnuebb0105__5__s__64.31_59.32__6_8__5__30_1_sixvf_boinc610 waiting for a third result

with SixTrack v451.07 w14_eric_job_tracking_bb_np_nt_fset_240214__13__s__62.31_60.32__10_12__6__82.5_1_sixvf_boinc4540 invalid

In both cases the runtime on my box has been extremely low.

Unusual : All boxes ran Windows and for some reason mine always picked SSE3, where the others picked PNI ... but otoh, I patched my clients to report SSE3 (5.10.28 didn't know that extension yet)

18) Message boards : Number crunching : sixtracktest v450.09 (sse3) windows x86 : CreateProcess() failed (Message 26406)
Posted 9 May 2014 by Profile Ananas
Post:
sixtrack_win32_4517_sse3.exe creates output, but not ZIP files on XP.

stderr
01:04:45 (3216): Can't open init data file - running in standalone mode
01:04:46 (3216): called boinc_finish


fort.6
         SIXTRACR VECTOR VERSION 4.5.17   (with tilt)  --  (last change: 09.05.2014)
 
 
SIXTRACR starts on: 10th of   May     2014, 04 minutes after 01.
 
 
 
         ++++++++++++++++++++++++
         +++++ERROR DETECTED+++++
         ++++++++++++++++++++++++
         RUN TERMINATED ABNORMALLY !!!
 
 
         TRACKING PARAMETER FILE (UNIT 3) IS EMPTY OR  NONEXISTING
 SIXTRACR stop                                                   


fort.93
 SIXTRACR starts very first time
 SIXTRACR retry after unzip of Sixin.zip
 
 SIXTRACR MAINCR 
 SIXTRACR starts on: 10th of   May     2014, 04 minutes after 01.               
  
 SIXTRACR STOP/ABEND copying fort.92
 SIXTRACR stop                                                   


(error messages are expected, as I didn't supply an input file)
19) Message boards : Number crunching : sixtracktest v450.09 (sse3) windows x86 : CreateProcess() failed (Message 26400)
Posted 7 May 2014 by Profile Ananas
Post:
The logs are deleted on finishing the result - the upload error is a result of the missing ZIP routine I guess. The short runtime most likely is a result of the missing ZIP module too, it cannot unzip the workunit.
20) Message boards : Number crunching : sixtracktest v450.09 (sse3) windows x86 : CreateProcess() failed (Message 26398)
Posted 7 May 2014 by Profile Ananas
Post:
sixtrack_win32_4515_sse3.exe run (standalone without input file) on XP x86 :

stderr
13:20:19 (7536): Can't open init data file - running in standalone mode
13:20:20 (7536): called boinc_finish


fort.6
 
         SIXTRACR VECTOR VERSION 4.5.15   (with tilt)  --  (last change: 04.05.2014)
 
 
SIXTRACR starts on: 07th of   May     2014, 27 minutes after 13.
 
 
 
         ++++++++++++++++++++++++
         +++++ERROR DETECTED+++++
         ++++++++++++++++++++++++
         RUN TERMINATED ABNORMALLY !!!
 
 
         TRACKING PARAMETER FILE (UNIT 3) IS EMPTY OR  NONEXISTING
 SIXTRACR stop                                                   


fort.93
 SIXTRACR starts very first time
 SIXTRACR retry after unzip of Sixin.zip
 
 SIXTRACR MAINCR 
 SIXTRACR starts on: 07th of   May     2014, 27 minutes after 13.               
  
 SIXTRACR STOP/ABEND copying fort.92
 SIXTRACR stop                                                   

No message boxes about invalid executables here ... but : unlike sixtrack_win32_4513_sse3.exe, it did not create a Sixout.zip file!

p.s.: sixtrack_win32_4513_sse3.exe contains a module version ID string that sixtrack_win32_4515_sse3.exe does not have :

$Id: boinc_zip.cpp 18195 2009-05-22 21:19:44Z davea $




©2024 CERN