1) Message boards : ATLAS application : Download failures (Message 40043)
Posted 22 days ago by xii5ku
Post:
Harri Liljeroos wrote:
Could this explain the sudden spike in the number of running jobs in the graphs? I hope so.

Do you mean an increase of tasks in progress?
This is due to a Formula Boinc sprint.
2) Message boards : ATLAS application : Download failures (Message 40030)
Posted 25 days ago by xii5ku
Post:
I'm getting some download errors for tasks that had a timeout on the first host they were sent to. So the files disappeared from the server after one week.

I am seeing this too.
All my download failures are tasks of WUs of which an earlier task failed by "Timed out - no response".
3) Message boards : Sixtrack Application : Inconclusive, valid/invalid results (Message 31132)
Posted 27 Jun 2017 by xii5ku
Post:
Common to all listed below is that all results from the other participants were short runs,


My >4500 inconclusives spanned the whole range of runtimes.

I suspect the fact that yours (or your wingmen's) were all short was probably mere coincidence, and a likely coincidence anyway because the SixTrack project generates so many of those. :-(
4) Message boards : Sixtrack Application : Inconclusive, valid/invalid results (Message 31097)
Posted 26 Jun 2017 by xii5ku
Post:
PS:
Spot checks through my >4500 (and rising) inconclusive results show that both the wingman and I completed with exit status 0. So far I have not seen a single WU with a non-zero exit code in my or the wingman's task. (IOW, nowhere have I seen the exit code 59 which is mentioned in message 31064.)
5) Message boards : Sixtrack Application : Inconclusive, valid/invalid results (Message 31086)
Posted 26 Jun 2017 by xii5ku
Post:
I have doubts that there are special hosts or special OSs causing the high rate of inconclusive results.

Why? Because the rate of inconclusive results is >99.9 % as far as I can see.

My current sixtrack stats:
all (9502)
in progress (755)
validation pending (1708)
validation inconclusive (4527)
valid (1256) --- only 3 after the validator changed, see below
invalid (1)
error (1255)

Comments:

Before the validator change, I think I had only a handful of inconclusive results. Browsing through my >4500 current inconclusives, they all seem to be from after the validator change.

The single invalid one is WU 69714861: 2x cancelled + 3x finished but with different results according to the validator (completed on June 3, June 10, and June 25, i.e. 2x with old validator and 1x with new validator). Therefore this invalid task really is more like inconclusive, because there were two guys who cancelled, and it remains unknown which of the three submitted results was the right one.

The errors include some user-aborted tasks, but most are "finish file present too long" errors.

Of the valid tasks, only 3 (three) have been validated by the new validator. All others had been validated before the new validator was brought online.

(BTW, all of my boxes are Xeon E5 and Xeon E3, all but one with ECC RAM, and they had earned my trust in their results before. Some of them are purpose-built compute nodes for engineering applications, doing Distributed Computing in downtimes. --- Edit: These are Linux boxes, except one Windows box which shows exactly the same picture as the Linux boxes.)
6) Message boards : Sixtrack Application : Inconclusive, valid/invalid results (Message 31050)
Posted 25 Jun 2017 by xii5ku
Post:
Ditto.

I had plenty of SixTrack tasks validated up until June 24, 9:09 UTC. Since then, only 3 (three) more were validated. All other completed SixTrack tasks are either "validation pending" (1/3 of them) or "validation inconclusive" (2/3 of them), and more tasks are continuing to migrate from pending to inconclusive as we speak.

(Edit: I downloaded SixTrack tasks between Wednesday, June 21 20:16 UTC and Saturday, June 24 14:33 UTC. Inconclusive tasks came from this entire timeframe.)

The new validator appears to put a lot more tasks into "inconclusive" state --- for better or worse.
7) Message boards : Sixtrack Application : exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G) (Message 31024)
Posted 24 Jun 2017 by xii5ku
Post:
@planetclown,
you will know that this kind of trouble is ahead if newly downloaded, not yet started tasks are listed with an estimated time remaining of a few minutes or even less than a minute.

Here is what I do on my clients which are in this situation:

    * have "No new tasks" set while I am away
    * download tasks manually, i.e. "Allow new tasks" + "Update"
    * perhaps even suspend CPU activity while downloading
    * when downloads finished, set "No new tasks"
    * shut down client
    * check with "ps ax|grep boinc" that the client is really down
    * make a backup of client_state.xml
    * search and replace all occurrences of ".000000</rsc_fpops_bound>" by "000000</rsc_fpops_bound>" in client_state.xml
    * restart client, resume CPU activity
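
The backup and search-and-replace steps above could be scripted roughly as follows. This is only a sketch: the function names and the ".bak" backup suffix are my own choices, not anything BOINC provides, and it assumes the client is already shut down. Dropping the decimal point in front of six zeros makes each rsc_fpops_bound value 10^6 times larger.

```python
import shutil
from pathlib import Path

# The exact strings from the manual search-and-replace step.
OLD = ".000000</rsc_fpops_bound>"
NEW = "000000</rsc_fpops_bound>"


def raise_fpops_bound(xml_text: str) -> tuple[str, int]:
    """Return the patched text and the number of entries changed."""
    count = xml_text.count(OLD)
    return xml_text.replace(OLD, NEW), count


def patch_file(path: Path) -> int:
    """Back up client_state.xml, then patch it in place.

    The backup name (original name + ".bak") is a hypothetical choice.
    Only run this while the BOINC client is stopped.
    """
    backup = path.with_name(path.name + ".bak")
    shutil.copy2(path, backup)
    patched, count = raise_fpops_bound(path.read_text())
    path.write_text(patched)
    return count
```

Calling patch_file(Path("client_state.xml")) from within the BOINC data directory would then do the backup and the edit in one go; where that directory lives depends on the installation.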


This is based on Crystal Pellet's and Juha's posts in this thread.

I also attempted to work around this by editing the app_version flops value in client_state.xml, but at least one attempt at doing so, with some work having been downloaded earlier, resulted in almost all tasks erroring out right away. Perhaps app_version flops should only be edited while no sixtrack WUs are present on the client.

Good luck in the Formula Boinc sprint. :-)

8) Message boards : Sixtrack Application : exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G) (Message 30984)
Posted 23 Jun 2017 by xii5ku
Post:
Eric, thanks for your reply (I only saw it after I sent my last post), and thanks for chasing all those interrelated problems. As a layman, I hardly have an idea of the hurdles you are encountering on your way to getting this solved eventually.
9) Message boards : Sixtrack Application : exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G) (Message 30983)
Posted 23 Jun 2017 by xii5ku
Post:
Meanwhile, two more of my machines have contracted the same disease as host 10486566: They downloaded a few hundred tasks in one go, and the estimated runtime of these tasks was set to about 10 seconds. Luckily I noticed this before the machines began computing these tasks. (I shut down the clients, edited client_state.xml to make workunit.rsc_fpops_bound 10,000 times larger, and restarted the clients.)
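
An edit like this could be sketched as below. It assumes that each bound appears in client_state.xml as a plain decimal number between <rsc_fpops_bound> tags, written with six decimals; the function name and the regular expression are illustrative only. As with any manual edit, it should only be applied to a backed-up file while the client is shut down.

```python
import re

# Matches one <rsc_fpops_bound>NUMBER</rsc_fpops_bound> element.
# Assumption: the number is written as plain decimal text.
BOUND_RE = re.compile(r"(<rsc_fpops_bound>)([^<]+)(</rsc_fpops_bound>)")


def scale_fpops_bound(xml_text: str, factor: float = 10_000.0) -> str:
    """Multiply every rsc_fpops_bound value in the text by `factor`."""

    def _scale(match: re.Match) -> str:
        value = float(match.group(2)) * factor
        # Write the value back with six decimals (assumed file format).
        return f"{match.group(1)}{value:.6f}{match.group(3)}"

    return BOUND_RE.sub(_scale, xml_text)
```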

Conclusion: You absolutely cannot run Sixtrack exclusively for more than a day on dual-socket machines, unless you are ready for repeated client_state.xml manipulations. (Smaller hosts may take longer to contract this issue, I guess.)

I am looking forward to your fix of Sixtrack's validator.

In addition: Would it be feasible to process all of the generated WUs before they are published, in order to detect all short-running WUs and never send them to your contributors in the first place? A naive implementation of this would require circa 10 to 15 CPU seconds for each generated WU.
10) Message boards : Sixtrack Application : exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G) (Message 30973)
Posted 23 Jun 2017 by xii5ku
Post:
Thanks to all with your help for this; we shall be making sure that these short
runs are treated as outliers in future. Sorry for that. However I think we are
over the worst, at least I hope so. Eric.


"We" are definitely not over the worst.

You are still sending out huge series of 5-second tasks. And these still corrupt the clients by severe miscalculation of app_version.flops.

I still see all of my hosts which run sixtrack tending towards dangerously over-estimated app_version.flops. They are in constant danger of being pushed over the edge by another series of short WUs, after which all proper WUs will error out.

Are you aware that this failure mode is a serious waste of computer time and networking bandwidth of your contributors?

Currently I am spending a lot of time to get host 10486566 back into working order. Current contents of its "Application details" page:
SixTrack 451.07 i686-pc-linux-gnu (pni)
  Number of tasks completed    82
          Max tasks per day    571
      Number of tasks today    573
    Consecutive valid tasks    80
    Average processing rate    7,219.69 GFLOPS
    Average turnaround time    0.01 days

It was even in the 15,000 GFLOPS range previously. Another host with the same hardware is currently listed with 9 GFLOPS for SixTrack 451.07 i686-pc-linux-gnu (pni).
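
To illustrate why an inflated processing rate is so harmful: judging by the error message in this thread's title, "exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G)", the limit appears to be the task's rsc_fpops_bound divided by the estimated app_version.flops. The sketch below assumes exactly that relation and borrows the 180000000 G bound from the title; it is not taken from one of my own tasks, and the function name is made up for the example.

```python
def elapsed_limit_seconds(rsc_fpops_bound: float, app_version_flops: float) -> float:
    """Assumed relation: time limit = FLOP bound / estimated FLOP rate."""
    return rsc_fpops_bound / app_version_flops


GIGA = 1e9
BOUND = 180_000_000 * GIGA  # bound quoted in the thread title (assumption)

# Values from the thread title; the result is close to the quoted 30940.80 s.
title_limit = elapsed_limit_seconds(BOUND, 5817.56 * GIGA)

# The same bound with the two rates reported above for identical hardware.
inflated_limit = elapsed_limit_seconds(BOUND, 7219.69 * GIGA)
sane_limit = elapsed_limit_seconds(BOUND, 9.0 * GIGA)

print(f"title values:   {title_limit:.2f} s")
print(f"7,219.69 GFLOPS: {inflated_limit:.2f} s")
print(f"9 GFLOPS:        {sane_limit:.2f} s")
```

The higher the estimated rate, the shorter the time a task is allowed to run, so a host whose estimate has been inflated by a series of very short WUs will abort normal-length WUs that a host with a realistic estimate would finish without trouble.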

I am still not sure whether host 10486566 can ever be recovered. And even if I succeed, it will just be a temporary win until the next destructive series of 5-second WUs.

Thanks to Crystal Pellet for the pointers.


