Message boards : Sixtrack Application : exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G)
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30336 - Posted: 14 May 2017, 19:34:41 UTC
Last modified: 14 May 2017, 19:47:15 UTC

Many units are failing with this message, and they actually correspond to units which exceed the indicated processing time. They stop and abort themselves at that moment.

Programming error? There are other units of the same type finishing OK with processing times beyond 30940 seconds.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=139382571
<core_client_version>7.6.31</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G)
</message>
<stderr_txt>

</stderr_txt>
]]>
ID: 30336
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 30341 - Posted: 14 May 2017, 22:45:33 UTC - in response to Message 30336.  

Thank you for your report; very strange indeed, and there is nothing "special" about these
tasks that I can see. Will try and investigate further. Eric.
ID: 30341
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30343 - Posted: 15 May 2017, 4:27:34 UTC

It continues happening on that host, so I am suspending similar tasks. Let's see what happens with the wzero_ ones.
ID: 30343
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30344 - Posted: 15 May 2017, 4:34:08 UTC

Same issue; suspending LHC on that host.
ID: 30344
maeax

Joined: 2 May 07
Posts: 2071
Credit: 156,090,946
RAC: 103,877
Message 30345 - Posted: 15 May 2017, 6:01:56 UTC

The same task was finished by another PC after more than ONE DAY!

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=67361476
ID: 30345
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30346 - Posted: 15 May 2017, 6:08:59 UTC - in response to Message 30336.  


exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G)

This floating point speed was reported much too high.
Therefore the client has calculated a much shorter time limit to finish.

Meanwhile that machine is reporting a measured floating point speed of 1957.67 million ops/second.
If that value was used when requesting tasks, you would have 91946.04 seconds to finish a task.
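As a rough sketch of the arithmetic behind the error message: the two figures in parentheses are the workunit's FLOPs bound and the host's (wrongly inflated) reported speed, and their quotient is the elapsed time limit the client enforces. The values below are copied from the message above; the formula is a simplification for illustration.

```python
# Sketch: how the elapsed time limit in the error message is derived.
# 180000000.00G is the workunit's rsc_fpops_bound (in GFLOPs) and
# 5817.56G is the host's (wrongly inflated) speed in GFLOPS.
rsc_fpops_bound = 180_000_000.00e9   # total floating-point operations allowed
flops = 5_817.56e9                   # reported host speed, ops per second

time_limit = rsc_fpops_bound / flops
print(time_limit)                    # roughly 30940.8 seconds, as in the message
```

With the realistic speed the host is now reporting, the same bound would allow a much longer run, which is why only the inflated figure causes aborts.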
ID: 30346
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30348 - Posted: 15 May 2017, 9:25:18 UTC - in response to Message 30346.  


exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G)

This floating point speed was reported much too high.
Therefore the client has calculated a much shorter time limit to finish.

Meanwhile that machine is reporting a measured floating point speed of 1957.67 million ops/second.
If that value was used when requesting tasks, you would have 91946.04 seconds to finish a task.


How could that happen?
ID: 30348
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30349 - Posted: 15 May 2017, 9:59:38 UTC - in response to Message 30348.  

And is there a way to identify which tasks have this problem? I see in my log successful and errored tasks that were downloaded at the same time.
ID: 30349
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30350 - Posted: 15 May 2017, 12:34:45 UTC

It is the sse2 application. I've changed the value of flops in the client_state.xml file, but it reverts to the previous figure of 5817 GFLOPS...

Any idea?
ID: 30350
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 30351 - Posted: 15 May 2017, 13:37:33 UTC - in response to Message 30350.  

It is the sse2 application. I've changed the value of flops in the client_state.xml file, but it reverts to the previous figure of 5817 GFLOPS...

Any idea?

Probably all tasks downloaded before you changed to the lower fpops will still have the higher fpops in the workunit settings.

If you are already hacking the client_state.xml, you could increase the <rsc_fpops_bound> for those workunits by a factor of 10.
ID: 30351
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30354 - Posted: 15 May 2017, 16:45:12 UTC - in response to Message 30351.  

It is the sse2 application. I've changed the value of flops in the client_state.xml file, but it reverts to the previous figure of 5817 GFLOPS...

Any idea?

Probably all tasks downloaded before you changed to the lower fpops will still have the higher fpops in the workunit settings.

If you are already hacking the client_state.xml, you could increase the <rsc_fpops_bound> for those workunits by a factor of 10.


Thanks for the suggestion, but it is also reverting after I change it.
ID: 30354
Juha

Joined: 22 Mar 17
Posts: 30
Credit: 360,676
RAC: 0
Message 30357 - Posted: 15 May 2017, 18:31:25 UTC - in response to Message 30336.  

The average processing rate for the x86_64 sse2 version on your host is a hundred or so times larger than it should be.

You have had a couple of hundred short-running tasks. BOINC expects that the runtime of a task is proportional to its FLOPS estimate. Short-running tasks like the ones you have had could have made BOINC think your computer is really super fast.

Projects that have tasks like these are supposed to code their validators so that unusual tasks are marked as runtime outliers. The SixTrack validator seems to have that code (some app versions for my host show Consecutive valid tasks higher than Number of tasks completed), but I think there could be a bug in the code, and some short-running tasks are not marked as runtime outliers and so are allowed to influence runtime estimates.
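The mechanism described above can be shown with a toy calculation. This is a deliberate simplification (BOINC actually uses an exponentially-weighted average over recent tasks), but it illustrates why a 5-second task with a full-length FLOPs estimate makes the host look absurdly fast:

```python
# Toy illustration of how short-running tasks inflate the average
# processing rate (APR). Simplified; not BOINC's actual averaging code.
def apparent_rate(fpops_estimate, runtime_seconds):
    # The client infers speed as "estimated work done / time taken", so a
    # task that finishes far faster than its estimate looks like it ran
    # on an extremely fast machine.
    return fpops_estimate / runtime_seconds

est = 18_000_000.0e9                   # illustrative per-task FLOPs estimate
normal = apparent_rate(est, 30_000.0)  # a full-length SixTrack run
short = apparent_rate(est, 5.0)        # a 5-second "short run"
print(short / normal)                  # the short task looks 6000x faster
```

Once enough of these short tasks feed into the average, the inflated speed makes the time limit for normal tasks far too small, and they abort with the error at the top of the thread.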

You can help yourself out of this situation by increasing the <rsc_fpops_bound> of SixTrack tasks to 1000 times its value, or possibly even more. Before you edit client_state.xml you must shut down the BOINC client and make sure BOINC Manager or your OS doesn't automatically restart it until you are done with the edits.
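A minimal sketch of such an edit, assuming a standard client_state.xml layout. The helper name and usage path are hypothetical; back up the file first, and only run this while the client is fully stopped, as the post above stresses:

```python
# Sketch: multiply every <rsc_fpops_bound> in client_state.xml by 1000.
# Run only while the BOINC client is shut down; keep a backup copy.
import re

FACTOR = 1000.0

def bump_fpops_bound(xml_text, factor=FACTOR):
    """Return xml_text with each <rsc_fpops_bound> value multiplied."""
    def repl(match):
        new_value = float(match.group(1)) * factor
        return "<rsc_fpops_bound>%e</rsc_fpops_bound>" % new_value
    return re.sub(r"<rsc_fpops_bound>\s*([^<\s]+)\s*</rsc_fpops_bound>",
                  repl, xml_text)

# Usage (hypothetical path):
# text = open("client_state.xml").read()
# open("client_state.xml", "w").write(bump_fpops_bound(text))
```

A blanket multiply affects all projects' workunits in the file, so on a multi-project host you may prefer to edit only the SixTrack entries by hand.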
ID: 30357
Trotador

Joined: 14 May 15
Posts: 17
Credit: 11,627,311
RAC: 0
Message 30378 - Posted: 17 May 2017, 18:18:46 UTC

The trick was to shut down BOINC Manager before making changes to client_state.xml. Having done so, either changing the application flops or the WU rsc_fpops_bound works like a charm.

Thanks!
ID: 30378
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 30379 - Posted: 17 May 2017, 18:30:45 UTC

Thanks to all for your help with this; we shall be making sure that these short
runs are treated as outliers in future. Sorry about that. However, I think we are
over the worst, at least I hope so. Eric.
ID: 30379
xii5ku

Joined: 7 May 17
Posts: 10
Credit: 6,952,848
RAC: 0
Message 30973 - Posted: 23 Jun 2017, 14:16:58 UTC - in response to Message 30379.  

Thanks to all for your help with this; we shall be making sure that these short
runs are treated as outliers in future. Sorry about that. However, I think we are
over the worst, at least I hope so. Eric.


"We" are definitely not over the worst.

You are still sending out huge series of 5-second tasks, and these still corrupt the clients through severe miscalculation of app_version.flops.

I still see all of my hosts which run SixTrack tending towards dangerously over-estimated app_version.flops. They are in constant danger of being pushed over the edge by another series of short WUs, after which all proper WUs will error out.

Are you aware that this failure mode is a serious waste of your contributors' computing time and network bandwidth?

Currently I am spending a lot of time to get host 10486566 back into working order. Current contents of its "Application details" page:
SixTrack 451.07 i686-pc-linux-gnu (pni)
  Number of tasks completed    82
          Max tasks per day    571
      Number of tasks today    573
    Consecutive valid tasks    80
    Average processing rate    7,219.69 GFLOPS
    Average turnaround time    0.01 days

It was even in the 15,000 GFLOPS range previously. Another host with the same hardware is currently listed with 9 GFLOPS for SixTrack 451.07 i686-pc-linux-gnu (pni).

I am still not sure whether host 10486566 can ever be recovered. And even if I succeed, it will just be a temporary win until the next destructive series of 5-second WUs.

Thanks to Crystal Pellet for the pointers.
ID: 30973
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 30982 - Posted: 23 Jun 2017, 18:45:39 UTC - in response to Message 30973.  

I agree this is a TERRIBLE problem. Since my hands are tied, I am going to try and
find and delete all these high-amplitude short runs. (The problem is complicated because
not all of them may be genuine, but in your case I think they are.) I shall try and check.
I am still WAITING for the fix.
I shall also try and run a few, but it is difficult for me to find the data files, as they
are deleted from the download directory. I am not sure about the impact of the
deletion, but I reckon I have to do it. I can only apologise AGAIN (I am fed up with
apologising), but I do not have any authority; I cannot change much.

I think the recent fixes for "Tasks not distributed", or the fact that you have over 50 processors,
imply that you might well get a whole bunch of short runs. I am on the verge of
cracking up over all this, as the problem is KNOWN and UNDERSTOOD but NOT fixed.
Eric.
ID: 30982
xii5ku

Joined: 7 May 17
Posts: 10
Credit: 6,952,848
RAC: 0
Message 30983 - Posted: 23 Jun 2017, 18:47:48 UTC - in response to Message 30973.  

Meanwhile, two more of my machines have contracted the same disease as host 10486566: they downloaded a few hundred tasks in one go, and the estimated runtime of these tasks was set to about 10 seconds. Luckily I noticed this before the machines began computing these tasks. (I shut down the clients, edited client_state.xml to make workunit.rsc_fpops_bound 10,000 times larger, and restarted the clients.)

Conclusion: You absolutely cannot run Sixtrack exclusively for more than a day on dual-socket machines, unless you are ready for repeated client_state.xml manipulations. (Smaller hosts may take longer to contract this issue, I guess.)

I am looking forward to your fix of Sixtrack's validator.

In addition: would it be feasible to process all of the generated WUs before they are published, in order to detect all short-running WUs and never send them to your contributors in the first place? A naive implementation of this would require circa 10 to 15 CPU seconds for each generated WU.
ID: 30983
xii5ku

Joined: 7 May 17
Posts: 10
Credit: 6,952,848
RAC: 0
Message 30984 - Posted: 23 Jun 2017, 18:53:10 UTC - in response to Message 30982.  

Eric, thanks for your reply (I only saw it after I sent my last post), and thanks for chasing all those interrelated problems. As a layman, I hardly have an idea of the hurdles you are encountering on your way to getting this solved.
ID: 30984
Juha

Joined: 22 Mar 17
Posts: 30
Credit: 360,676
RAC: 0
Message 30985 - Posted: 23 Jun 2017, 20:30:40 UTC

@Eric

If you need help, I could take a look at the validator. I'm of no use with the science, but I'm good at reading code and finding bugs.
ID: 30985
Eric Mcintosh
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 12 Jul 11
Posts: 857
Credit: 1,619,050
RAC: 0
Message 30987 - Posted: 23 Jun 2017, 22:40:25 UTC - in response to Message 30984.  

OK, thanks for your support. I have TRIED to delete all w-c6 Tasks.
You may have some cached locally. Apparently there are more.
We are giving up for tonight. I shall look again first thing in the morning.
So much for a break! :-) Eric.
ID: 30987


©2024 CERN