Message boards :
Sixtrack Application :
exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G)
Joined: 14 May 15 Posts: 17 Credit: 11,627,311 RAC: 0
Many units are failing with this message, and they do correspond to tasks which exceed the indicated processing time: they stop and abort themselves at that moment. A programming error? There are other units of the same type finishing OK with processing times beyond 30940 seconds. https://lhcathome.cern.ch/lhcathome/result.php?resultid=139382571 <core_client_version>7.6.31</core_client_version> <![CDATA[ <message> exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G) </message> <stderr_txt> </stderr_txt> ]]>
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Thank you for your report; very strange indeed, and there is nothing "special" about these tasks that I can see. I will try and investigate further. Eric.
Joined: 14 May 15 Posts: 17 Credit: 11,627,311 RAC: 0
It keeps happening on that host, so I am suspending similar tasks. Let's see what happens with the wzero_ ones.
Joined: 14 May 15 Posts: 17 Credit: 11,627,311 RAC: 0
Same issue; I am suspending LHC on that host.
Joined: 2 May 07 Posts: 2262 Credit: 175,581,097 RAC: 1,442
The same task was finished by another PC after more than ONE DAY! https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=67361476
Joined: 14 Jan 10 Posts: 1443 Credit: 9,699,774 RAC: 1,389
That host's floating point speed (the 5817.56G in the error message) was reported much too high, so the client calculated a much shorter time limit for finishing a task. Meanwhile that machine is reporting a measured floating point speed of 1957.67 million ops/second. If that value had been used when requesting tasks, you would have had 91946.04 seconds to finish a task.
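The limit in the error message follows directly from dividing the workunit's FLOPS bound by the speed the server believes the host runs at. A small sketch of that arithmetic, using the figures quoted in this thread and assuming both speeds are in GFLOPS (as the "G" suffix in the error message suggests):

```python
# BOINC aborts a task once its elapsed time exceeds rsc_fpops_bound / flops.
# Figures below are from the error message and the post above; the unit
# (GFLOPs / GFLOPS) is an assumption based on the "G" suffix.
rsc_fpops_bound = 180_000_000.00  # GFLOPs allowed for the workunit
inflated_speed = 5817.56          # GFLOPS the server believed the host achieved
measured_speed = 1957.67          # the host's more realistic measured figure

limit_inflated = rsc_fpops_bound / inflated_speed  # ~30940.8 s, the limit that was hit
limit_measured = rsc_fpops_bound / measured_speed  # ~91946.0 s, the limit it should have had

print(f"{limit_inflated:.2f} s vs {limit_measured:.2f} s")
```

So a roughly threefold overestimate of the host's speed cut the allowed runtime to a third of what it should have been, and healthy tasks ran into the limit.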
Joined: 14 May 15 Posts: 17 Credit: 11,627,311 RAC: 0
How could that happen?
Joined: 14 May 15 Posts: 17 Credit: 11,627,311 RAC: 0
And is there a way to identify which tasks have this problem? In my log I see successful and errored tasks that were downloaded at the same time.
Joined: 14 May 15 Posts: 17 Credit: 11,627,311 RAC: 0
It is the sse2 application. I've changed the value of flops in the client_state.xml file, but it reverts to the previous figure of 5817 Gflops... any idea?
Joined: 14 Jan 10 Posts: 1443 Credit: 9,699,774 RAC: 1,389
"It is the sse2 application. I've changed the value of flops in the client_state.xml file, but it reverts to the previous figure of 5817 Gflops..."
Probably all tasks downloaded before you changed to the lower fpops will still have the higher fpops in their workunit settings. If you are already hacking the client_state.xml, you could increase the <rsc_fpops_bound> for those workunits by a factor of 10.
Joined: 14 May 15 Posts: 17 Credit: 11,627,311 RAC: 0
"It is the sse2 application. I've changed the value of flops in the client_state.xml file, but it reverts to the previous figure of 5817 Gflops..."
Thanks for the suggestion, but it also reverts after I change it.
Joined: 22 Mar 17 Posts: 30 Credit: 360,676 RAC: 0
The average processing rate for the x86_64 sse2 version on your host is a hundred or so times larger than it should be. You have had a couple of hundred short-running tasks, and BOINC expects that a task's runtime is proportional to its FLOPS estimate, so short-running tasks like those could have made BOINC think your computer is really super fast.

Projects that have tasks like these are supposed to code their validators so that unusual tasks are marked as runtime outliers. The Sixtrack validator seems to have that code (some app versions for my host show "Consecutive valid tasks" higher than "Number of tasks completed"), but I think there could be a bug in it: some short-running tasks are not marked as runtime outliers and are allowed to influence runtime estimates.

You can help yourself out of this situation by increasing the <rsc_fpops_bound> of Sixtrack tasks to 1000 times larger, or possibly even more. Before you edit client_state.xml you must shut down the BOINC client and make sure the BOINC Manager or your OS doesn't automatically restart it until you are done with the edits.
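The edit suggested above can be scripted rather than done by hand. A minimal sketch, assuming the simple one-value-per-element layout client_state.xml uses; run it only while the BOINC client is fully stopped, and keep a backup copy of the file first:

```python
import re

def scale_fpops_bound(xml_text: str, factor: float = 1000.0) -> str:
    """Multiply every <rsc_fpops_bound> value in client_state.xml text by factor."""
    def bump(match):
        value = float(match.group(1)) * factor
        return f"<rsc_fpops_bound>{value:.6f}</rsc_fpops_bound>"
    return re.sub(r"<rsc_fpops_bound>([^<]+)</rsc_fpops_bound>", bump, xml_text)

# Demo on a one-element snippet (1.8e17 FLOPs, the bound from the error message):
snippet = "<rsc_fpops_bound>180000000000000000.000000</rsc_fpops_bound>"
print(scale_fpops_bound(snippet, factor=1000.0))
```

With the client stopped, read client_state.xml, pass its contents through scale_fpops_bound, and write the result back; the file's location varies by platform, so find it in your BOINC data directory.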
Joined: 14 May 15 Posts: 17 Credit: 11,627,311 RAC: 0
The trick was to shut down the BOINC manager before making changes to client_state.xml. Done that way, either changing the application flops or the WU rsc_fpops_bound works like a charm. Thanks!
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
Thanks to all for your help with this; we shall be making sure that these short runs are treated as outliers in future. Sorry for that. However, I think we are over the worst, at least I hope so. Eric.
Joined: 7 May 17 Posts: 10 Credit: 6,952,848 RAC: 0
"Thanks to all for your help with this; we shall be making sure that these short..."

"We" are definitely not over the worst. You are still sending out huge series of 5-second tasks, and these still corrupt the clients by severe miscalculation of app_version.flops. I still see all of my hosts which run Sixtrack tending towards dangerously over-estimated app_version.flops. They are in constant danger of being pushed over the edge by another series of short WUs, after which all proper WUs will error out. Are you aware that this failure mode is a serious waste of computer time and networking bandwidth for your contributors?

Currently I am spending a lot of time getting host 10486566 back into working order. Current contents of its "Application details" page:
[CODE]SixTrack 451.07 i686-pc-linux-gnu (pni)
Number of tasks completed 82
Max tasks per day 571
Number of tasks today 573
Consecutive valid tasks 80
Average processing rate 7,219.69 GFLOPS
Average turnaround time 0.01 days[/CODE]
It was even in the 15,000 GFLOPS range previously. Another host with the same hardware is currently listed with 9 GFLOPS for SixTrack 451.07 i686-pc-linux-gnu (pni).

I am still not sure whether host 10486566 can ever be recovered. And even if I succeed, it will just be a temporary win until the next destructive series of 5-second WUs. Thanks to Crystal Pellet for the pointers.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
I agree this is a TERRIBLE problem. Since my hands are tied, I am going to try and find and delete all these high-amplitude short runs. (The problem is complicated because not all of them may be genuine, but in your case I think they are.) I shall try and check. I am still WAITING for the fix.

I shall also try and run a few, but it is difficult for me to find the data files as they are deleted from the download directory. I am not sure about the impact of the deletion, but I reckon I have to do it. I can only apologise AGAIN (I am fed up with apologising), but I do not have any authority and I cannot change much. I think the recent fixes for "Tasks not distributed", or the fact that you have over 50 processors, imply that you might well get a whole bunch of short runs. I am on the verge of cracking up over all this, as the problem is KNOWN and UNDERSTOOD but NOT fixed. Eric.
Joined: 7 May 17 Posts: 10 Credit: 6,952,848 RAC: 0
Meanwhile, two more of my machines have contracted the same disease as host 10486566: they downloaded a few hundred tasks in one go, and the estimated runtime of these tasks was set to about 10 seconds. Luckily I noticed this before the machines began computing them. (I shut down the clients, edited client_state.xml for a 10,000 times larger workunit.rsc_fpops_bound, and restarted the clients.)

Conclusion: you absolutely cannot run Sixtrack exclusively for more than a day on dual-socket machines unless you are ready for repeated client_state.xml manipulations. (Smaller hosts may take longer to contract this issue, I guess.) I am looking forward to your fix of Sixtrack's validator.

In addition: would it be feasible to process all of the generated WUs before they are published, in order to detect all short-running WUs and never send them to your contributors in the first place? A naive implementation of this would require circa 10...15 CPU seconds for each generated WU.
Joined: 7 May 17 Posts: 10 Credit: 6,952,848 RAC: 0
Eric, thanks for your reply (I only saw it after I sent my last post), and thanks for chasing all those interrelated problems. As a layman, I hardly have an idea of the hurdles you are encountering on your way to getting this solved eventually.
Joined: 22 Mar 17 Posts: 30 Credit: 360,676 RAC: 0
@Eric: if you need help, I could take a look at the validator. I'm of no use with the science, but I'm good at reading code and finding bugs.
Joined: 12 Jul 11 Posts: 857 Credit: 1,619,050 RAC: 0
OK, thanks for your support. I have TRIED to delete all w-c6 tasks, but you may have some cached locally, and apparently there are more. We are giving up for tonight; I shall look again first thing in the morning. So much for a break! :-) Eric.
©2025 CERN