1)
Message boards :
Number crunching :
Big problem: work units running with negative time.
(Message 25846)
Posted 19 Sep 2013 by Igor Zacharov Post: I'm happy it works so far. Full credit for the solution should go to Pauli Nieminen. Pauli, thank you very much for finding the problem! Your debugging skills played a pivotal role in homing in on the culprit. Igor. |
2)
Message boards :
Number crunching :
Big problem: work units running with negative time.
(Message 25843)
Posted 18 Sep 2013 by Igor Zacharov Post: It occurred to me that it may be difficult to select only the test flow as opposed to the general work. Therefore, the 4465 executables for Linux are also in the production directory. Please give it a try. Thanks, Igor. |
3)
Message boards :
Number crunching :
Big problem: work units running with negative time.
(Message 25842)
Posted 18 Sep 2013 by Igor Zacharov Post: Friends, we may have a remedy for the failed Linux executables: version 4465. It is installed only for Linux machines. It should work now, but we need testing. It is in sixtracktest (you may need to allow test work to flow). Can experts with Linux machines allow 4465 to run (the higher version should be taken automatically) and report back whether the cure worked? Thank you, Igor. |
4)
Message boards :
Number crunching :
LHC BOINC work credits
(Message 25782)
Posted 6 Sep 2013 by Igor Zacharov Post: Since I took the time to go through these postings, I may as well comment. We don't do anything that would artificially lower or increase the credits. All credit assignments follow the BOINC library settings, that is, the NewCredit system. In the past, we experimented with assigning credits based on the runtime of sixtrack, which prevented people from cheating with the old BOINC assignment system. However, I found that our runtime-based credit was close to the standard BOINC credit. Therefore, the standard is now in place. I guess I could easily put a multiplier in there to attract credit-conscious people to the project. However, if all admins start a contest like this, we will create an inflation spectacle, to the dismay and joy of all of you, dear volunteers. Igor. |
5)
Message boards :
Number crunching :
Big problem: work units running with negative time.
(Message 25780)
Posted 6 Sep 2013 by Igor Zacharov Post: There are problems with the Linux version 4463, and they are not related to naming. I have verified that even though the executables have the .exe extension, they are Linux binaries. Therefore, you don't need Wine to run them; it is all very standard. By the way, I have corrected the scripts to name them differently next time. There seems to be a correlation with BOINC client version 7.x under Linux. I'm not suggesting going back to an older client version; we must have a clean executable. We are looking at the compilation of the time accounting routines under Linux, since there is evidence pointing to the mktime routine. It will probably take until Monday to resolve. Sorry for the inconvenience. Igor. |
6)
Message boards :
Number crunching :
new WUs? not for me. why?
(Message 25567)
Posted 10 May 2013 by Igor Zacharov Post: We had to roll back to the previous version of the software. In the new version, there is a problem with unpacking the input file. The distribution of work should function again now. Thank you for your patience and sorry for the trouble. |
7)
Message boards :
Number crunching :
Upload Errors
(Message 25536)
Posted 2 May 2013 by Igor Zacharov Post: All, we have corrected the upload error; thank you for your patience. We are making a few other changes to the system: upgrading the version of the executable and bringing back more of the generated output files. We are running tests for this now. Igor. |
8)
Message boards :
Cafe LHC :
rogue BOINC project: ... - THREAD CLOSED
(Message 24987)
Posted 30 Nov 2012 by Igor Zacharov Post: Dagorath - Jujube (I know you are one and the same person), we provide here a place for free discussion related to our projects. "Cafe" especially is rather open to all possible subjects. It is not permissible to question the personal integrity of the people involved. Your personal feelings about the way projects on the internet are going do not justify in any way the comments you make and the tone in which you make them. In fact, these obsessive attacks make any sensible discussion impossible. This has to stop now, with immediate effect. Igor Zacharov. |
9)
Message boards :
Number crunching :
Unable to verify signature with certificates
(Message 24783)
Posted 4 Sep 2012 by Igor Zacharov Post: Lurker, You may want to restart the boinc client on that machine. Igor. |
10)
Message boards :
Number crunching :
Credits
(Message 24685)
Posted 22 Aug 2012 by Igor Zacharov Post: Petri, the WU#2490815 you indicate as an example is 10M turns and ran for 131801 seconds on your machine and 635141 seconds on your wingman's. Your wingman was elected to deliver the canonical result, therefore the credit for both of you was calculated as:
2012-08-16 06:10:18.4368 [WU#2490815][RESULT#5464810] credit_from_runtime 155.83 = 36000s * 1.87GFLOPS
You see the capping here at 10 hours (36000 seconds), and that is our mistake. The assignment itself was based on your wingman's computer speed. Here is an example, WU#1985443 with 1M turns, where your computer was elected to deliver the canonical result:
2012-07-24 04:44:26.0601 [WU#1985443][RESULT#4393043] credit_from_runtime 811.37 = 35051s * 10.00GFLOPS
Here both you and your wingman were assigned a credit of 811, where a "normal" value of around 150 would be appropriate. When applying the corrections I looked over a period of 3 weeks (from 23/7 till 16/08, with 1M and 10M turns jobs) and summed up all contributions. For your amusement, here is the debug output of the credit calculation for your host:
[RESULT#5887203] raw credit: 367.66 (15883.05 sec, 10.00 est GFLOPS)
[RESULT#5887203] anon platform, scaling by 0.352973 (0.25/0.71)
[RESULT#5887203] anon platform, returning 129.78
[RESULT#5887203] updating HAV PFC 0.88 et 8.82392e-11 turnaround 56011
[RESULT#5887203] get_pfc() returns credit 129.775 mode approx
and for your wingman:
[RESULT#5887204] raw credit: 67.15 (23261.86 sec, 1.25 est GFLOPS)
[RESULT#5887204] [AV#61] normal case. 23262 sec, 1.2 GFLOPS. raw credit: 67.15
[RESULT#5887204] host scale: 1.66 (0.213730/0.128826)
[RESULT#5887204] applying app version scale 1.171
[RESULT#5887204] [AV#61] PFC avgs with 0.16116 (2.90088e+13/1.8e+14)
[RESULT#5887204] updating HAV PFC 0.16 et 1.29233e-10 turnaround 85161
[RESULT#5887204] get_pfc() returns credit 130.483 mode normal
[WU#2685431] assign_credit_set: credit 130.483
Thus the normal credit for your and your wingman's host would be 130 in this case, but:
[WU#2685431][RESULT#5887203] credit_from_runtime 367.66 = 15883s * 10.00GFLOPS
[RESULT#5887203 wlxscan_wcbb6_....._sixvf_boinc743_0] Valid; granted 367.663303 credit [HOST#9964580]
We are now set to apply the default credit calculation, so you don't have to do anything. This is just for your understanding and amusement. |
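The credit_from_runtime log lines quoted in this post are numerically consistent with the standard BOINC credit scale (200 credits per day of work at 1 GFLOPS) applied to a runtime capped at 36000 s. The following is a minimal Python sketch under that assumption. The function names and constants are illustrative and are not taken from the project's validator code.

```python
SECONDS_PER_DAY = 86400
CREDITS_PER_GFLOPS_DAY = 200  # standard BOINC scale, assumed to apply here
RUNTIME_CAP_S = 36000         # the 10-hour cap visible in the log line


def credit_from_runtime(runtime_s, gflops, cap_s=RUNTIME_CAP_S):
    """Runtime-based credit: capped runtime times the claimed host speed."""
    capped = min(runtime_s, cap_s)
    return capped * gflops * CREDITS_PER_GFLOPS_DAY / SECONDS_PER_DAY


def scale_anonymous_platform(raw_credit, scale):
    """Apply the anonymous-platform scaling factor reported in the debug log."""
    return raw_credit * scale


if __name__ == "__main__":
    # WU#2490815: the canonical host ran 635141 s, but only 36000 s counted.
    print(round(credit_from_runtime(635141, 1.87), 2))
    # WU#1985443: 35051 s on a host claiming 10 GFLOPS.
    print(round(credit_from_runtime(35051, 10.0), 2))
    # RESULT#5887203: raw credit, then the 0.352973 anonymous-platform scale.
    raw = credit_from_runtime(15883.05, 10.0)
    print(round(raw, 2), round(scale_anonymous_platform(raw, 0.352973), 2))
```

Because the claimed GFLOPS value multiplies the runtime directly, a host reporting 10 GFLOPS earns several times the credit of an honest 1.87 GFLOPS host for the same wall-clock time, which is the inflation the post describes.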
11)
Message boards :
Number crunching :
Credits
(Message 24681)
Posted 21 Aug 2012 by Igor Zacharov Post: Petri, I guess you wonder why your credits in particular were not upgraded by much, although admittedly you did a lot of work. When designing the system, I did not look at user IDs at all, only at hosts and work units. This is important, since the system must be objective. It calculates the (upgrade=#sixtrack_loops*Flops - given_credits) and applies the change only if the upgrade is positive. No user ID is involved. To analyze further: your hosts in particular use the 10 GF mark. Therefore, when you deliver the canonical result, your credits soar in the runtime-based credit calculation. In this way you already received a large credit for your work, and there was no need to upgrade. I hope this clarifies things. I took your example as an opportunity to explain the strategy. I feel it is important that everybody understands we do not want to break the system that the inventors of BOINC put together. It should be fair on all projects. We just correct our own mistakes. Igor. I decided to investigate some more. |
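The upgrade rule described in this post can be sketched as follows. This is a hedged illustration in Python, not the script that was run against the database. It assumes that the internal sixtrack-based credit (#sixtrack_loops*Flops) is already available per result, and that the comparison is made on per-host totals. The tuple layout and the function name are hypothetical.

```python
from collections import defaultdict


def build_update_table(results):
    """Sum internal and granted credit per host; keep only positive upgrades.

    `results` is an iterable of (host_id, internal_credit, granted_credit)
    tuples. No user ID enters the calculation.
    """
    internal = defaultdict(float)
    granted = defaultdict(float)
    for host_id, internal_credit, granted_credit in results:
        internal[host_id] += internal_credit
        granted[host_id] += granted_credit

    updates = {}
    for host_id in internal:
        upgrade = internal[host_id] - granted[host_id]
        if upgrade > 0:  # credit that was already granted is never lowered
            updates[host_id] = upgrade
    return updates


if __name__ == "__main__":
    # Made-up numbers: host 1 was over-credited by the runtime method,
    # host 2 was under-credited (capped runtime and a zero-credit result).
    sample = [
        (1, 150.0, 811.37),
        (2, 500.0, 155.83),
        (2, 100.0, 0.0),
    ]
    print(build_update_table(sample))
```

In this sample only host 2 appears in the update table. Host 1 keeps the credit it already has, which corresponds to the case explained to Petri in the post.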
12)
Message boards :
Number crunching :
Credits
(Message 24663)
Posted 20 Aug 2012 by Igor Zacharov Post: Following the analysis of the credits given for the long jobs of the last 2 weeks, we have decided to give additional credits based on our internal accounting system. Several problems were exposed thanks to your help. Most of them are due to the choice of crediting based on real time. This had several consequences:
1) The real-time credit was implemented with a cut at 10 hours. Therefore, long-running jobs did not get their due credits.
2) As discussed already in the forum, old BOINC clients do not report real time. Therefore, if an old client delivered the canonical result, all results got zero credits.
3) Clever people implement an Anonymous Platform with an artificially high performance value assigned. A slow platform will then get disproportionately more credits, because the credit is calculated as time * platform_performance.
We have analyzed what the credits would be when using our internal accounting system based on sixtrack-reported values. For each host where our credit system would give more credit, we have built up an update table, which was applied to the database. Where our system would give less credit, we have not touched the assigned values. For most people it gives a few thousand more points; for some it gives a few tens of thousands more points. Please look at your credits, if you care, and if you find problems or discrepancies or have comments, write to this thread or to my private inbox. I will be looking at the system again on Wednesday (22 August). Going forward, we are running with the default credit system. It seems to handle the Anonymous Platform correctly, and it assigns credit values similar to the internal sixtrack accounting. Please report any thoughts or observations. Thank you for your support and patience in this matter. Igor. |
13)
Message boards :
Number crunching :
cabaret !
(Message 24633)
Posted 17 Aug 2012 by Igor Zacharov Post: Friends, most problems with the credits should be corrected now. The "0 credits" problem is corrected, and the long jobs should get the proper credits now. I will post the results of the analysis of what went wrong later, after additional work on the system. Thank you for your support. Igor. |
14)
Message boards :
Number crunching :
Request for SSE4 and/or OpenCL applications
(Message 24422)
Posted 23 Jul 2012 by Igor Zacharov Post: Yes, an executable compiled with the SSE4 option could be there when we next update the system. GPU-based computing will take longer. It is not sufficient to pass the code through a compiler; profound changes are necessary at the algorithmic level. We have it on the roadmap. Igor. |
15)
Message boards :
Number crunching :
How is the SSE3 thing coming along?
(Message 24418)
Posted 23 Jul 2012 by Igor Zacharov Post: Richard, thank you for the suggestions in your posting. We have now implemented the outliers and the credit based on runtime. I have also slightly increased max_wus_to_send and max_wus_in_progress. As for the executables, the ppn and the sse3 binaries are exactly the same, and we believe the difference in runtime compared with sse2 is small. Therefore, we have left things as they are for now. The next change to the executables will be a recompilation due to the introduction of new physics and the arrival of the Mac version in a few weeks' time. With that, we will analyze how the system is working. Thanks to all the experts who helped to make the LHC project better. Igor. |
16)
Message boards :
Number crunching :
Database Error
(Message 24417)
Posted 23 Jul 2012 by Igor Zacharov Post: It should be corrected now, although we don't know why teamid=7228 was added to your database entry. By the way, Team Musketeers, to which you were apparently added, has a different ID. It is probably a bug somewhere in the BOINC software. We are not in a position to trace it, though. Igor. |
17)
Message boards :
Number crunching :
Possible explanation for "No Tasks sent ..."
(Message 24392)
Posted 18 Jul 2012 by Igor Zacharov Post: These are all very good suggestions, and thank you for the analysis. We use the following configuration flags at the moment:
<reliable_on_priority> 1 </reliable_on_priority>
<reliable_max_avg_turnaround> 230400 </reliable_max_avg_turnaround>
<reliable_max_error_rate> 0.100000 </reliable_max_error_rate>
<reliable_reduced_delay_bound> 0.5 </reliable_reduced_delay_bound>
<reliable_priority_on_over> 0 </reliable_priority_on_over>
<reliable_priority_on_over_except_error> 1 </reliable_priority_on_over_except_error>
<min_sendwork_interval> 10 </min_sendwork_interval>
<daily_result_quota> 5 </daily_result_quota>
<ignore_delay_bound> 1 </ignore_delay_bound>
<one_result_per_user_per_wu> 1 </one_result_per_user_per_wu>
<max_wus_to_send> 2 </max_wus_to_send>
<max_wus_in_progress> 3 </max_wus_in_progress>
<resend_lost_results> 1 </resend_lost_results>
<next_rpc_delay> 18000 </next_rpc_delay>
<report_grace_period> 0 </report_grace_period>
I found that a critical parameter is <daily_result_quota>. It seems that the scheduler would only consider hosts "reliable" with that many jobs on record. Therefore, bringing it down allows more work to flow. Please tell me if you have any other suggestions. |
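For readers who want to inspect flags like these programmatically, here is a small Python sketch that reads such a fragment with the standard library. The enclosing <config> element and the function name are assumptions made so that the fragment parses as one document. The flag names and values are copied from the post.

```python
import xml.etree.ElementTree as ET

# A few of the flags from the post, wrapped in an assumed <config> root.
FRAGMENT = """
<config>
  <daily_result_quota> 5 </daily_result_quota>
  <max_wus_to_send> 2 </max_wus_to_send>
  <max_wus_in_progress> 3 </max_wus_in_progress>
  <reliable_max_error_rate> 0.100000 </reliable_max_error_rate>
</config>
"""


def read_flags(xml_text):
    """Return a dict mapping each flag name to its whitespace-stripped value."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: (child.text or "").strip() for child in root}


if __name__ == "__main__":
    flags = read_flags(FRAGMENT)
    for name, value in flags.items():
        print(f"{name} = {value}")
```

The values are kept as strings, because the flags mix integers and floating-point numbers. Convert each one as needed, for example `int(flags["daily_result_quota"])`.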
18)
Message boards :
Number crunching :
Remaining issues
(Message 24358)
Posted 14 Jul 2012 by Igor Zacharov Post: There are several issues I need your help with:
1) Work is not being taken because the disk is full on the client side. We need about 200 MB of space available to store the executable and the data. I see that some machines are not configured correctly or overflow their disk space.
2) The "GET method" messages: this is the most puzzling one. What is this? We have in the scheduler log: Incomplete request received (used GET method - probably a browser) from IP .xx. (hundreds of different IPs), auth , platform , version 0.0.0. The server sends back "xp.get_tag() failed" and does not assign any work.
3) "received old codesign key": it is necessary to reset the project on machines which get this message. |
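For issue 1 in the post above, a volunteer could check the free space on the volume that holds the BOINC data directory with a few lines of Python. This is only a sketch. The 200 MB figure comes from the post, while the function name is hypothetical and the path must be supplied by the user, since the location of the data directory depends on the installation.

```python
import shutil

REQUIRED_BYTES = 200 * 1024 * 1024  # "about 200 MB", as stated in the post


def has_room_for_work(path, required_bytes=REQUIRED_BYTES):
    """Return True if the volume containing `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    # "." is a placeholder; point this at the actual BOINC data directory.
    print(has_room_for_work("."))
```

Note that the BOINC client also applies its own disk usage preferences, so free space on the volume is a necessary condition rather than a guarantee that work will be accepted.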
19)
Message boards :
Number crunching :
Computation Error
(Message 24356)
Posted 14 Jul 2012 by Igor Zacharov Post: Yes, the work which was sent after 7 pm CET on Friday 13 July should have the correct execution time limit set. All the WUs you have listed are from before the cutoff. If in doubt, reset the project. Igor, is the error with the SSE3 WUs now removed. |
20)
Message boards :
Number crunching :
Computation Error
(Message 24352)
Posted 14 Jul 2012 by Igor Zacharov Post: Matthias, what may have happened is that the tasks had already been downloaded to your computer with the wrong time limit, and my changing this parameter on the server after the download could not prevent this crash. What may help is to reset the project. You will get new tasks with the correct limit setting. It would be interesting to find out if this is the right recipe. |
©2024 CERN