Message boards : Sixtrack Application : SIXTRACKTEST
Send message Joined: 14 Jan 10 Posts: 1417 Credit: 9,441,018 RAC: 1,047 |
@Alessio Mereghetti: Some long-runners among these sixtrack test tasks. I currently have running:

w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__6.1_8.1__7__70.5_1_sixvf_boinc106_2
w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__4.1_6.1__7__10.5_1_sixvf_boinc7_0
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__8.1_10.1__7__34.5_1_sixvf_boinc520_2
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__6.1_8.1__7__63_1_sixvf_boinc490_2

On another machine I aborted three of that type with 10^7 turns.
Send message Joined: 2 May 07 Posts: 2242 Credit: 173,897,779 RAC: 2,820 |
Hi Crystal, sixtracktest tasks are all taking very long to finish on this computer: https://lhcathome.cern.ch/lhcathome/results.php?hostid=10592132&offset=0&show_names=0&state=0&appid=10
Send message Joined: 22 Mar 17 Posts: 62 Credit: 14,576,403 RAC: 10,212 |
Have a computer with only sixtracktest for the moment: I got a lot of these today on one PC. Some complete, some error out, at the same time. Most successful ones complete in a couple of seconds.

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 17457.49 (1920000000.00G/109981.44G)</message>
<stderr_txt>
</stderr_txt>
]]>

They quickly progress to about 50%, then go back to 0%, and then progress very slowly. Dumped the rest on that PC.
Send message Joined: 14 Jan 10 Posts: 1417 Credit: 9,441,018 RAC: 1,047 |
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__6.1_8.1__7__63_1_sixvf_boinc490_2

The 4 tasks mentioned before have now been running for 24 hours and 42 minutes. Calculating with elapsed time versus progress %, they will finish after 77.29, 81.89, 95.77 and 98.47 hours, if not killed by '197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED'.
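The extrapolation described above is a simple linear estimate: predicted total runtime ≈ elapsed time divided by the fraction done. A minimal sketch (the function name and sample numbers here are illustrative, not from the project):

```python
def predicted_runtime_hours(elapsed_hours, progress_percent):
    """Linear estimate: predicted total runtime = elapsed / fraction done."""
    return elapsed_hours / (progress_percent / 100.0)

# e.g. a task at 30% after 24 hours is predicted to run 80 hours in total
print(predicted_runtime_hours(24.0, 30.0))  # 80.0
```

As the later post shows, the prediction shifts as the task's per-turn speed varies, so it is only a rough guide.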
Send message Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0 |
The clue might be in the name: sixtrackTEST. Jobs therefore may not work as expected, or even at all. If you only want to run reliable jobs, uncheck the "Run test applications" box. You might also consider turning down your work buffer so you don't get so many; aborting hundreds of tasks isn't helpful. I have only been able to get 12 of these. Some have been short, some are over 30 hrs. The short ones (unstable beam parameters) can be just as useful as the long ones. Any failures can be used to refine the setup and result in better performance of future tasks.
Send message Joined: 1 Dec 05 Posts: 62 Credit: 11,441,610 RAC: 0 |
I received 8 sixtracktest WUs. I also see that some users on this message page have long units of 30-100 hrs. The 8 units I have are next in line to start. All have a guesstimate of 7d 17:08:55. That is a far cry from what others claim! I've had a few of these last month and none of them lasted that long. I hope that holds true for these!!! Pick
Send message Joined: 1 Dec 05 Posts: 62 Credit: 11,441,610 RAC: 0 |
Well, one unit ran for 17 seconds, another for 47 seconds. Maybe the other 4 will be under the time limit! Pick
Send message Joined: 22 Mar 17 Posts: 62 Credit: 14,576,403 RAC: 10,212 |
The clue might be in the name. sixtrackTEST, therefore jobs may not work as expected or even at all. If you only want to run reliable jobs then uncheck the "Run test applications" box. You might also consider turning down your work buffer so you don't get so many. Aborting hundreds of tasks isn't helpful.

I have a buffer of 0.10 days, or 2.4 hours. Barely any buffer at all. But the ETA was only several seconds, and hundreds downloaded on a 32-thread system. I can't help that the project did not set a realistic ETA for the longer tasks. Instead of wasting my PC's time and electricity for them to just end up in an error state, I sent them back for someone else to hopefully complete. I ran my TEST for sixtrackTEST, so unless you have suggestions to fix the errors on the one PC (the other PCs are still running), ya can step down off your pedestal.
Send message Joined: 14 Jan 10 Posts: 1417 Credit: 9,441,018 RAC: 1,047 |
Calculating with elapsed time versus progress% they will finish after 77.29, 81.89, 95.77 and 98.47 hours, if not killed by '197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED'.

After almost 2 days, the predictions are now: 73.17, 74.46, 92.54 and 98.12 hours run time.
Send message Joined: 29 Sep 04 Posts: 281 Credit: 11,866,264 RAC: 0 |
My apologies, Mmonnin, if you felt I was preaching. That was not my intention. I genuinely want to offer assistance where it is requested and I am able to provide it.

When sixtrack tasks were scarce here, people would hoard more than their machines were capable of running before the deadline, and it seemed to take ages for them to be reissued and returned. I have a similar buffer set and limit my (woefully inadequate compared to yours) machines to only accept 5 tasks at a time. Actually, your strategy of aborting all the spares is quite valid as they will, as you say, have been reissued (it looks like they have fixed the reissue delay problem).

With that one host hitting time-limit errors after less than 4 hours, might there be something not right with the benchmark on it, so that it gives a false estimate? [I started writing this before you hid your computers, so it is no longer possible to compare how your other machines have fared, or for others more knowledgeable than me to offer insight.]

I ran one of the tasks you aborted, coincidentally on my only Linux machine, which finished successfully in 2.5 days. Some have finished in tens of seconds, others look like running to 3 days. With these new 10^7 tasks being potentially much longer than standard, there is clearly an issue with the runtime estimate, and the variable 30 seconds to 3 days actual runtime only adds to that confusion. Perhaps they should be confined to -dev until those issues are sorted out, as users here expect things to "just work".
Send message Joined: 24 Oct 04 Posts: 1172 Credit: 54,708,888 RAC: 12,819 |
Maybe we need a due date change with these, so when a wingman doesn't even run these tasks we don't have to wait forever for them to time out and be resent, and then wait again. I won't name any names, but I see several, and I guess it explains why they are "anonymous".
Send message Joined: 14 Jan 10 Posts: 1417 Credit: 9,441,018 RAC: 1,047 |
Maybe we need a due date change with these so when wingman doesn't even run these tasks we don't have to wait forever to have them timed-out and resent and then wait again.

Not such a good idea, I think, to reduce the 15-day deadline. My longest 10^7 task had a CPU time of 4 days, 2 hours, 26 minutes and 15 seconds. Not all crunchers have dedicated LHC machines running 24/7 at 100% CPU. A machine may only work during the daytime, not be allowed to crunch when in use, and could have tasks from other projects too.
Send message Joined: 24 Oct 04 Posts: 1172 Credit: 54,708,888 RAC: 12,819 |
Maybe we need a due date change with these so when wingman doesn't even run these tasks we don't have to wait forever to have them timed-out and resent and then wait again.Not such a good idea, I think, to reduce the 15 days deadline. My longest 10^7-task had a CPU time of 4 days, 2 hours, 26 minutes and 15 seconds.

Well, I am only talking about these TEST units, and you don't get them unless you have the preferences set to do so. THESE should not be run by members that aren't doing this 24/7, since they are "tests" and there aren't very many when we do have them. I have seen members with several of these test WUs that have not even contacted the server for the last 10 days. WHY would it take 15 days for those of us who have the computers to do these FAST? I guarantee that many of the tests I saw will just hit that deadline and eventually be sent back.

BUT while you are here... why do you suppose some members here get over 10,000 credits per single-core Theory task the last couple of months? And then others who do many more of these same tasks get 350 credits?

And as far as these test tasks... well, maybe they should all be over at -dev so they will get completed in a day or two instead of not at all.
Send message Joined: 14 Jan 10 Posts: 1417 Credit: 9,441,018 RAC: 1,047 |
BUT while you are here........why do you suppose some members here get over 10,000 Credits per a single core Theory task the last couple months?

Wrong thread. It's already being discussed in this Theory Application thread, where you were the last poster at the moment.
Send message Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0 |
Hi, sorry for the late reply - just back from vacation. The 10^7-turn jobs are (should be) sent only on sixtracktest (for the time being) due to their duration, so as not to mess up regular production. We are planning to go into production with such a long time range of beam dynamics using split jobs (e.g. 10^7 turns = 10 consecutive jobs * 10^6 turns) instead of only one job. I have asked the scientist submitting these jobs to proceed slowly so as not to flood volunteers with such long jobs.

The first batch of jobs was sent out with the usual delay_bound of ~1 week. We then increased the parameter to 2 weeks, in order to decrease the number of jobs killed because of the deadline not being met (and wasting useful resources). For instance, the four tasks reported by Crystal Pellet in this thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4296&postid=39537#39537
belong to this second batch.

Concerning the WUs with the errors reported by mmonnin
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4296&postid=39551#39551
and maeax
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4296&postid=39535#39535:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=120080417
all tasks had a report deadline 2 weeks after the sent timestamp, and the two failing tasks failed well before that deadline. For
https://lhcathome.cern.ch/lhcathome/result.php?resultid=238806725
the issue seems to be related to the rsc_fpops_bound parameter, whereas for the other task
https://lhcathome.cern.ch/lhcathome/result.php?resultid=238806726
the issue might be something different - maybe a transient problem with the HD:
https://boinc.mundayweb.com/wiki/index.php?title=Process_exited_with_code_2_(0x2,_-254)
A.
Send message Joined: 14 Jan 10 Posts: 1417 Credit: 9,441,018 RAC: 1,047 |
We are planning to go in production with such a long time range of beam dynamics with split jobs (eg 10^7 turns = 10 consecutive jobs * 10^6 turns) instead of only one job.

If I understand that right, you have to wait for the result of the first job before you can create the 2nd consecutive job, wait for the return of the 2nd job before you can create the 3rd consecutive job, and so on up to the 10th and last job.
Send message Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0 |
Exactly - timewise, the final results may come later than going with single 10^7-turn jobs, but the overall BOINC and volunteer processing should be more efficient. In addition, this option would give us the opportunity to resume any job/study from its ending point :)
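The split-job scheme described above can be sketched as a simple chain, where each job starts from the previous job's end state. This is only an illustration of the dependency structure; the function and "state" here are made up and not SixTrack code:

```python
# Toy illustration of 10^7 turns split into 10 chained 10^6-turn jobs.
TURNS_PER_JOB = 1_000_000
TOTAL_TURNS = 10_000_000

def run_job(start_state, turns):
    # Placeholder for tracking `turns` turns from `start_state`;
    # a real job would dump its final beam state for the next job to resume.
    return start_state + turns  # toy "state": just the accumulated turn count

state = 0
for _ in range(TOTAL_TURNS // TURNS_PER_JOB):
    # Job N+1 can only be created after job N's result is returned.
    state = run_job(state, TURNS_PER_JOB)

print(state)  # 10000000 turns tracked in total
```

The chain trades latency (each job waits for its predecessor) for shorter, deadline-friendly individual tasks and the ability to resume a study mid-way.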
Send message Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0 |
I think that your host was hit by some very short (successful) tasks (with basically no dynamic aperture, a perfectly physical case), which led the BOINC server to think that the host is super-fast. The FPOPs in the error messages:

<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 17457.49 (1920000000.00G/109981.44G)</message>
<stderr_txt>
</stderr_txt>
]]>

are too high to be real: 109981.44G. We had a similar issue in 2017:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4273
which led to updating the validator, but I think we need to fine-tune it even further. Thanks for the precious feedback! A.
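The numbers in parentheses in that error message appear to be the work unit's floating-point operation bound divided by the server's speed estimate for the host, which yields the elapsed-time limit. A sketch of that arithmetic (an assumption based on the message format, not verified against BOINC source):

```python
# Reproducing the "(1920000000.00G/109981.44G)" arithmetic from the error:
# elapsed-time limit = rsc_fpops_bound / estimated host speed.
rsc_fpops_bound = 1920000000.00e9  # allowed floating-point ops for the task
host_speed_est = 109981.44e9       # server's (inflated) estimate, flops

time_limit_seconds = rsc_fpops_bound / host_speed_est
print(round(time_limit_seconds, 2))  # 17457.49, i.e. under 5 hours
```

With an inflated speed estimate in the denominator, the limit comes out far too short for a genuinely long task, which matches the behaviour reported in this thread.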
Send message Joined: 24 Oct 04 Posts: 1172 Credit: 54,708,888 RAC: 12,819 |
|
Send message Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0 |
Thanks for pointing this out. It is not clear what happened (I cannot even see the owner of the machine) - it seems the machine did not even start the others...
©2024 CERN