Message boards : Sixtrack Application : SIXTRACKTEST

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39537 - Posted: 9 Aug 2019, 9:07:56 UTC

@Alessio Mereghetti:

Some long-runners in between these sixtrack test tasks. I have running

w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__6.1_8.1__7__70.5_1_sixvf_boinc106_2
w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__4.1_6.1__7__10.5_1_sixvf_boinc7_0
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__8.1_10.1__7__34.5_1_sixvf_boinc520_2
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__6.1_8.1__7__63_1_sixvf_boinc490_2

On another machine, I aborted three of that type with 10^7 turns.
ID: 39537
maeax

Joined: 2 May 07
Posts: 2242
Credit: 173,897,779
RAC: 2,820
Message 39538 - Posted: 9 Aug 2019, 9:44:21 UTC - in response to Message 39537.  

Hi Crystal,
sixtracktest has been finishing all tasks so far on this computer:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10592132&offset=0&show_names=0&state=0&appid=10
ID: 39538
mmonnin

Joined: 22 Mar 17
Posts: 62
Credit: 14,576,403
RAC: 10,212
Message 39551 - Posted: 9 Aug 2019, 21:59:49 UTC - in response to Message 39535.  

Have a computer with only sixtracktest for the moment:
The wingman had this error:
exceeded elapsed time limit 13480.39 (1920000000.00G/109981.44G)
https://lhcathome.cern.ch/lhcathome/result.php?resultid=238806725


I have a lot of these today on one PC. Some complete, some error out at the same time. Most of the successful ones complete in a couple of seconds.
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 17457.49 (1920000000.00G/109981.44G)</message>
<stderr_txt>

</stderr_txt>
]]>
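
For reference, the limit in that message is just the task's rsc_fpops_bound divided by the server's speed estimate for the host. A minimal sketch of the arithmetic (Python; both numbers are taken from the error message above):

# BOINC aborts a task once its elapsed time exceeds rsc_fpops_bound
# divided by the estimated host speed.
rsc_fpops_bound = 1.92e18        # "1920000000.00G" floating-point ops
est_host_flops = 1.0998144e14    # "109981.44G" ops/sec (clearly inflated)

limit_s = rsc_fpops_bound / est_host_flops
print(f"elapsed time limit: {limit_s:.2f} s")   # ~17457.49 s, as reported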

They quickly progress to about 50%, then drop back to 0%, and then progress very slowly. I dumped the rest on that PC.
ID: 39551
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39556 - Posted: 10 Aug 2019, 7:54:35 UTC - in response to Message 39537.  

w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__6.1_8.1__7__63_1_sixvf_boinc490_2
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__8.1_10.1__7__34.5_1_sixvf_boinc520_2
w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__6.1_8.1__7__70.5_1_sixvf_boinc106_2
w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__4.1_6.1__7__10.5_1_sixvf_boinc7_0
The 4 tasks mentioned before have now been running for 24 hours and 42 minutes.
Extrapolating elapsed time against progress %, they will finish after 77.29, 81.89, 95.77 and 98.47 hours, if not killed by '197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED'.
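
A minimal sketch of that linear extrapolation (Python; the progress fractions are illustrative, back-computed from the predictions - only the elapsed time is taken from the post):

# Predicted total runtime = elapsed time / progress fraction.
elapsed_h = 24 + 42 / 60          # 24 hours and 42 minutes

# Hypothetical progress fractions for the four tasks (illustration only)
for progress in (0.3196, 0.3016, 0.2579, 0.2508):
    print(f"{progress:.1%} done -> ~{elapsed_h / progress:.2f} h total")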
ID: 39556
Ray Murray
Volunteer moderator

Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 39559 - Posted: 10 Aug 2019, 8:36:21 UTC - in response to Message 39551.  
Last modified: 10 Aug 2019, 8:39:49 UTC

The clue might be in the name: sixtrackTEST. Jobs may therefore not work as expected, or even at all. If you only want to run reliable jobs, then uncheck the "Run test applications" box. You might also consider turning down your work buffer so you don't get so many. Aborting hundreds of tasks isn't helpful.
I have only been able to get 12 of these. Some have been short; some are over 30 hrs. The short ones (unstable beam parameters) can be just as useful as the long ones. Any failures can be used to refine the setup and result in better performance for future tasks.
ID: 39559
Profile Robert Pick

Joined: 1 Dec 05
Posts: 62
Credit: 11,441,610
RAC: 0
Message 39565 - Posted: 10 Aug 2019, 18:18:19 UTC

I received 8 Sixtracktest WUs. I also see that some users on this message page have long units of 30-100 hrs. The 8 units I have are next in line to start. All have a guesstimate of 7d 17:08:55. That is a far cry from what others claim! I've had a few of these last month and none of them lasted that long. I hope that holds true for these!!! Pick
ID: 39565
Profile Robert Pick

Joined: 1 Dec 05
Posts: 62
Credit: 11,441,610
RAC: 0
Message 39567 - Posted: 10 Aug 2019, 19:02:25 UTC

Well, one unit ran for 17 sec, another for 47 sec. Maybe the other 4 will be under the time limit! Pick
ID: 39567
mmonnin

Joined: 22 Mar 17
Posts: 62
Credit: 14,576,403
RAC: 10,212
Message 39572 - Posted: 10 Aug 2019, 23:17:49 UTC - in response to Message 39559.  

The clue might be in the name: sixtrackTEST. Jobs may therefore not work as expected, or even at all. If you only want to run reliable jobs, then uncheck the "Run test applications" box. You might also consider turning down your work buffer so you don't get so many. Aborting hundreds of tasks isn't helpful.
I have only been able to get 12 of these. Some have been short; some are over 30 hrs. The short ones (unstable beam parameters) can be just as useful as the long ones. Any failures can be used to refine the setup and result in better performance for future tasks.


I have a buffer of 0.10 days, or 2.4 hours. Barely any buffer at all. But the ETA was only several seconds, so hundreds downloaded on a 32-thread system. I can't help that the project did not set realistic ETAs for the longer tasks. Instead of wasting my PC's time and electricity for them to just end up in an error state, I sent them back for someone else to hopefully complete. I ran my TEST for sixtrackTEST, so unless you have suggestions to fix the errors on the one PC (the other PCs are still running), ya can step down off your pedestal.
ID: 39572
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39573 - Posted: 11 Aug 2019, 6:22:41 UTC - in response to Message 39556.  

Extrapolating elapsed time against progress %, they will finish after 77.29, 81.89, 95.77 and 98.47 hours, if not killed by '197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED'.

After almost 2 days, the predictions are now 73.17, 74.46, 92.54 and 98.12 hours of runtime.
ID: 39573
Ray Murray
Volunteer moderator

Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 39574 - Posted: 11 Aug 2019, 9:04:33 UTC - in response to Message 39572.  
Last modified: 11 Aug 2019, 9:38:48 UTC

My apologies, Mmonnin, if you felt I was preaching. That was not my intention. I genuinely want to offer assistance where it is requested and where I am able to provide it.
When sixtrack tasks were scarce here, people would hoard more than their machines were capable of running before the deadline, and it seemed to take ages for those tasks to be reissued and returned. I have a similar buffer set, and I limit my machines (woefully inadequate compared to yours) to accept only 5 tasks at a time. Actually, your strategy of aborting all the spares is quite valid, as they will, as you say, have been reissued (it looks like the reissue-delay problem has been fixed).

With that one host hitting time-limit errors after less than 4 hrs, might there be something wrong with its benchmark, so that it gives a false estimate?
[I started writing this before you hid your computers, so it is no longer possible to compare how your other machines have fared, or for others more knowledgeable than me to offer insight.]

I ran one of the tasks you aborted, coincidentally on my only Linux machine; it finished successfully in 2½ days. Some have finished in tens of seconds, while others look like running to 3 days. With these new 10^7 tasks being potentially much longer than standard, there is clearly an issue with the runtime estimate, and actual runtimes varying from 30 seconds to 3 days only add to the confusion. Perhaps they should be confined to -dev until those issues are sorted out, as users here expect things to "just work".
ID: 39574
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1172
Credit: 54,708,888
RAC: 12,819
Message 39603 - Posted: 13 Aug 2019, 21:04:01 UTC

Maybe we need a deadline change for these, so that when a wingman doesn't even run the tasks we don't have to wait forever for them to time out and be resent, and then wait again.

I won't name any names, but I see several, and I guess it explains why they are "anonymous".
ID: 39603
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39604 - Posted: 14 Aug 2019, 6:02:29 UTC - in response to Message 39603.  

Maybe we need a deadline change for these, so that when a wingman doesn't even run the tasks we don't have to wait forever for them to time out and be resent, and then wait again.
Not such a good idea, I think, to reduce the 15-day deadline. My longest 10^7 task had a CPU time of 4 days, 2 hours, 26 minutes and 15 seconds.
Not all crunchers have dedicated LHC machines running 24/7 at 100% CPU.
A machine may only work during the daytime, may not be allowed to crunch while in use, and could have tasks from other projects too.
ID: 39604
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1172
Credit: 54,708,888
RAC: 12,819
Message 39613 - Posted: 14 Aug 2019, 19:49:28 UTC - in response to Message 39604.  

Maybe we need a deadline change for these, so that when a wingman doesn't even run the tasks we don't have to wait forever for them to time out and be resent, and then wait again.
Not such a good idea, I think, to reduce the 15-day deadline. My longest 10^7 task had a CPU time of 4 days, 2 hours, 26 minutes and 15 seconds.
Not all crunchers have dedicated LHC machines running 24/7 at 100% CPU.
A machine may only work during the daytime, may not be allowed to crunch while in use, and could have tasks from other projects too.


Well, I am only talking about these TEST units, and you don't get them unless you have your preferences set to do so.
THESE should not be run by members who aren't doing this 24/7, since they are "tests" and there are not very many when we do have them.

I have seen members with several of these test WUs who have not even contacted the server for the last 10 days.
WHY should it take 15 days when those of us who have the computers could do these FAST?

I guarantee that I saw many of these tests that will just hit that deadline and be sent back eventually.

BUT while you are here........ why do you suppose some members here get over 10,000 credits for a single-core Theory task the last couple of months?
And then others who do many more of these same tasks get 350 credits?

And as far as these test tasks go........ well, maybe they should all be over at -dev so they get completed in a day or two instead of not at all.
ID: 39613
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39614 - Posted: 15 Aug 2019, 8:03:33 UTC - in response to Message 39613.  

BUT while you are here........ why do you suppose some members here get over 10,000 credits for a single-core Theory task the last couple of months?
And then others who do many more of these same tasks get 350 credits?
Wrong thread. It's already being discussed in this Theory Application thread, where you were the last poster at the moment.
ID: 39614
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39649 - Posted: 19 Aug 2019, 9:21:11 UTC - in response to Message 39614.  

Hi,
sorry for the late reply - just back from vacation.

The 10^7-turn jobs are (or should be) sent out only on sixtracktest for the time being, due to their duration, so as not to interfere with regular production. We are planning to go into production with such a long span of beam dynamics using split jobs (e.g. 10^7 turns = 10 consecutive jobs of 10^6 turns each) instead of a single job.

I have asked the scientist submitting these jobs to proceed slowly, so as not to flood volunteers with such long jobs. The first batch of jobs was sent out with the usual
delay_bound
of ~1 week. We then increased the parameter to 2 weeks, in order to reduce the number of jobs killed because the deadline was not met (wasting useful resources). For instance, the four tasks reported by Crystal Pellet in this thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4296&postid=39537#39537

w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__6.1_8.1__7__70.5_1_sixvf_boinc106_2
w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__4.1_6.1__7__10.5_1_sixvf_boinc7_0
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__8.1_10.1__7__34.5_1_sixvf_boinc520_2
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__6.1_8.1__7__63_1_sixvf_boinc490_2

belong to this second batch.

Concerning the WU with the errors reported by mmonnin https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4296&postid=39551#39551 and maeax https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4296&postid=39535#39535:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=120080417
all tasks had a report deadline 2 weeks after the sent timestamp, and the two failing tasks failed well before that deadline.
For https://lhcathome.cern.ch/lhcathome/result.php?resultid=238806725, the issue seems to be related to the
rsc_fpops_bound
parameter, whereas for the other task https://lhcathome.cern.ch/lhcathome/result.php?resultid=238806726, the issue might be something different - maybe a transient problem with the HD: https://boinc.mundayweb.com/wiki/index.php?title=Process_exited_with_code_2_(0x2,_-254)

A.
ID: 39649
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39651 - Posted: 19 Aug 2019, 10:08:22 UTC - in response to Message 39649.  

We are planning to go into production with such a long span of beam dynamics using split jobs (e.g. 10^7 turns = 10 consecutive jobs of 10^6 turns each) instead of a single job.
If I understand that correctly, you have to wait for the result of the first job before you can create the 2nd consecutive job,
then wait for the return of the 2nd job before you can create the 3rd, and so on, up to the 10th and last job.
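
A toy sketch of that dependency chain (Python; all names are hypothetical), assuming each sub-job resumes from the end state returned by the previous one:

# 10 consecutive sub-jobs of 10^6 turns each = 10^7 turns in total.
# Sub-job i+1 cannot be created until the result of sub-job i is back,
# because it starts from the particle state sub-job i ended with.

def run_subjob(start_state: str, turns: int) -> str:
    """Stand-in for one BOINC task tracking `turns` turns."""
    return f"state after {turns} turns from ({start_state})"

state = "initial particle distribution"
for _ in range(10):
    state = run_subjob(state, 10**6)   # each step blocks on a returned result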
ID: 39651
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39653 - Posted: 19 Aug 2019, 10:56:55 UTC - in response to Message 39651.  

Exactly. Time-wise, the final results may come later than with single 10^7-turn jobs, but the overall BOINC and volunteer processing should be more efficient.
In addition, this option could give us the opportunity to resume any job/study from its ending point :)
ID: 39653
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39662 - Posted: 20 Aug 2019, 10:11:09 UTC - in response to Message 39551.  

I think that your host has been hit by some very short (successful) tasks (with basically no dynamic aperture - a perfectly physical case), which led the BOINC server to think that the host is super-fast.
The FPOPs in the error messages:
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 17457.49 (1920000000.00G/109981.44G)</message>
<stderr_txt>

</stderr_txt>
]]>

are too high to be real:
109981.44G


We had a similar issue in 2017:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4273
which led to updating the validator, but I think we need to fine-tune it even further.
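
A sketch of how a host's speed estimate can balloon, with assumed numbers (this simplifies BOINC's runtime-estimation feedback, but the effect is the same):

# Assumed figures, for illustration only.
rsc_fpops_est = 1.1e15     # ops the scheduler expects a task to need
actual_runtime_s = 10.0    # a near-instant, yet valid, result

# A few such results make the host look absurdly fast, so the
# elapsed-time limit for later, genuinely long tasks shrinks:
apparent_flops = rsc_fpops_est / actual_runtime_s
print(f"apparent speed: {apparent_flops / 1e9:.2f} GFLOPS")   # ~110,000 GFLOPS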

Thanks for the valuable feedback!
A.
ID: 39662
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1172
Credit: 54,708,888
RAC: 12,819
Message 39725 - Posted: 25 Aug 2019, 1:24:46 UTC

ID: 39725
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39734 - Posted: 26 Aug 2019, 7:20:22 UTC - in response to Message 39725.  

Thanks for pointing this out.

Not clear what happened (I cannot even see the owner of the machine) - it seems the machine did not even start the others...
ID: 39734