Message boards : Sixtrack Application : SIXTRACKTEST

Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39537 - Posted: 9 Aug 2019, 9:07:56 UTC

@Alessio Mereghetti:

Some long-runners in between these sixtrack test tasks. I have running

w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__6.1_8.1__7__70.5_1_sixvf_boinc106_2
w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__4.1_6.1__7__10.5_1_sixvf_boinc7_0
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__8.1_10.1__7__34.5_1_sixvf_boinc520_2
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__6.1_8.1__7__63_1_sixvf_boinc490_2

On another machine, I aborted three of that type with 10^7 turns.
ID: 39537
maeax

Joined: 2 May 07
Posts: 2242
Credit: 173,897,779
RAC: 2,820
Message 39538 - Posted: 9 Aug 2019, 9:44:21 UTC - in response to Message 39537.  

Hi Crystal,
sixtracktest has been finishing all tasks so far on this computer:
https://lhcathome.cern.ch/lhcathome/results.php?hostid=10592132&offset=0&show_names=0&state=0&appid=10
ID: 39538
mmonnin

Joined: 22 Mar 17
Posts: 62
Credit: 14,576,403
RAC: 10,212
Message 39551 - Posted: 9 Aug 2019, 21:59:49 UTC - in response to Message 39535.  

Have a computer with only sixtracktest for the moment:
The wingman had this error:
exceeded elapsed time limit 13480.39 (1920000000.00G/109981.44G)
https://lhcathome.cern.ch/lhcathome/result.php?resultid=238806725


I have a lot of these today on one PC. Some complete, some error out at the same time. Most of the successful ones complete in a couple of seconds.
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 17457.49 (1920000000.00G/109981.44G)</message>
<stderr_txt>

</stderr_txt>
]]>
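
For reference, the limit in that message is just the task's rsc_fpops_bound divided by the server's speed estimate for the host. A minimal sketch of the arithmetic (Python; both numbers are taken from the error message above):

# BOINC aborts a task once its elapsed time exceeds rsc_fpops_bound
# divided by the estimated host speed.
rsc_fpops_bound = 1.92e18        # "1920000000.00G" floating-point ops
est_host_flops = 1.0998144e14    # "109981.44G" ops/sec (clearly inflated)

limit_s = rsc_fpops_bound / est_host_flops
print(f"elapsed time limit: {limit_s:.2f} s")   # ~17457.49 s, as reported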

They quickly progress to about 50%, then drop back to 0%, and then progress very slowly. I dumped the rest on that PC.
ID: 39551
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39556 - Posted: 10 Aug 2019, 7:54:35 UTC - in response to Message 39537.  

w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__6.1_8.1__7__63_1_sixvf_boinc490_2
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__8.1_10.1__7__34.5_1_sixvf_boinc520_2
w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__6.1_8.1__7__70.5_1_sixvf_boinc106_2
w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__4.1_6.1__7__10.5_1_sixvf_boinc7_0
The 4 tasks mentioned before have now been running for 24 hours and 42 minutes.
Extrapolating elapsed time against progress %, they will finish after 77.29, 81.89, 95.77 and 98.47 hours, if not killed by '197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED'.
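
A minimal sketch of that linear extrapolation (Python; the progress fractions are illustrative, back-computed from the predictions - only the elapsed time is taken from the post):

# Predicted total runtime = elapsed time / progress fraction.
elapsed_h = 24 + 42 / 60          # 24 hours and 42 minutes

# Hypothetical progress fractions for the four tasks (illustration only)
for progress in (0.3196, 0.3016, 0.2579, 0.2508):
    print(f"{progress:.1%} done -> ~{elapsed_h / progress:.2f} h total")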
ID: 39556
Ray Murray
Volunteer moderator

Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 39559 - Posted: 10 Aug 2019, 8:36:21 UTC - in response to Message 39551.  
Last modified: 10 Aug 2019, 8:39:49 UTC

The clue might be in the name: sixtrackTEST. Jobs may therefore not work as expected, or even at all. If you only want to run reliable jobs, then uncheck the "Run test applications" box. You might also consider turning down your work buffer so you don't get so many. Aborting hundreds of tasks isn't helpful.
I have only been able to get 12 of these. Some have been short; some are over 30 hrs. The short ones (unstable beam parameters) can be just as useful as the long ones. Any failures can be used to refine the setup and result in better performance for future tasks.
ID: 39559
Profile Robert Pick

Joined: 1 Dec 05
Posts: 62
Credit: 11,441,610
RAC: 0
Message 39565 - Posted: 10 Aug 2019, 18:18:19 UTC

I received 8 Sixtracktest WUs. I also see that some users on this message page have long units of 30-100 hrs. The 8 units I have are next in line to start. All have a guesstimate of 7d 17:08:55. That is a far cry from what others claim! I've had a few of these last month and none of them lasted that long. I hope that holds true for these!!! Pick
ID: 39565
Profile Robert Pick

Joined: 1 Dec 05
Posts: 62
Credit: 11,441,610
RAC: 0
Message 39567 - Posted: 10 Aug 2019, 19:02:25 UTC

Well, one unit ran for 17 sec, another for 47 sec. Maybe the other 4 will be under the time limit! Pick
ID: 39567
mmonnin

Joined: 22 Mar 17
Posts: 62
Credit: 14,576,403
RAC: 10,212
Message 39572 - Posted: 10 Aug 2019, 23:17:49 UTC - in response to Message 39559.  

The clue might be in the name: sixtrackTEST. Jobs may therefore not work as expected, or even at all. If you only want to run reliable jobs, then uncheck the "Run test applications" box. You might also consider turning down your work buffer so you don't get so many. Aborting hundreds of tasks isn't helpful.
I have only been able to get 12 of these. Some have been short; some are over 30 hrs. The short ones (unstable beam parameters) can be just as useful as the long ones. Any failures can be used to refine the setup and result in better performance for future tasks.


I have a buffer of 0.10 days, or 2.4 hours. Barely any buffer at all. But the ETA was only several seconds, so hundreds downloaded on a 32-thread system. I can't help that the project did not set realistic ETAs for the longer tasks. Instead of wasting my PC's time and electricity for them to just end up in an error state, I sent them back for someone else to hopefully complete. I ran my TEST for sixtrackTEST, so unless you have suggestions to fix the errors on the one PC (the other PCs are still running), ya can step down off your pedestal.
ID: 39572
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39573 - Posted: 11 Aug 2019, 6:22:41 UTC - in response to Message 39556.  

Extrapolating elapsed time against progress %, they will finish after 77.29, 81.89, 95.77 and 98.47 hours, if not killed by '197 (0x000000C5) EXIT_TIME_LIMIT_EXCEEDED'.

After almost 2 days, the predictions are now 73.17, 74.46, 92.54 and 98.12 hours of runtime.
ID: 39573
Ray Murray
Volunteer moderator

Joined: 29 Sep 04
Posts: 281
Credit: 11,866,264
RAC: 0
Message 39574 - Posted: 11 Aug 2019, 9:04:33 UTC - in response to Message 39572.  
Last modified: 11 Aug 2019, 9:38:48 UTC

My apologies, Mmonnin, if you felt I was preaching. That was not my intention. I genuinely want to offer assistance where it is requested and where I am able to provide it.
When sixtrack tasks were scarce here, people would hoard more than their machines were capable of running before the deadline, and it seemed to take ages for those tasks to be reissued and returned. I have a similar buffer set, and I limit my machines (woefully inadequate compared to yours) to accept only 5 tasks at a time. Actually, your strategy of aborting all the spares is quite valid, as they will, as you say, have been reissued (it looks like the reissue-delay problem has been fixed).

With that one host hitting time-limit errors after less than 4 hrs, might there be something wrong with its benchmark, so that it gives a false estimate?
[I started writing this before you hid your computers, so it is no longer possible to compare how your other machines have fared, or for others more knowledgeable than me to offer insight.]

I ran one of the tasks you aborted, coincidentally on my only Linux machine; it finished successfully in 2½ days. Some have finished in tens of seconds, while others look like running to 3 days. With these new 10^7 tasks being potentially much longer than standard, there is clearly an issue with the runtime estimate, and actual runtimes varying from 30 seconds to 3 days only add to the confusion. Perhaps they should be confined to -dev until those issues are sorted out, as users here expect things to "just work".
ID: 39574
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1172
Credit: 54,708,888
RAC: 12,819
Message 39603 - Posted: 13 Aug 2019, 21:04:01 UTC

Maybe we need a deadline change for these, so that when a wingman doesn't even run the tasks we don't have to wait forever for them to time out and be resent, and then wait again.

I won't name any names, but I see several, and I guess it explains why they are "anonymous".
ID: 39603
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39604 - Posted: 14 Aug 2019, 6:02:29 UTC - in response to Message 39603.  

Maybe we need a deadline change for these, so that when a wingman doesn't even run the tasks we don't have to wait forever for them to time out and be resent, and then wait again.
Not such a good idea, I think, to reduce the 15-day deadline. My longest 10^7 task had a CPU time of 4 days, 2 hours, 26 minutes and 15 seconds.
Not all crunchers have dedicated LHC machines running 24/7 at 100% CPU.
A machine may only work during the daytime, may not be allowed to crunch while in use, and could have tasks from other projects too.
ID: 39604
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1172
Credit: 54,708,888
RAC: 12,819
Message 39613 - Posted: 14 Aug 2019, 19:49:28 UTC - in response to Message 39604.  

Maybe we need a deadline change for these, so that when a wingman doesn't even run the tasks we don't have to wait forever for them to time out and be resent, and then wait again.
Not such a good idea, I think, to reduce the 15-day deadline. My longest 10^7 task had a CPU time of 4 days, 2 hours, 26 minutes and 15 seconds.
Not all crunchers have dedicated LHC machines running 24/7 at 100% CPU.
A machine may only work during the daytime, may not be allowed to crunch while in use, and could have tasks from other projects too.


Well, I am only talking about these TEST units, and you don't get them unless you have your preferences set to do so.
THESE should not be run by members who aren't doing this 24/7, since they are "tests" and there are not very many when we do have them.

I have seen members with several of these test WUs who have not even contacted the server for the last 10 days.
WHY should it take 15 days when those of us who have the computers could do these FAST?

I guarantee that I saw many of these tests that will just hit that deadline and be sent back eventually.

BUT while you are here........ why do you suppose some members here get over 10,000 credits for a single-core Theory task the last couple of months?
And then others who do many more of these same tasks get 350 credits?

And as far as these test tasks go........ well, maybe they should all be over at -dev so they get completed in a day or two instead of not at all.
ID: 39613
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39614 - Posted: 15 Aug 2019, 8:03:33 UTC - in response to Message 39613.  

BUT while you are here........ why do you suppose some members here get over 10,000 credits for a single-core Theory task the last couple of months?
And then others who do many more of these same tasks get 350 credits?
Wrong thread. It's already being discussed in this Theory Application thread, where you were the last poster at the moment.
ID: 39614
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39649 - Posted: 19 Aug 2019, 9:21:11 UTC - in response to Message 39614.  

Hi,
sorry for the late reply - just back from vacation.

The 10^7-turn jobs are (or should be) sent out only on sixtracktest for the time being, due to their duration, so as not to interfere with regular production. We are planning to go into production with such a long span of beam dynamics using split jobs (e.g. 10^7 turns = 10 consecutive jobs of 10^6 turns each) instead of a single job.

I have asked the scientist submitting these jobs to proceed slowly, so as not to flood volunteers with such long jobs. The first batch of jobs was sent out with the usual
delay_bound
of ~1 week. We then increased the parameter to 2 weeks, in order to reduce the number of jobs killed because the deadline was not met (wasting useful resources). For instance, the four tasks reported by Crystal Pellet in this thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4296&postid=39537#39537

w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__6.1_8.1__7__70.5_1_sixvf_boinc106_2
w-c6_job.B1topenergy.b6onIRoff_c6.2052__1__s__62.31_60.32__4.1_6.1__7__10.5_1_sixvf_boinc7_0
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__8.1_10.1__7__34.5_1_sixvf_boinc520_2
w-c2_job.B2topenergy.b6onIRon_c2.2052__1__s__62.31_60.32__6.1_8.1__7__63_1_sixvf_boinc490_2

belong to this second batch.

Concerning the WU with the errors reported by mmonnin https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4296&postid=39551#39551 and maeax https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4296&postid=39535#39535:
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=120080417
all tasks had a report deadline 2 weeks after the sent timestamp, and the two failing tasks failed well before that deadline.
For https://lhcathome.cern.ch/lhcathome/result.php?resultid=238806725, the issue seems to be related to the
rsc_fpops_bound
parameter, whereas for the other task https://lhcathome.cern.ch/lhcathome/result.php?resultid=238806726, the issue might be something different - maybe a transient problem with the HD: https://boinc.mundayweb.com/wiki/index.php?title=Process_exited_with_code_2_(0x2,_-254)

A.
ID: 39649
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1417
Credit: 9,441,018
RAC: 1,047
Message 39651 - Posted: 19 Aug 2019, 10:08:22 UTC - in response to Message 39649.  

We are planning to go into production with such a long span of beam dynamics using split jobs (e.g. 10^7 turns = 10 consecutive jobs of 10^6 turns each) instead of a single job.
If I understand that correctly, you have to wait for the result of the first job before you can create the 2nd consecutive job,
then wait for the return of the 2nd job before you can create the 3rd, and so on, up to the 10th and last job.
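
A toy sketch of that dependency chain (Python; all names are hypothetical), assuming each sub-job resumes from the end state returned by the previous one:

# 10 consecutive sub-jobs of 10^6 turns each = 10^7 turns in total.
# Sub-job i+1 cannot be created until the result of sub-job i is back,
# because it starts from the particle state sub-job i ended with.

def run_subjob(start_state: str, turns: int) -> str:
    """Stand-in for one BOINC task tracking `turns` turns."""
    return f"state after {turns} turns from ({start_state})"

state = "initial particle distribution"
for _ in range(10):
    state = run_subjob(state, 10**6)   # each step blocks on a returned result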
ID: 39651
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39653 - Posted: 19 Aug 2019, 10:56:55 UTC - in response to Message 39651.  

Exactly. Time-wise, the final results may come later than with single 10^7-turn jobs, but the overall BOINC and volunteer processing should be more efficient.
In addition, this option could give us the opportunity to resume any job/study from its ending point :)
ID: 39653
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39662 - Posted: 20 Aug 2019, 10:11:09 UTC - in response to Message 39551.  

I think that your host has been hit by some very short (successful) tasks (with basically no dynamic aperture - a perfectly physical case), which led the BOINC server to think that the host is super-fast.
The FPOPs in the error messages:
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 17457.49 (1920000000.00G/109981.44G)</message>
<stderr_txt>

</stderr_txt>
]]>

are too high to be real:
109981.44G


We had a similar issue in 2017:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4273
which led to updating the validator, but I think we need to fine-tune it even further.
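
A sketch of how a host's speed estimate can balloon, with assumed numbers (this simplifies BOINC's runtime-estimation feedback, but the effect is the same):

# Assumed figures, for illustration only.
rsc_fpops_est = 1.1e15     # ops the scheduler expects a task to need
actual_runtime_s = 10.0    # a near-instant, yet valid, result

# A few such results make the host look absurdly fast, so the
# elapsed-time limit for later, genuinely long tasks shrinks:
apparent_flops = rsc_fpops_est / actual_runtime_s
print(f"apparent speed: {apparent_flops / 1e9:.2f} GFLOPS")   # ~110,000 GFLOPS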

Thanks for the valuable feedback!
A.
ID: 39662
Magic Quantum Mechanic

Joined: 24 Oct 04
Posts: 1172
Credit: 54,708,888
RAC: 12,819
Message 39725 - Posted: 25 Aug 2019, 1:24:46 UTC

ID: 39725
Alessio Mereghetti
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 29 Feb 16
Posts: 157
Credit: 2,659,975
RAC: 0
Message 39734 - Posted: 26 Aug 2019, 7:20:22 UTC - in response to Message 39725.  

Thanks for pointing this out.

Not clear what happened (I cannot even see the owner of the machine) - it seems the machine did not even start the others...
ID: 39734