Thread 'Tasks run 4 days and finish with error'

Author	Message
NOGOOD Joined: 18 Nov 17 Posts: 135 Credit: 59,156,305 RAC: 342	Message 41653 - Posted: 20 Feb 2020, 9:03:57 UTC Hello. These tasks ran for 4 days and finished with an error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=263272995 https://lhcathome.cern.ch/lhcathome/result.php?resultid=263269392 https://lhcathome.cern.ch/lhcathome/result.php?resultid=263269875 https://lhcathome.cern.ch/lhcathome/result.php?resultid=263268089 Is this normal, or should I stop running Theory for some time?

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,100,748 RAC: 2,007	Message 41654 - Posted: 20 Feb 2020, 9:27:28 UTC - in response to Message 41653. It's normal for a task to be killed after 100 hours of elapsed time, to avoid endless running. The first 3 tasks you mention belong to the list I posted here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4979&postid=41650 The last task sometimes succeeds, but probably needed more time :(

CloverField Joined: 17 Oct 06 Posts: 99 Credit: 65,495,649 RAC: 14,361	Message 41960 - Posted: 19 Mar 2020, 23:30:38 UTC I've also got a bunch of these. Should I just let them run until they fail, or should I abort any task with an estimated time of 4 days?

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,100,748 RAC: 2,007	Message 41963 - Posted: 20 Mar 2020, 11:18:59 UTC - in response to Message 41960. "Should I just let them run until they fail, or should I abort any task with an estimated time of 4 days?" BOINC doesn't know how long the tasks will run. The 100 hours is just a placeholder to show something, and is in fact useless. To see whether a job is making real progress, highlight the task in BOINC Manager and click Show Graphics on the left. You need the VirtualBox Extension Pack installed for that.

CloverField Joined: 17 Oct 06 Posts: 99 Credit: 65,495,649 RAC: 14,361	Message 41964 - Posted: 20 Mar 2020, 14:44:01 UTC - in response to Message 41963. Looks like it's stuck in some sort of loop. It just keeps printing this output.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,100,748 RAC: 2,007	Message 41965 - Posted: 20 Mar 2020, 15:14:14 UTC - in response to Message 41964. It looks like it will soon be killed due to the time limit of 100 hours.

NOGOOD Joined: 18 Nov 17 Posts: 135 Credit: 59,156,305 RAC: 342	Message 41966 - Posted: 20 Mar 2020, 15:23:17 UTC - in response to Message 41965. "It looks like it will soon be killed due to the time limit of 100 hours." If tasks need more time to succeed, maybe we need a time limit of, for example, 200 hours? It is very sad to waste 4 days of computing.

Greger Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0	Message 41967 - Posted: 20 Mar 2020, 17:06:40 UTC - in response to Message 41966. And 4 days would be a waste if the task did not succeed in that time. It is a game of user patience: at what point do we reach our threshold for keeping them running? The limit could be extended indefinitely, but users would not accept that. When I run the native application, I normally set a limit of 7 days, no matter what stage the task reports. That is an arbitrary number, chosen so that I don't have to deal with whatever kind of job happens to be running. Running a script to abort known jobs that are doomed to fail is one way to deal with it; adding a blacklist is another. Most users would probably abort once a specific time is reached, and if it became too common they would uncheck the application. My view is that it would be better if these jobs went to Theory Beta, or were handled in a separate project such as LHC dev. They have a purpose for the project, but they give the whole Theory project a bad reputation. Sherpa is mostly known for this, but a whole range of this type of work has shown up as endless jobs. Some users would be open to opting in to these jobs, provided they can choose when and on which hardware to run them. Most users would not monitor each task or host daily or weekly.

CloverField Joined: 17 Oct 06 Posts: 99 Credit: 65,495,649 RAC: 14,361	Message 41970 - Posted: 21 Mar 2020, 1:17:53 UTC Yeah, it kinda feels like I'm wasting my time, since about a third of my tasks run for 4 days and then fail before completion.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1309 Credit: 97,216,314 RAC: 96,118	Message 41972 - Posted: 21 Mar 2020, 8:40:03 UTC - in response to Message 41653. "Hello. These tasks ran for 4 days and finished with an error: https://lhcathome.cern.ch/lhcathome/result.php?resultid=263272995 https://lhcathome.cern.ch/lhcathome/result.php?resultid=263269392 https://lhcathome.cern.ch/lhcathome/result.php?resultid=263269875 https://lhcathome.cern.ch/lhcathome/result.php?resultid=263268089 Is this normal, or should I stop running Theory for some time?" It is almost always because the Theory task is a sherpa. You can tell right at the start whether you got a sherpa by checking the VM Console when the task starts running; you can then just abort it and try again for a pythia task or one of the different versions of herwig Theory tasks.

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,100,748 RAC: 2,007	Message 41973 - Posted: 21 Mar 2020, 10:21:15 UTC Last modified: 21 Mar 2020, 10:22:12 UTC It's not known in advance how long a Theory job will run. If you don't have the time or don't want to babysit the jobs, and are worried about wasting 100 hours, you could consider reducing the 100-hour maximum run time. There is no guarantee that this means less waste. It's up to you. Make up your mind and do the math.
88.5% of the jobs are ready within 5 hours (18000 seconds)
96.3% of the jobs are ready within 10 hours (36000 seconds)
Disadvantage: you would kill some jobs that would normally succeed somewhere between 5/10 hours and 100 hours (extra waste of time).
Advantage: the 100+ hour and endless-running ones would be killed much earlier, so less time is wasted.
In the projects directory there is a file called Theory_2019_11_13a.xml. Change the value in <job_duration>360000</job_duration> to suit your needs. BOINC normally checks the file size, so if you shorten the value by one digit you have to add a character somewhere else (e.g. a space at the start of the line).
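
The size-preserving edit described in the post above could be sketched roughly as follows. This is only an illustration: the function name `set_job_duration` and the sample XML are made up here, and the sketch assumes the file is plain ASCII with a single `<job_duration>` element. It only handles values that are no longer than the old one, since a longer value would require removing characters elsewhere.

```python
import re


def set_job_duration(xml_text: str, new_seconds: int) -> str:
    """Replace the <job_duration> value and pad with spaces so that
    the overall text length (and thus the ASCII file size) is unchanged."""
    match = re.search(r"<job_duration>(\d+)</job_duration>", xml_text)
    if match is None:
        raise ValueError("no <job_duration> element found")

    old_value = match.group(1)
    new_value = str(new_seconds)
    if len(new_value) > len(old_value):
        raise ValueError("new value is longer than the old one")

    # Put the padding in front of the element, as the post suggests.
    padding = " " * (len(old_value) - len(new_value))
    replacement = f"{padding}<job_duration>{new_value}</job_duration>"
    return xml_text[:match.start()] + replacement + xml_text[match.end():]


if __name__ == "__main__":
    # Hypothetical file content, for illustration only.
    sample = "<vbox_job>\n  <job_duration>360000</job_duration>\n</vbox_job>\n"
    edited = set_job_duration(sample, 36000)  # 10 hours instead of 100
    print(edited)
    print(len(sample) == len(edited))
```

To apply it to the real file, one would read its contents, pass them through the function, and write the result back, ideally after making a backup copy.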

NOGOOD Joined: 18 Nov 17 Posts: 135 Credit: 59,156,305 RAC: 342	Message 41974 - Posted: 21 Mar 2020, 10:53:47 UTC - in response to Message 41972. "It is almost always because of the Theory task being a sherpa and you can tell right at the start if you got a sherpa by checking the VM Console" This parameter? https://yadi.sk/i/hJZadj_mOzkGXA

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,100,748 RAC: 2,007	Message 41975 - Posted: 21 Mar 2020, 11:04:16 UTC - in response to Message 41974. "This parameter? https://yadi.sk/i/hJZadj_mOzkGXA" Yes. Almost all of the erroneous tasks that are heading for an endless run use the Sherpa generator.

NOGOOD Joined: 18 Nov 17 Posts: 135 Credit: 59,156,305 RAC: 342	Message 41976 - Posted: 21 Mar 2020, 11:16:36 UTC - in response to Message 41975. "Yes. Almost all of the erroneous tasks heading to run endless are with the Sherpa generator." Thank you!

CloverField Joined: 17 Oct 06 Posts: 99 Credit: 65,495,649 RAC: 14,361	Message 41977 - Posted: 21 Mar 2020, 15:55:45 UTC All of my long runners are sherpa as well.

computezrmle Volunteer moderator Volunteer developer Volunteer tester Help desk expert Joined: 15 Jun 08 Posts: 2753 Credit: 303,791,589 RAC: 109,103	Message 41978 - Posted: 21 Mar 2020, 16:30:04 UTC - in response to Message 41977. Not all sherpas are bad by default. Instead, some of them have an excellent success ratio:
[pre]
run                                    events   attempts  success  failure  lost
pp jets 200 - - sherpa 2.2.8 default   2100000  25        21       1        3
pp jets 7000 - - sherpa 2.2.8 default  2157000  25        22       0        3
[/pre]
It may be a good idea to check the mcplots pages before cancelling a task: http://mcplots-dev.cern.ch/production.php?view=runs&rev=2363&display=all Based on the example tasks from your post, you can filter the list using "ee zhad 206 - - sherpa 1.3.1 default". Then decide whether the success ratio is good enough to give the running task a chance.
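
As a small worked example, the success ratio for the two runs quoted in the post above (25 attempts with 21 and 22 successes) can be computed as below. The helper name `success_ratio` is made up for illustration, and counting "lost" attempts as non-successes is an assumption of this sketch, not something the mcplots pages are known to do.

```python
def success_ratio(attempts: int, successes: int) -> float:
    """Fraction of attempts that ended in success.

    Failed and lost attempts both count as non-successes here (an assumption).
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


# (attempts, successes) copied from the table in the post.
runs = {
    "pp jets 200 - - sherpa 2.2.8 default": (25, 21),
    "pp jets 7000 - - sherpa 2.2.8 default": (25, 22),
}

for name, (attempts, successes) in runs.items():
    print(f"{name}: {success_ratio(attempts, successes):.0%}")
```

For these two runs the ratios come out at 84% and 88%.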

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,100,748 RAC: 2,007	Message 41979 - Posted: 21 Mar 2020, 17:41:10 UTC - in response to Message 41978. "Not all sherpas are bad by default. Instead, some of them have an excellent success ratio . . ." Most of them have a successful result. Of the 2141 known parameter combinations with sherpa as generator, 1634 have at least 1 success, so 507 have no valid result so far. Compared with the figures from last time (maybe 10 days ago), 2 sherpas turned from no success to at least 1 success:
pp jets 7000 40,-,810 - sherpa 2.2.1 default
ppbar jets 1960 17 - sherpa 2.2.5 default

NOGOOD Joined: 18 Nov 17 Posts: 135 Credit: 59,156,305 RAC: 342	Message 41980 - Posted: 21 Mar 2020, 17:45:04 UTC - in response to Message 41973. Last modified: 21 Mar 2020, 17:49:27 UTC "It's not known in advance how long a Theory job will run. If you don't have the time or don't want to babysit the jobs, and are worried about wasting 100 hours, you could consider reducing the 100-hour maximum run time. There is no guarantee that this means less waste. It's up to you. Make up your mind and do the math. 88.5% of the jobs are ready within 5 hours (18000 seconds); 96.3% of the jobs are ready within 10 hours (36000 seconds). Disadvantage: you would kill some jobs that would normally succeed somewhere between 5/10 hours and 100 hours (extra waste of time). Advantage: the 100+ hour and endless-running ones would be killed much earlier, so less time is wasted. In the projects directory there is a file called Theory_2019_11_13a.xml. Change the value in <job_duration>360000</job_duration> to suit your needs. BOINC normally checks the file size, so if you shorten the value by one digit you have to add a character somewhere else (e.g. a space at the start of the line)." I would like to increase the run time limit, not reduce it, in order to let normally running tasks succeed. Do we know the maximum number of hours a normally running task may need?

Crystal Pellet Volunteer moderator Volunteer tester Joined: 14 Jan 10 Posts: 1554 Credit: 10,100,748 RAC: 2,007	Message 41981 - Posted: 21 Mar 2020, 20:29:37 UTC - in response to Message 41980. "I would like to increase the run time limit, not reduce it, in order to let normally running tasks succeed. Do we know the maximum number of hours a normally running task may need?" No, we don't know. But if you want to monitor the chance of success during run time, you could remove the job_duration line from the XML file mentioned before. Also add a line to the options section of cc_config.xml: <dont_check_file_sizes>1</dont_check_file_sizes> You then get endless tasks where you have to decide for yourself: give it a chance or abort.
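
The cc_config.xml change mentioned in the post above could be made by hand, or sketched in Python roughly as follows. The function name `enable_dont_check_file_sizes` is made up for illustration, and the sketch assumes the file has a `<cc_config>` root element with an optional `<options>` child. Note that parsing with `xml.etree.ElementTree` drops any XML comments in the file.

```python
import xml.etree.ElementTree as ET


def enable_dont_check_file_sizes(config_xml: str) -> str:
    """Return cc_config.xml content with
    <dont_check_file_sizes>1</dont_check_file_sizes> set inside <options>."""
    root = ET.fromstring(config_xml)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")

    # Create the <options> section if the file does not have one yet.
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")

    # Reuse an existing flag element if present, otherwise add it.
    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(enable_dont_check_file_sizes("<cc_config><options></options></cc_config>"))
```

The function works on the file content as a string, so an existing cc_config.xml would need to be read first and written back afterwards; keeping a backup copy is advisable.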

CloverField Joined: 17 Oct 06 Posts: 99 Credit: 65,495,649 RAC: 14,361	Message 42017 - Posted: 1 Apr 2020, 0:20:53 UTC - in response to Message 41981. I don't think that's going to help; it looks like it will take 6000+ days for these tasks to finish. All of my long-running jobs are still sherpas.