Message boards : Theory Application : Tasks run 4 days and finish with error
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 84
Credit: 28,567,224
RAC: 16,142
Message 41653 - Posted: 20 Feb 2020, 9:03:57 UTC

Hello.

These tasks ran for 4 days and finished with an error:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=263272995
https://lhcathome.cern.ch/lhcathome/result.php?resultid=263269392
https://lhcathome.cern.ch/lhcathome/result.php?resultid=263269875
https://lhcathome.cern.ch/lhcathome/result.php?resultid=263268089

Is this normal, or should I stop running Theory for some time?
ID: 41653 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 934
Credit: 6,286,290
RAC: 664
Message 41654 - Posted: 20 Feb 2020, 9:27:28 UTC - in response to Message 41653.  

It's normal for a task to be killed after 100 hours of elapsed time; the limit is there to stop tasks from running endlessly.

The first 3 mentioned tasks belong to the list I mentioned here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4979&postid=41650

The last task sometimes succeeds, but probably needed more time :(
ID: 41654 · Report as offensive     Reply Quote
CloverField

Send message
Joined: 17 Oct 06
Posts: 39
Credit: 9,662,936
RAC: 27,328
Message 41960 - Posted: 19 Mar 2020, 23:30:38 UTC

I've also got a bunch of these.

Should I just let them run until they fail, or should I abort any task with an estimated time of 4 days?
ID: 41960 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 934
Credit: 6,286,290
RAC: 664
Message 41963 - Posted: 20 Mar 2020, 11:18:59 UTC - in response to Message 41960.  

Should I just let them run until they fail, or should I abort any task with an estimated time of 4 days?
BOINC doesn't know how long the tasks will run. The 100 hours is just a placeholder so that something is shown; as an estimate it is useless.
To see whether a job is making real progress, highlight the task in BOINC Manager and click Show Graphics on the left.
You need the VirtualBox Extension Pack installed for that.
ID: 41963 · Report as offensive     Reply Quote
CloverField

Send message
Joined: 17 Oct 06
Posts: 39
Credit: 9,662,936
RAC: 27,328
Message 41964 - Posted: 20 Mar 2020, 14:44:01 UTC - in response to Message 41963.  

Looks like it's stuck in some sort of loop. It just keeps printing this output.

ID: 41964 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 934
Credit: 6,286,290
RAC: 664
Message 41965 - Posted: 20 Mar 2020, 15:14:14 UTC - in response to Message 41964.  

It looks like it will soon be killed due to the time limit of 100 hours.
ID: 41965 · Report as offensive     Reply Quote
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 84
Credit: 28,567,224
RAC: 16,142
Message 41966 - Posted: 20 Mar 2020, 15:23:17 UTC - in response to Message 41965.  

It looks like it will soon be killed due to the time limit of 100 hours.

If tasks need more time to succeed, maybe we need a time limit of, for example, 200 hours?
It is very sad to waste 4 days of computing.
ID: 41966 · Report as offensive     Reply Quote
Gunde

Send message
Joined: 9 Jan 15
Posts: 83
Credit: 331,221,494
RAC: 233,717
Message 41967 - Posted: 20 Mar 2020, 17:06:40 UTC - in response to Message 41966.  

And 4 days would be a waste if it did not succeed in that time.

It is a game of user patience: at what point do we reach our threshold for keeping them running? The limit could be extended to never-ending, but users would not accept that.
When I run the native application I normally set 7 days, no matter what stage the task reports. That is an arbitrary number, chosen so that I do not have to deal with whichever of these jobs it happens to pick up.

Running a script to abort known jobs that are doomed to fail is one way to deal with it; adding a blacklist is another.
Most users would probably abort once a specific run time is reached, and if it became too common they would uncheck the application.

My view is that it would be better if these jobs fell into Theory Beta, or were handled in a separate project such as LHC dev. They have a purpose for the project, but they give a bad experience of the whole Theory project.
Sherpa is mostly known for this, but a whole range of these types of work has shown up as endless jobs.
Some users would be open to opting in to these jobs, but would want to choose when and on which hardware. Most users do not monitor each task or host daily or even weekly.
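
One way to read the script-plus-blacklist idea above is sketched below in Python. Everything in it is an assumption for illustration: the helper names are made up, the single blacklist entry is just a run description quoted elsewhere in this thread (not a confirmed doomed job), and the task name in the usage note is hypothetical. How the run description of a running task is obtained is not covered here; it is assumed you read it from the VM console. The only external piece is the `boinccmd --task <url> <task_name> abort` command line, which the sketch builds but does not execute.

```python
from typing import List, Set


def normalize(run: str) -> str:
    """Collapse whitespace and lowercase a run description for comparison."""
    return " ".join(run.split()).lower()


# Illustrative entry only; fill this with runs you have decided not to process.
BLACKLIST: Set[str] = {normalize("ee zhad 206 - - sherpa 1.3.1 default")}


def is_blacklisted(run: str, blacklist: Set[str] = BLACKLIST) -> bool:
    """Return True if the run description matches a blacklist entry."""
    return normalize(run) in blacklist


def abort_command(project_url: str, task_name: str) -> List[str]:
    """Build (but do not run) the boinccmd call that aborts one task."""
    return ["boinccmd", "--task", project_url, task_name, "abort"]
```

With a hypothetical task name, `abort_command("https://lhcathome.cern.ch/lhcathome/", "Theory_example_task_0")` returns an argument list that could be handed to `subprocess.run` once `is_blacklisted(...)` says the run is unwanted.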
ID: 41967 · Report as offensive     Reply Quote
CloverField

Send message
Joined: 17 Oct 06
Posts: 39
Credit: 9,662,936
RAC: 27,328
Message 41970 - Posted: 21 Mar 2020, 1:17:53 UTC

Yeah, it kind of feels like I'm wasting my time, since about a third of my tasks run for 4 days and then fail before completion.
ID: 41970 · Report as offensive     Reply Quote
Profile MAGIC Quantum Mechanic
Avatar

Send message
Joined: 24 Oct 04
Posts: 920
Credit: 39,451,311
RAC: 9,251
Message 41972 - Posted: 21 Mar 2020, 8:40:03 UTC - in response to Message 41653.  

Hello.

These tasks ran for 4 days and finished with an error:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=263272995
https://lhcathome.cern.ch/lhcathome/result.php?resultid=263269392
https://lhcathome.cern.ch/lhcathome/result.php?resultid=263269875
https://lhcathome.cern.ch/lhcathome/result.php?resultid=263268089

Is this normal, or should I stop running Theory for some time?


It is almost always because the Theory task is a *sherpa*. You can tell right at the start whether you got a sherpa by checking the VM Console when the task starts running; you can then just abort it and try again for a *pythia* task or one of the different versions of *herwig* Theory tasks.
ID: 41972 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 934
Credit: 6,286,290
RAC: 664
Message 41973 - Posted: 21 Mar 2020, 10:21:15 UTC
Last modified: 21 Mar 2020, 10:22:12 UTC

It's not known in advance how long a Theory job will run.
If you don't have the time or don't want to babysit the jobs, and you worry about wasting 100 hours, you could consider reducing the 100-hour maximum run time.
There is no guarantee that this wastes less. It's up to you. Make up your mind and do the math.

88.5% of the jobs finish within 5 hours (18000 seconds)
96.3% of the jobs finish within 10 hours (36000 seconds)

Disadvantage: you would kill some jobs that would normally succeed somewhere between 5/10 hours and 100 hours (extra wasted time).
Advantage: the jobs that run over 100 hours or endlessly would be killed much earlier, so less time is wasted.

In the projects directory there is a file called Theory_2019_11_13a.xml.
Change the value in <job_duration>360000</job_duration> to suit your needs.
BOINC normally checks the file size, so when you shorten the value by 1 digit you have to add a character somewhere else (e.g. a space at the front of the line).
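
The size-preserving edit described above can be sketched in Python. This is a minimal sketch under assumptions: the function name is made up, the `<example_root>` wrapper in the sample text is hypothetical (only the `<job_duration>` line is taken from this post), and the padding is placed as spaces directly in front of the element. It handles shortening only, since padding cannot compensate for a longer value.

```python
import re

JOB_DURATION = re.compile(r"<job_duration>(\d+)</job_duration>")


def set_job_duration(xml_text: str, seconds: int) -> str:
    """Return xml_text with a new job_duration value and an unchanged length."""
    match = JOB_DURATION.search(xml_text)
    if match is None:
        raise ValueError("no <job_duration> element found")
    old_value = match.group(1)
    new_value = str(seconds)
    # Digits and spaces are one byte each, so equal character counts
    # mean equal file sizes.
    pad = len(old_value) - len(new_value)
    if pad < 0:
        raise ValueError("new value is longer than the old one")
    replacement = " " * pad + f"<job_duration>{new_value}</job_duration>"
    return xml_text[:match.start()] + replacement + xml_text[match.end():]


# 100 hours = 360000 s (the default quoted above); 10 hours = 36000 s.
sample = (
    "<example_root>\n"
    "    <job_duration>360000</job_duration>\n"
    "</example_root>\n"
)
edited = set_job_duration(sample, 10 * 3600)
```

To apply this to the real file, read and write it in a way that preserves line endings (for example `open(path, newline="")`), preferably while the BOINC client is not running; both points are assumptions rather than something stated in this thread.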
ID: 41973 · Report as offensive     Reply Quote
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 84
Credit: 28,567,224
RAC: 16,142
Message 41974 - Posted: 21 Mar 2020, 10:53:47 UTC - in response to Message 41972.  

It is almost always because of the Theory task being a *sherpa* and you can tell right at the start if you got a sherpa by checking the VM Console

This parameter?
https://yadi.sk/i/hJZadj_mOzkGXA
ID: 41974 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 934
Credit: 6,286,290
RAC: 664
Message 41975 - Posted: 21 Mar 2020, 11:04:16 UTC - in response to Message 41974.  

This parameter?
https://yadi.sk/i/hJZadj_mOzkGXA
Yes. Almost all of the erroneous tasks that are heading towards running endlessly use the Sherpa generator.
ID: 41975 · Report as offensive     Reply Quote
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 84
Credit: 28,567,224
RAC: 16,142
Message 41976 - Posted: 21 Mar 2020, 11:16:36 UTC - in response to Message 41975.  

Yes. Almost all of the erroneous tasks that are heading towards running endlessly use the Sherpa generator.

Thank you!
ID: 41976 · Report as offensive     Reply Quote
CloverField

Send message
Joined: 17 Oct 06
Posts: 39
Credit: 9,662,936
RAC: 27,328
Message 41977 - Posted: 21 Mar 2020, 15:55:45 UTC

All of my long runners are sherpa as well.
ID: 41977 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Avatar

Send message
Joined: 15 Jun 08
Posts: 1428
Credit: 73,025,506
RAC: 105,776
Message 41978 - Posted: 21 Mar 2020, 16:30:04 UTC - in response to Message 41977.  

Not all sherpas are bad by default.
Instead, some of them have an excellent success ratio:
run                                    events   attempts  success  failure  lost
pp jets 200 - - sherpa 2.2.8 default   2100000  25        21       1        3
pp jets 7000 - - sherpa 2.2.8 default  2157000  25        22       0        3

It may be a good idea to check the mcplots pages before cancelling a task:
http://mcplots-dev.cern.ch/production.php?view=runs&rev=2363&display=all

Based on the example tasks from your post you may filter the list using "ee zhad 206 - - sherpa 1.3.1 default".
Then decide whether the success ratio is good enough for you to give the running task a chance.
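
As a rough illustration of that decision, the two table rows above can be turned into success ratios with a few lines of Python. The class and function names are made up for this sketch, and the 0.5 threshold is an arbitrary placeholder, not a recommendation; the row layout is assumed to be a run description followed by five numeric columns, as in the table above.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    run: str
    events: int
    attempts: int
    success: int
    failure: int
    lost: int

    @property
    def success_ratio(self) -> float:
        """Share of attempts that succeeded (0.0 if nothing was attempted)."""
        return self.success / self.attempts if self.attempts else 0.0


def parse_row(line: str) -> RunStats:
    """Split a table row: the last five fields are numbers, the rest is the run."""
    *description, events, attempts, success, failure, lost = line.split()
    return RunStats(" ".join(description), int(events), int(attempts),
                    int(success), int(failure), int(lost))


def worth_a_chance(stats: RunStats, min_ratio: float = 0.5) -> bool:
    """Placeholder rule: keep the task if enough earlier attempts succeeded."""
    return stats.success_ratio >= min_ratio


rows = [
    "pp jets 200 - - sherpa 2.2.8 default   2100000  25  21  1  3",
    "pp jets 7000 - - sherpa 2.2.8 default  2157000  25  22  0  3",
]
stats = [parse_row(row) for row in rows]
```

For these two rows the ratios are 21/25 and 22/25, so both pass any threshold up to 0.84.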
ID: 41978 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 934
Credit: 6,286,290
RAC: 664
Message 41979 - Posted: 21 Mar 2020, 17:41:10 UTC - in response to Message 41978.  

Not all sherpas are bad by default.
Instead, some of them have an excellent success ratio . . .
Most of them have a successful result. Of the 2141 known parameter combinations with sherpa as generator, 1634 have at least 1 success.
So 507 have no valid result so far. Compared with the figures from last time (maybe 10 days ago), 2 sherpas turned from no success to at least 1 success:
pp jets 7000 40,-,810 - sherpa 2.2.1 default
ppbar jets 1960 17 - sherpa 2.2.5 default
ID: 41979 · Report as offensive     Reply Quote
NOGOOD

Send message
Joined: 18 Nov 17
Posts: 84
Credit: 28,567,224
RAC: 16,142
Message 41980 - Posted: 21 Mar 2020, 17:45:04 UTC - in response to Message 41973.  
Last modified: 21 Mar 2020, 17:49:27 UTC

It's not known in advance how long a Theory job will run.
If you don't have the time or don't want to babysit the jobs, and you worry about wasting 100 hours, you could consider reducing the 100-hour maximum run time.
There is no guarantee that this wastes less. It's up to you. Make up your mind and do the math.

88.5% of the jobs finish within 5 hours (18000 seconds)
96.3% of the jobs finish within 10 hours (36000 seconds)

Disadvantage: you would kill some jobs that would normally succeed somewhere between 5/10 hours and 100 hours (extra wasted time).
Advantage: the jobs that run over 100 hours or endlessly would be killed much earlier, so less time is wasted.

In the projects directory there is a file called Theory_2019_11_13a.xml.
Change the value in <job_duration>360000</job_duration> to suit your needs.
BOINC normally checks the file size, so when you shorten the value by 1 digit you have to add a character somewhere else (e.g. a space at the front of the line).

I would like to increase the run time limit, not reduce it, in order to let normally running tasks succeed.
Do we know the maximum number of hours a normally running task may need?
ID: 41980 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 934
Credit: 6,286,290
RAC: 664
Message 41981 - Posted: 21 Mar 2020, 20:29:37 UTC - in response to Message 41980.  

I would like to increase the run time limit, not reduce it, in order to let normally running tasks succeed.
Do we know the maximum number of hours a normally running task may need?
No, we don't know. But if you want to monitor the chance of success yourself during run time, you could remove the line with job_duration from the aforementioned xml file.
Also add a line to the options section of cc_config.xml:
        <dont_check_file_sizes>1</dont_check_file_sizes>

The result is tasks without a time limit, where you yourself have to decide: give it a chance or abort.
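
For anyone who prefers not to hand-edit cc_config.xml, the option above can be set with a short Python sketch. The function name is made up, and the sketch assumes the file has the usual `<cc_config>` root with an `<options>` section (created here if missing). Note that the standard-library parser used here drops XML comments, so a hand-commented file would lose them.

```python
import xml.etree.ElementTree as ET


def enable_dont_check_file_sizes(cc_config_xml: str) -> str:
    """Return the cc_config.xml text with dont_check_file_sizes set to 1."""
    root = ET.fromstring(cc_config_xml)
    if root.tag != "cc_config":
        raise ValueError("expected a <cc_config> root element")
    options = root.find("options")
    if options is None:
        # No options section yet: create one.
        options = ET.SubElement(root, "options")
    flag = options.find("dont_check_file_sizes")
    if flag is None:
        flag = ET.SubElement(options, "dont_check_file_sizes")
    flag.text = "1"
    return ET.tostring(root, encoding="unicode")


updated = enable_dont_check_file_sizes("<cc_config><options></options></cc_config>")
```

The client only picks up the change after it re-reads its config files or is restarted.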
ID: 41981 · Report as offensive     Reply Quote
CloverField

Send message
Joined: 17 Oct 06
Posts: 39
Credit: 9,662,936
RAC: 27,328
Message 42017 - Posted: 1 Apr 2020, 0:20:53 UTC - in response to Message 41981.  



I don't think that's going to help; it looks like it will take 6000+ days for these tasks to finish.

All of my long running jobs are still sherpas.

ID: 42017 · Report as offensive     Reply Quote


©2020 CERN