Message boards : Theory Application : New Version v300.05

Laurence
Project administrator
Project developer

Joined: 20 Jun 14
Posts: 372
Credit: 238,712
RAC: 0
Message 41285 - Posted: 17 Jan 2020, 9:38:04 UTC

This version again displays the job description in the output.
ID: 41285
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 41287 - Posted: 17 Jan 2020, 10:53:54 UTC - in response to Message 41285.  
Last modified: 17 Jan 2020, 10:54:38 UTC

I've just had two of these, and (once again) both tasks reported they required significantly more CPU time than was available before the deadline. I've aborted them. Is this problem going to be addressed (please)?
ID: 41287
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,943,935
RAC: 137,304
Message 41288 - Posted: 17 Jan 2020, 11:08:05 UTC - in response to Message 41287.  

As always, this is a miscalculation caused by the BOINC client/server, not by any individual project.
Don't abort the tasks, just let them run and finish*).
After a few days the BOINC client/server will have enough data to adjust the values used for runtime estimation.


*) If you get a longrunner that doesn't finish within a week (!) feel free to abort it.
ID: 41288
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 41289 - Posted: 17 Jan 2020, 12:06:43 UTC - in response to Message 41288.  
Last modified: 17 Jan 2020, 12:08:20 UTC

The problem is they typically don't finish by the deadline, but when passed to another host that host may complete the task quickly, making leaving them to run a waste of host time and electricity. The tasks that say they will complete in a reasonable time do complete in a reasonable time.
ID: 41289
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,943,935
RAC: 137,304
Message 41290 - Posted: 17 Jan 2020, 12:34:46 UTC - in response to Message 41289.  

Version 300.05 was introduced this morning with a deadline of 2020-01-28.
That leaves enough time to find out whether the BOINC ETA is fake or not.
ID: 41290
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 41291 - Posted: 17 Jan 2020, 13:58:45 UTC - in response to Message 41290.  

No, it's not enough time. One Theory task I have at present says it will take over four days of CPU time. The host it is running on is powered up during work hours, i.e. around eight hours a day, five days a week (using spare CPU cycles is the intention behind BOINC). So four days of CPU time will take 12 working days plus two weekends, i.e. 16 days. The deadline is in ten days. Ten is quite a bit less than 16, and I haven't even taken into account doing CPU-intensive tasks as part of my work. I don't mind if a task genuinely takes four days of CPU time (like CPDN tasks typically do), but the deadline needs to be suitably distant in the future.

I did once leave one of these tasks to run beyond the deadline, but even once past the deadline it was still a couple of days away from completing, so I aborted it. Others on this forum have let these long-runner tasks run and run, only to have them fail. That suggests the solution is to fix the bug, rather than to extend the deadline.
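
The deadline arithmetic in this post can be checked with a minimal Python sketch. It assumes the four-day figure really is run time the host has to accumulate (which is disputed later in the thread), that the host runs the task for the full eight hours on every working day, and that the task starts on the first working day of a week. The function name and its parameters are made up for illustration.

```python
import math

def calendar_days_needed(cpu_days, hours_per_day=8, workdays_per_week=5):
    """Calendar days a part-time host needs to accumulate cpu_days of run time.

    Assumption: the task starts on the first working day of a week, so every
    full block of working days is followed by a weekend of idle days.
    """
    cpu_hours = cpu_days * 24
    working_days = math.ceil(cpu_hours / hours_per_day)
    # A weekend is crossed each time a full working week has been used up
    # and there is still work left.
    weekends_crossed = (working_days - 1) // workdays_per_week
    return working_days + weekends_crossed * (7 - workdays_per_week)

needed = calendar_days_needed(4)
print(f"{needed} calendar days needed, deadline in 10")
```

With the numbers from the post this gives 12 working days and 16 calendar days, against a deadline ten days away.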
ID: 41291
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,943,935
RAC: 137,304
Message 41292 - Posted: 17 Jan 2020, 15:25:26 UTC - in response to Message 41291.  

Those estimations are always "fake" when a new app version is introduced (like in this case).
And they are not "CPU time".
Hence nobody can guarantee whether a longrunner will finish before the deadline or not.

I agree that this behavior is not nice but it's also not an issue that can be solved at LHC@home.
Instead it would have to be changed in the BOINC code:
https://github.com/BOINC/boinc/issues
ID: 41292
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 41293 - Posted: 17 Jan 2020, 17:00:28 UTC - in response to Message 41292.  

Well you can call it what you like, but the fact remains that those tasks that say they will complete in a reasonable time (a few hours), and stay that way, will complete within a few hours of computer run time. Those that suddenly jump from displaying a few hours to four days will run and run. Some people have let them run and run, only to find they fail, but I just abort them because I don't want to waste time and electricity (at my expense) on them. Typically when resent the receiving host completes them in a fraction of the time my host spent on them, and I'm not the only person seeing that behaviour. Since this wasn't previously a problem, it strongly suggests a bug has been introduced in the latest Theory tasks.
ID: 41293
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 41294 - Posted: 17 Jan 2020, 18:00:34 UTC - in response to Message 41293.  
Last modified: 18 Jan 2020, 8:14:09 UTC

@Brummig:

I've already told you several times: don't watch the Remaining time in BOINC Manager. It is really useless.
The Theory tasks all have different run times, so the 4 days you see is just a placeholder that jumps in after the task has run for a while.
If you don't trust the run times, monitor the consoles, where you can see how many events have to be processed in total (mostly 100,000) and how many have been processed so far.
Have you monitored a task in the available consoles with the Extension Pack installed?
Have you seen the average run time in MC Production? http://mcplots-dev.cern.ch/production.php?view=revision&rev=2363
Long runners are rare.
ID: 41294
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 41302 - Posted: 18 Jan 2020, 18:01:44 UTC
Last modified: 18 Jan 2020, 18:02:12 UTC

I have one which has been running for two days. It should complete in another two days. CPU usage is nominal.
Tullio
ID: 41302
Brummig
Joined: 9 Feb 16
Posts: 48
Credit: 537,111
RAC: 0
Message 41312 - Posted: 20 Jan 2020, 17:48:38 UTC - in response to Message 41294.  

@Crystal Pellet:
Yes, I monitored one for some time. There was no evidence of any progress, and after switching back and forth between displaying different information, it settled on saying that it had processed zero of zero events. I aborted it, and it went to another host that completed it in a fraction of the time my host had been chewing on it. Curiously, whilst that task ran frantically doing nothing, two tasks, after being aborted, reported the run time and CPU time as zero. For example, task 259168280 has a start timestamp of 2020-01-13 15:31:52. I aborted it at 16 Jan 2020, 8:32:03 UTC because it jumped to an extreme estimated completion time, but apparently it did absolutely nothing during the time it was supposedly running (a couple of hours). Task 259230427 was sent 14 Jan 2020, 12:55:32 UTC, and aborted 15 Jan 2020, 8:59:17 UTC. That second task has just this in the stderr output:
<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
aborted by user</message>
]]>
ID: 41312
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 41313 - Posted: 20 Jan 2020, 18:13:41 UTC - in response to Message 41312.  

Your task 259168280 shows
2020-01-13 17:54:33 (10536): VM state change detected. (old = 'Running', new = 'Paused')
2020-01-13 17:54:33 (10536): Stopping VM.
2020-01-14 08:47:15 (10536): VM did not stop when requested.
2020-01-14 08:47:15 (10536): VM was successfully terminated.
2020-01-14 08:47:56 (20284): Detected: vboxwrapper 26197
2020-01-14 08:47:56 (20284): Detected: BOINC client v7.7

and looks like a fast shutdown that left no time for the VM to save its contents to disk.
After the restart the job probably started from scratch and ran until you aborted it :( It had over 5 hours of CPU time.
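
A log excerpt like this one can be scanned for stops that never completed. The sketch below assumes the line layout shown in the excerpt (timestamp, process ID in parentheses, message) and matches the two message strings literally. The function names are hypothetical and none of this is part of vboxwrapper itself.

```python
from datetime import datetime

LOG = """\
2020-01-13 17:54:33 (10536): VM state change detected. (old = 'Running', new = 'Paused')
2020-01-13 17:54:33 (10536): Stopping VM.
2020-01-14 08:47:15 (10536): VM did not stop when requested.
2020-01-14 08:47:15 (10536): VM was successfully terminated.
"""

def parse(line):
    """Split one log line into (timestamp, message)."""
    stamp, _, rest = line.partition(" (")
    message = rest.partition("): ")[2]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message

def unclean_stops(text):
    """Yield (stop_requested, gave_up) pairs for stops that did not complete."""
    requested = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        when, message = parse(line)
        if message == "Stopping VM.":
            requested = when
        elif message == "VM did not stop when requested." and requested is not None:
            yield requested, when
            requested = None

for requested, gave_up in unclean_stops(LOG):
    print(f"stop requested {requested}, given up after {gave_up - requested}")
```

For this excerpt the gap between the stop request and the forced termination is 14:52:42. That gap may include time during which the host itself was suspended or switched off.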
ID: 41313
zepingouin
Joined: 7 Jan 07
Posts: 41
Credit: 15,959,427
RAC: 271
Message 41316 - Posted: 20 Jan 2020, 20:10:03 UTC

Looking into runRivet.log, I got these last two lines:
59400 events processed
  Event 59500 ( 1d 7h 44m 17s elapsed / 21h 36m 12s left ) -> ETA: Tue Jan 21 17:43

Seems to progress flawlessly.
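
As a rough cross-check of lines like the two quoted above, the elapsed and remaining times can be turned into a completion fraction. This is only a sketch: the line format is assumed from the single example in this post, the helper names are hypothetical, and other event generators may log progress differently.

```python
import re

# Matches the progress line quoted above, e.g.
#   Event 59500 ( 1d 7h 44m 17s elapsed / 21h 36m 12s left ) -> ETA: ...
PROGRESS = re.compile(
    r"Event\s+(\d+)\s+\(\s*(.+?)\s+elapsed\s*/\s*(.+?)\s+left\s*\)"
)
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}

def to_seconds(text):
    """Convert a duration such as '1d 7h 44m 17s' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([dhms])", text))

def parse_progress(line):
    """Return (event, elapsed_s, left_s), or None if this is not a progress line."""
    m = PROGRESS.search(line)
    if m is None:
        return None
    return int(m.group(1)), to_seconds(m.group(2)), to_seconds(m.group(3))

line = "  Event 59500 ( 1d 7h 44m 17s elapsed / 21h 36m 12s left ) -> ETA: Tue Jan 21 17:43"
event, elapsed, left = parse_progress(line)
fraction_done = elapsed / (elapsed + left)
print(f"event {event}: {fraction_done:.1%} done by time")
```

For the quoted line this works out to about 59.5% done by time, which matches 59,500 events if the job has the usual 100,000 events mentioned earlier in the thread.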
ID: 41316
tullio

Joined: 19 Feb 08
Posts: 708
Credit: 4,336,250
RAC: 0
Message 41319 - Posted: 21 Jan 2020, 5:44:57 UTC

Failed after 4 days 4 hours 24 min 4 seconds with xfer_error
Tullio
ID: 41319
zepingouin
Joined: 7 Jan 07
Posts: 41
Credit: 15,959,427
RAC: 271
Message 41322 - Posted: 21 Jan 2020, 19:26:20 UTC

In the end, task 259592915 finished successfully after 54 hours.
ID: 41322
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,384,907
RAC: 102,179
Message 41338 - Posted: 23 Jan 2020, 15:52:45 UTC - in response to Message 41319.  
Last modified: 23 Jan 2020, 15:53:10 UTC

tullio wrote:
Failed after 4 days 4 hours 24 min 4 seconds with xfer_error
Tullio

Same thing here this morning, after 4 days 4 hours 37 minutes!

Such faulty tasks are rather annoying :-(
ID: 41338
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,384,907
RAC: 102,179
Message 41340 - Posted: 24 Jan 2020, 4:01:49 UTC

the next one:

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>Theory_2363-930665-16_0_r255393757_result</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
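
For anyone collecting these failures, the file name and error code can be pulled out of such a message with a small regular expression. This is a minimal sketch that assumes the message always has the shape shown above. The function name is hypothetical, and the fragment is treated as plain text because it is not a complete XML document.

```python
import re

MESSAGE = """\
<message>
upload failure: <file_xfer_error>
<file_name>Theory_2363-930665-16_0_r255393757_result</file_name>
<error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
</message>
"""

# One match per <file_xfer_error> block: file name, numeric code, and the
# description the client printed in parentheses after the code.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(.*?)</file_name>\s*"
    r"<error_code>(-?\d+)\s*\((.*)\)</error_code>"
)

def upload_errors(text):
    """Return a list of (file_name, code, description) tuples found in text."""
    return [(name, int(code), desc) for name, code, desc in XFER_ERROR.findall(text)]

for name, code, desc in upload_errors(MESSAGE):
    print(f"{name}: error {code} ({desc})")
```

Run over several stderr outputs, this makes it easy to see whether the failing tasks all report the same code.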

failing after 4 days and 4 hours:-(

https://lhcathome.cern.ch/lhcathome/result.php?resultid=259616872

what's going on over there?
ID: 41340
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,384,907
RAC: 102,179
Message 41356 - Posted: 25 Jan 2020, 12:13:14 UTC - in response to Message 41340.  

here the next one:

https://lhcathome.cern.ch/lhcathome/result.php?resultid=259667718

after 4 days 4 hours !!! Quite some waste of CPU time :-(

What the hell is this caused by?
ID: 41356
Erich56

Joined: 18 Dec 15
Posts: 1686
Credit: 100,384,907
RAC: 102,179
Message 41358 - Posted: 25 Jan 2020, 12:33:48 UTC

Recently I have noticed strange behaviour on all my computers when downloading Theory tasks (all VM):
Although the web settings have "max # of tasks" set to 8, each of my hosts downloads only 2 tasks. Why is that? Does anyone else experience the same thing?
If not, what's going wrong here?
ID: 41358
Crystal Pellet
Volunteer moderator
Volunteer tester

Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 41359 - Posted: 25 Jan 2020, 13:53:05 UTC - in response to Message 41358.  

Erich56 wrote:
Recently I have noticed strange behaviour on all my computers when downloading Theory tasks (all VM):
Although the web settings have "max # of tasks" set to 8, each of my hosts downloads only 2 tasks. Why is that? Does anyone else experience the same thing?
If not, what's going wrong here?

I'll answer this one first, then your previous post.
It looks like you get twice the value set in your preferences for Max # CPUs, so I suppose you have set that to 1.
It's the same old, odd behaviour of that preference setting. Try 'No limit' if you are only running VBox Theory and no ATLAS. You will then get 16 tasks.
ID: 41359