Message boards : Number crunching : Very long job

Profile rebel9

Send message
Joined: 15 Aug 05
Posts: 1
Credit: 718,585
RAC: 0
Message 43592 - Posted: 11 Nov 2020, 21:53:04 UTC
Last modified: 11 Nov 2020, 21:53:31 UTC

I have a WU that has been running for nearly 8 days, taking 4 cores all the while and is beyond its deadline. Others have finished much more quickly. Is this one broken? Will it finish or should I kill it and will I get any credit for it if I do?

Many thanks.
ID: 43592 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,889,766
RAC: 138,323
Message 43593 - Posted: 11 Nov 2020, 22:27:31 UTC - in response to Message 43592.  

BOINC does not reward results that are returned after deadline + grace period.
This task is most likely lost.

If you want others to check the logfile you should make your computers visible and post a link to the faulty task.
ID: 43593 · Report as offensive     Reply Quote
Matthias Lehmkuhl

Send message
Joined: 15 Jul 05
Posts: 23
Credit: 2,052,895
RAC: 1,302
Message 43605 - Posted: 14 Nov 2020, 9:31:02 UTC

Looks like I have a very long runner too:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=283384624
So far I see a CPU time of 7d 18h at 74.6%, with an estimated 2d 09h to go and a deadline of 15.11.2020 05:58 CET shown in the client.
Progress is growing slowly and CPU usage for the task is at 100%.

Is it possible to extend the deadline for the result, which is now 16.11.2020, 4:58:17 UTC?
Or should I cancel the result when the server deadline is over?
Matthias

ID: 43605 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 43612 - Posted: 14 Nov 2020, 20:40:13 UTC - in response to Message 43605.  

Or should I cancel the result when the server deadline is over
The task will stop after 10 days run time.
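
A rough cross-check against that 10-day limit, as a minimal sketch: it assumes progress is linear and that the percentage shown by the client is meaningful, which may not hold for VirtualBox tasks. The function name `estimate_hours` is made up for illustration; the inputs are the 7d 18h (186 hours) and 74.6% reported above.

```shell
#!/bin/sh
# estimate_hours ELAPSED_HOURS PERCENT_DONE
# Prints "<projected total hours> <projected remaining hours>",
# assuming (hypothetically) that progress grows linearly with time.
estimate_hours() {
    awk -v e="$1" -v p="$2" 'BEGIN {
        t = e * 100 / p              # projected total hours
        printf "%.1f %.1f\n", t, t - e
    }'
}

# 7d 18h = 186 hours at 74.6 % done
estimate_hours 186 74.6   # prints: 249.3 63.3
```

Under these assumptions the projected total of roughly 249 hours is above 240 hours (10 days), so the task might be stopped before it finishes. The 186 hours quoted above is CPU time while the limit is on run time, so this is only an indication.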
ID: 43612 · Report as offensive     Reply Quote
Hans Micheelsen
Avatar

Send message
Joined: 10 Aug 11
Posts: 5
Credit: 1,279,291
RAC: 0
Message 43652 - Posted: 19 Nov 2020, 4:17:14 UTC

I have a task that has been running for 4 days and 13 hours. It started out reporting 13 hours to complete and now it says 1:39 minutes to complete. Each second of remaining time takes something like 5 minutes to count down.
I'm getting a little bit frustrated. Will it ever complete the task? Do I understand correctly that I won't get any credit for it? The deadline passed yesterday. But is my computer power wasted?
I hope you don't use the same kind of timers in the real project. Are you really sure at LHC that you can handle this project?
ID: 43652 · Report as offensive     Reply Quote
Hans Micheelsen
Avatar

Send message
Joined: 10 Aug 11
Posts: 5
Credit: 1,279,291
RAC: 0
Message 43653 - Posted: 19 Nov 2020, 4:28:21 UTC - in response to Message 43652.  

The task is: https://lhcathome.cern.ch/lhcathome/result.php?resultid=288895684
ID: 43653 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,889,766
RAC: 138,323
Message 43654 - Posted: 19 Nov 2020, 6:49:47 UTC - in response to Message 43653.  

VBox Additions should be installed.
Then you would be able to check the output from the VM consoles.
In the case of ATLAS, console 2 has a more accurate timer based on the logs from the scientific app.

Timers are influenced by the CPU throttle.
Your CPU throttle is set to 50%.
Vbox apps should be set to 100%.
If it is necessary to limit CPU usage on your computer use other methods.
BOINC provides a couple of them.
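
One such method, sketched here under assumptions rather than as the recommended setup: leave the CPU throttle (`cpu_usage_limit`) at 100% and limit the share of cores (`max_ncpus_pct`) in a local `global_prefs_override.xml`. The helper name `write_override`, the 50% default, and the example directory are illustrative; the BOINC data directory differs between installations.

```shell
#!/bin/sh
# write_override DATA_DIR [CORE_PERCENT]
# Writes a global_prefs_override.xml that keeps the CPU throttle at 100 %
# and limits BOINC to CORE_PERCENT of the cores (50 is only an example).
# DATA_DIR is an assumption: pass your own BOINC data directory.
write_override() {
    prefs_dir="$1"
    core_pct="${2:-50}"
    cat > "$prefs_dir/global_prefs_override.xml" <<EOF
<global_preferences>
   <max_ncpus_pct>${core_pct}.0</max_ncpus_pct>
   <cpu_usage_limit>100.0</cpu_usage_limit>
</global_preferences>
EOF
}

# Example (not run here; /var/lib/boinc is only a common Linux location):
#   write_override /var/lib/boinc 50
#   boinccmd --read_global_prefs_override
```

The same two settings are also available in the computing preferences of the BOINC Manager, so editing the file by hand is optional.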
ID: 43654 · Report as offensive     Reply Quote
Hans Micheelsen
Avatar

Send message
Joined: 10 Aug 11
Posts: 5
Credit: 1,279,291
RAC: 0
Message 43657 - Posted: 21 Nov 2020, 9:48:54 UTC - in response to Message 43654.  
Last modified: 21 Nov 2020, 9:55:18 UTC

All right, I just thought the simulations would use the CPU performance measurement and the computation configuration when forecasting the remaining time.

I deleted the very long task I had and configured BOINC to use 100% CPU time, but only 50% of the available CPUs. I hope that's more correct.
And now I have an Atlas simulation that started with 13 hours remaining. Now, after 1 day of computing, it says 4 hours remaining, and each second of remaining time takes 5 seconds. Let's see how things develop ...

The task is: https://lhcathome.cern.ch/lhcathome/result.php?resultid=289197986

And I have installed VBox Additions. How do I use it?
ID: 43657 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,889,766
RAC: 138,323
Message 43658 - Posted: 21 Nov 2020, 10:18:29 UTC - in response to Message 43657.  

BOINC has no interface to look into the VM.
Hence, it doesn't know anything about the progress of the scientific app and presents unreliable fake values.

Now, if VBox Additions are installed you can select a task in your BOINC Manager and then click on "show VM console".
This opens a window showing console 1 from inside the VM.
Switch through the VM consoles using ALT-F1..ALT-Fn.

ATLAS has progress monitoring at ALT-F2 (based on statistical values from the scientific logs, but better than the BOINC monitoring) and top output at ALT-F3.
Other apps show log output at ALT-F2.
ID: 43658 · Report as offensive     Reply Quote
Hans Micheelsen
Avatar

Send message
Joined: 10 Aug 11
Posts: 5
Credit: 1,279,291
RAC: 0
Message 43659 - Posted: 21 Nov 2020, 10:57:57 UTC - in response to Message 43658.  

I activated the Atlas simulation in VBox but nothing happens with Alt+F1, Alt+F2 etc. I've even tried the virtual keyboard. No reactions.
Also the same on the other LHC simulations I have running in Boinc which are Theory simulations.

One clue might be that on the Atlas simulation the virtual screen says: This kernel requires an x86-64 CPU, but only detected an i686 CPU. The processor is unsupported in CentOS 7.

My CPUs are Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
ID: 43659 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,889,766
RAC: 138,323
Message 43660 - Posted: 21 Nov 2020, 13:36:50 UTC - in response to Message 43659.  

The i7-3770 is a 64-bit CPU and has VT-x/VT-d
See:
https://ark.intel.com/content/www/us/en/ark/products/65719/intel-core-i7-3770-processor-8m-cache-up-to-3-90-ghz.html


Your logfile states:
Processor supports HW virtualization: no

This may point to a wrong or missing BIOS setting.

In addition you may check if all software packages are installed as 64-bit versions, especially:
- the OS itself
- BOINC
- VirtualBox

VirtualBox's kernel drivers may need to be recompiled.
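
On a Linux host, a quick first check could look like the sketch below. It only counts the logical CPUs that advertise the `vmx` (Intel) or `svm` (AMD) flag; the function name `count_virt_flags` is made up for illustration. A non-zero count shows that the CPU offers the feature, not that the firmware has it enabled, so the BIOS setting still has to be checked separately.

```shell
#!/bin/sh
# count_virt_flags FILE
# Counts the "flags" lines in a cpuinfo-style file that list vmx or svm.
# Prints 0 when none is found.
count_virt_flags() {
    grep -Ec '^flags.*[[:space:]](vmx|svm)([[:space:]]|$)' "$1" || true
}

# Example (not run here):
#   count_virt_flags /proc/cpuinfo
# VirtualBox's own view should be visible with:
#   VBoxManage list hostinfo
```

If the count is 0 on a CPU that is known to support VT-x, or the count is fine but VirtualBox still reports no hardware virtualization, the firmware setting and the VirtualBox kernel modules are the next things to look at.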
ID: 43660 · Report as offensive     Reply Quote
Hans Micheelsen
Avatar

Send message
Joined: 10 Aug 11
Posts: 5
Credit: 1,279,291
RAC: 0
Message 43682 - Posted: 22 Nov 2020, 23:41:49 UTC - in response to Message 43660.  

I've checked. Everything is 64-bit and freshly rebuilt. In fact, I got the latest kernel and VirtualBox from the Mageia project just today.
[hansmicheelsen@localhost ~]$ uname -a
Linux localhost.localdomain 5.9.10-desktop-1.mga8 #1 SMP Sun Nov 22 13:48:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

[hansmicheelsen@localhost ~]$ rpm -qa | grep virtualbox
virtualbox-guest-additions-6.1.16-1.mga8
dkms-virtualbox-6.1.16-1.mga8
virtualbox-kernel-5.9.10-desktop-1.mga8-6.1.16-10.mga8
virtualbox-6.1.16-1.mga8
virtualbox-kernel-desktop-latest-6.1.16-10.mga8

[hansmicheelsen@localhost ~]$ rpm -qa | grep boinc
boinc-client-7.16.12-1.mga8
boinc-manager-7.16.12-1.mga8


Status: After 2 days of work, the time remaining is 1 hour 3 minutes, and each remaining second takes 15 seconds to run. Gee, there are 3 days left before the deadline. With 1 hour left to go, I'm afraid I won't make it.
I'll switch off Atlas projects for my BOINC computing. It's a waste of good CPU.
ID: 43682 · Report as offensive     Reply Quote
[AF>Le_Pommier] Jerome_C2005

Send message
Joined: 12 Jul 11
Posts: 93
Credit: 1,129,876
RAC: 7
Message 44151 - Posted: 19 Jan 2021, 9:15:27 UTC - in response to Message 43612.  

Or should I cancel the result when the server deadline is over
The task will stop after 10 days run time.


Is that what happened to this task?

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>Theory_2390-1133723-170_0_r1199769442_result</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
]]>

10 days running (on an Intel i9-10910 @ 3.60GHz) and a miserable ending... and no credit...
ID: 44151 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 44152 - Posted: 19 Jan 2021, 9:41:46 UTC - in response to Message 44151.  
Last modified: 19 Jan 2021, 13:30:35 UTC

...
10 days running (on an Intel i9-10910 @ 3.60GHz) and a miserable ending... and no credit...
That sometimes happens.
If you are not watching the VM's console to judge whether the job is making useful progress, and aborting it when it is not, the system stops the job after 10 days to avoid wasting more time.

In your case it's a pity that somehow the job restarted after 8 days, so we will not know if it could finish in the full 10 days even on your fast host:
2021-01-05 16:12:35 (78303): Guest Log: 16:12:38 CET +01:00 2021-01-05: cranky: [INFO] ===> [runRivet] Tue Jan  5 15:12:37 UTC 2021 [boinc pp jets 8000 250,-,4160 - sherpa 1.4.1 default 100000 170]
2021-01-13 22:02:24 (832): Guest Log: 16:12:38 CET +01:00 2021-01-05: cranky: [INFO] ===> [runRivet] Tue Jan  5 15:12:37 UTC 2021 [boinc pp jets 8000 250,-,4160 - sherpa 1.4.1 default 100000 170]

Sherpa-jobs are known for more issues than jobs using other generators like pythia6 and pythia8.
ID: 44152 · Report as offensive     Reply Quote
maeax

Send message
Joined: 2 May 07
Posts: 2071
Credit: 156,084,038
RAC: 105,553
Message 44153 - Posted: 19 Jan 2021, 9:46:23 UTC
Last modified: 19 Jan 2021, 9:47:15 UTC

[boinc pp jets 8000 250,-,4160 - sherpa 1.4.1 default 100000 170]
Sherpas are very difficult. Sometimes they finish correctly; otherwise the 10-day time limit puts an end to them.
We have to live with this problem; see other threads about Sherpas.
Sorry, CP, you are faster ;-))
ID: 44153 · Report as offensive     Reply Quote
[AF>Le_Pommier] Jerome_C2005

Send message
Joined: 12 Jul 11
Posts: 93
Credit: 1,129,876
RAC: 7
Message 44170 - Posted: 21 Jan 2021, 15:55:22 UTC

Unfortunately I am on macOS, where console access for a task goes through an external application, "cord" (if you click the console button in the BOINC Manager, it asks for cord to be installed). That app is discontinued (not maintained and can no longer be installed). The website still exists and recommends using "freerdp" instead. I tried installing it (with Homebrew), but I have no idea how to tell BOINC to use that one instead...

I think I could use the "graphic" button and then try to locate the log on the task's web page, but there are tons of them... the logs are so complicated to interpret... it's basically a pain in the [biiip].

But if it's all the fault of the sherpas, is there a way to ignore sherpa tasks (however unloved these poor guys may be) and run only the pythia tasks? Using some app_config, maybe?
ID: 44170 · Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 15 Jun 08
Posts: 2386
Credit: 222,889,766
RAC: 138,323
Message 44171 - Posted: 21 Jan 2021, 16:46:53 UTC - in response to Message 44170.  

Unfortunately I am on macOS, where console access for a task goes through an external application, "cord" (if you click the console button in the BOINC Manager, it asks for cord to be installed). That app is discontinued (not maintained and can no longer be installed). The website still exists and recommends using "freerdp" instead. I tried installing it (with Homebrew), but I have no idea how to tell BOINC to use that one instead...

I'm not 100% sure but this might be hardwired in the BOINC client, hence it should be asked here:
https://github.com/BOINC/boinc




...is there a way to ignore sherpa tasks...

Task input is taken from mcplots.
The data set currently in progress has 70981 different combinations of input parameters and event generators.
2141 (3 %) of them are sherpas (which is an event generator) and not all sherpas are long-runners.
Each set is sent out multiple times - the example from your post has #170:
[boinc pp jets 8000 250,-,4160 - sherpa 1.4.1 default 100000 170]

There's no function to ask for a specific parameter set or event generator.
Each computer gets what is at the top position of the task queue.
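
Although a generator cannot be requested, the job line from the log at least shows which one a running task got. A minimal sketch, assuming the generator name and version are always the 7th and 8th whitespace-separated fields, as in the line quoted above (the function name `job_generator` is made up for illustration):

```shell
#!/bin/sh
# job_generator "JOB LINE"
# Prints the event generator and its version from a runRivet job line.
# Assumption: they are fields 7 and 8, as in the example from this thread.
job_generator() {
    printf '%s\n' "$1" | tr -d '[]' | awk '{ print $7, $8 }'
}

job_generator "[boinc pp jets 8000 250,-,4160 - sherpa 1.4.1 default 100000 170]"
# prints: sherpa 1.4.1
```

This only helps to decide early whether a task deserves a closer look at its console; it does not change which tasks the server hands out.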
ID: 44171 · Report as offensive     Reply Quote
Crystal Pellet
Volunteer moderator
Volunteer tester

Send message
Joined: 14 Jan 10
Posts: 1268
Credit: 8,421,616
RAC: 2,139
Message 44176 - Posted: 21 Jan 2021, 20:38:49 UTC - in response to Message 44170.  

... and they recommend to use "freerdp" instead, I tried to install it (with homebrew) but I have no idea how to tell boinc to use that one instead...
...

I don't know Darwin, but I suppose that when you start freerdp it wants to know which computer and port to connect to.

There you can enter localhost:portnr. You can find portnr in several places.
The easiest is the details pane of the virtual machine in VirtualBox Manager: the port number is listed under Remote Desktop Server.
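
As a sketch of what that could look like on the command line, assuming the FreeRDP client is installed as `xfreerdp` (the binary name can differ between installations) and using a made-up helper `rdp_target` that only checks the port and builds the address:

```shell
#!/bin/sh
# rdp_target PORT
# Prints "localhost:PORT" after checking that PORT is a number in 1..65535.
rdp_target() {
    case "$1" in
        ''|*[!0-9]*) echo "port must be a number" >&2; return 1 ;;
    esac
    if [ "$1" -lt 1 ] || [ "$1" -gt 65535 ]; then
        echo "port out of range" >&2
        return 1
    fi
    printf 'localhost:%s\n' "$1"
}

# Example (not run here; 5000 is a placeholder, use the port shown
# under Remote Desktop Server for your VM):
#   xfreerdp "/v:$(rdp_target 5000)"
```

The `/v:` option takes the server and an optional port; whether the VM accepts the connection depends on its remote desktop settings.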
ID: 44176 · Report as offensive     Reply Quote



©2024 CERN