41) Message boards : Theory Application : New version 300.00 (Message 40467)
Posted 15 Nov 2019 by Luigi R.
Post:
Theory Simulation
Unsent 1939
In progress 13821

My notebook gets no more than 1 task.

Tried an app_config with project_max_concurrent = 4 and max_concurrent (for Theory) = 4.
Tried ncpus > 1000.
Tried work queue = 10 days.

Always "No tasks are available for Theory Simulation".

Tried 4 boinc clients.
3 clients got 1 task, 1 client got 0 tasks...
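
For reference, a minimal sketch of an app_config.xml like the one described in this post, printed by a small shell function. The helper name make_app_config is made up for illustration, and "Theory" as the app's short name is an assumption (the real name should be checked in client_state.xml). In the BOINC documentation the project-wide element is spelled project_max_concurrent. Both settings are client-side upper limits, so by themselves they would not be expected to make the server hand out more tasks.

```shell
#!/bin/bash
# Sketch only: prints an app_config.xml that caps concurrently running tasks.
# $1 = app short name (assumed to be "Theory" here), $2 = limit.
make_app_config() {
    local app_name="$1" limit="$2"
    cat <<EOF
<app_config>
    <app>
        <name>$app_name</name>
        <max_concurrent>$limit</max_concurrent>
    </app>
    <project_max_concurrent>$limit</project_max_concurrent>
</app_config>
EOF
}

# Print the configuration matching the values tried in the post.
make_app_config "Theory" 4
```

The output would be saved as app_config.xml in the project's folder under projects/ in the BOINC data directory.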
42) Message boards : ATLAS application : ATLAS vbox version 2.00 (Message 40370)
Posted 7 Nov 2019 by Luigi R.
Post:
Typical config entries

/etc/cvmfs/default.local
CVMFS_REPOSITORIES="atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch,cernvm-prod.cern.ch"
[...]

It should be checked (by the project team) if CVMFS_SERVER_URL lists at least 4 servers. Then it's very unlikely that all of them fail at the same moment.
cernvm-prod.cern.ch does not appear in my stderr files.
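
A rough way to do the "at least 4 servers" check, assuming CVMFS_SERVER_URL is a semicolon-separated list as in a standard CernVM-FS client configuration. The function name and the example.org URLs are hypothetical; they are not the project's real server list. On a native CVMFS installation the effective value could probably be read with cvmfs_config showconfig, but that is not shown in this thread.

```shell
#!/bin/bash
# Counts the entries in a semicolon-separated CVMFS_SERVER_URL value.
count_cvmfs_servers() {
    local urls="$1" IFS=';'
    local -a servers
    # Split the string on ';' into an array and report its length.
    read -r -a servers <<< "$urls"
    echo "${#servers[@]}"
}

# Hypothetical value, for illustration only.
example_urls="http://s1.example.org/cvmfs/@fqrn@;http://s2.example.org/cvmfs/@fqrn@;http://s3.example.org/cvmfs/@fqrn@;http://s4.example.org/cvmfs/@fqrn@"

n=$(count_cvmfs_servers "$example_urls")
if [ "$n" -ge 4 ]; then
    echo "$n servers listed"
else
    echo "only $n servers listed"
fi
```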


Client side issues could be:
- wrong firewall settings, e.g. closed ports or filtered destinations
- slow DNS resolving
- high load on the router (not the same as high bandwidth usage!) that causes timeouts
Maybe this. My router's Wi-Fi is not as good as my old router's, and it also has to serve Theory VMs, smartphones, pay TV, etc.
43) Message boards : ATLAS application : ATLAS vbox version 2.00 (Message 40356)
Posted 4 Nov 2019 by Luigi R.
Post:
Yeah, thank you. Otherwise I can write a bash script that parses stderr.txt and automatically aborts the affected task when three "Probing /cvmfs/*... Failed!" lines appear (and those three lines must be consecutive).
Ok, it should work.

#!/bin/bash

# Edit these to match your installation.
boinc_path="/home/luis/Applicazioni/boinc"
lhc_project_url="https://lhcathome.cern.ch/lhcathome/"
atlas_app_name="ATLAS"
boinccmd="./boinccmd"

# Succeeds (returns 0) if the given slot directory holds an ATLAS task.
function isAtlasTask()
{
	local init_data="$1/init_data.xml"
	[ -e "$init_data" ] || return 1
	local app_name
	app_name=$(sed -n 's|.*<app_name>\([^<]*\)</app_name>.*|\1|p' "$init_data")
	[[ "$app_name" == "$atlas_app_name" ]]
}

# Loop over the slot directories that actually exist (numbers may have gaps).
for slot_dir in "$boinc_path"/slots/*/; do
	slot_dir="${slot_dir%/}"
	isAtlasTask "$slot_dir" || continue
	stderr="$slot_dir/stderr.txt"
	[ -e "$stderr" ] || continue
	# c = current run of consecutive failed probes, max = longest run seen.
	c=0
	max=0
	while IFS= read -r line; do
		if [[ "$line" == *"Probing /cvmfs/"*"... Failed!" ]]; then
			c=$((c+1))
			if [ $c -gt $max ]; then max=$c; fi
		else
			c=0
		fi
	done < "$stderr"
	echo "$max consecutive probing fails found in $stderr"
	if [ $max -ge 3 ]; then
		boinc_task_state="$slot_dir/boinc_task_state.xml"
		task_name=$(sed -n 's|.*<result_name>\([^<]*\)</result_name>.*|\1|p' "$boinc_task_state")
		cd "$boinc_path" && "$boinccmd" --task "$lhc_project_url" "$task_name" suspend #abort
		echo "$task_name suspended!" #aborted!"
	fi
done

When my script finds 3 consecutive probing fails, it suspends the affected ATLAS task.
Edit the boinc_path variable to point to your BOINC path.
Edit the boinccmd variable depending on whether you have a standalone or a service BOINC client.
Delete "suspend" and uncomment "abort" if you want a more destructive behaviour.

Run this script every 15 minutes with a command like this:
watch -n 900 /your_script_path/AtlasProbingFailedCheck.sh
44) Message boards : ATLAS application : ATLAS vbox version 2.00 (Message 40354)
Posted 4 Nov 2019 by Luigi R.
Post:
- If the probe fails, the probe logs the messages like you saw but then continues (I changed this not to fail the job because the probe failure can be temporary)
So, does the script check it again, or does something else try to download a job anyway? What's the interval?

I think what is happening in your case is that there is a problem with CVMFS which causes the copy of the bootstrap script to hang forever. I can put a timeout around this to avoid blocking the task forever but I'll need to make changes in the VM and make a new app version. I will look into it next week since I am away at a conference this week.
Yeah, thank you. Otherwise I can write a bash script that parses stderr.txt and automatically aborts the affected task when three "Probing /cvmfs/*... Failed!" lines appear (and those three lines must be consecutive).
45) Message boards : ATLAS application : ATLAS vbox version 2.00 (Message 40346)
Posted 31 Oct 2019 by Luigi R.
Post:
I manually aborted 1 task that was doing nothing: https://lhcathome.cern.ch/lhcathome/result.php?resultid=250466685
Why did this task not get aborted by client (e.g. exit status: 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT)?
2019-10-31 14:38:31 (2419): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-31 14:38:31 (2419): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2019-10-31 14:38:31 (2419): Guest Log: Probing /cvmfs/grid.cern.ch... Failed!


2 other tasks still running. ;)
2019-10-31 03:26:08 (30814): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-31 03:26:08 (30814): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2019-10-31 03:26:11 (30814): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2019-10-31 17:06:19 (1960): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-31 17:06:19 (1960): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... Failed!
2019-10-31 17:06:23 (1960): Guest Log: Probing /cvmfs/grid.cern.ch... OK


New task running.
2019-10-31 13:57:56 (12291): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... OK
2019-10-31 13:57:59 (12291): Guest Log: Probing /cvmfs/grid.cern.ch... OK
Third probing try missing. :)
46) Message boards : Theory Application : 1 (0x00000001) Unknown error code - what is this? (Message 40272)
Posted 24 Oct 2019 by Luigi R.
Post:
I don't know if this helps:
1) my browser sometimes gets stuck loading lhcathome.cern.ch too, and a "The connection has timed out" error comes up. After a while, I reload the page and it's OK
2) the BOINC client sometimes gets Error 408 on the LHC@Home project

Everything was OK about 3 weeks ago. Then I left for vacation and now I have found this. Maybe it's a problem with my internet connection.
47) Message boards : Theory Application : 1 (0x00000001) Unknown error code - what is this? (Message 40256)
Posted 24 Oct 2019 by Luigi R.
Post:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=249519315
https://lhcathome.cern.ch/lhcathome/result.php?resultid=249519030
https://lhcathome.cern.ch/lhcathome/result.php?resultid=249668542
48) Message boards : Number crunching : computation errors (Message 40002)
Posted 22 Sep 2019 by Luigi R.
Post:
CVMFS not found, aborting the job.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=246654334
49) Message boards : Theory Application : New version 263.90 (Message 39975)
Posted 19 Sep 2019 by Luigi R.
Post:
Some theory tasks (idle or not) automatically end after 36 hours.

Otherwise, try this:
1) Uncheck "Leave applications in memory while suspended"

2) Suspend your idle tasks, check task slot number (N=0,1,2, etc...)

3a) Open your_boinc_data_directory/slots/N/vbox_checkpoint.xml, replace elapsed_time value with "129570.000000", then save
3b) Open your_boinc_data_directory/slots/N/boinc_task_state.xml, replace checkpoint_elapsed_time value with "129570.000000", then save


4) Resume your idle tasks

They should end within 30 seconds.
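
Step 3b can be sketched in shell as below. This is only a sketch under assumptions: the function name is made up, and it assumes the value sits on one line as <checkpoint_elapsed_time>...</checkpoint_elapsed_time>. Step 3a is left manual because the layout of vbox_checkpoint.xml is not shown here. For what it's worth, 36 hours is 129600 seconds, so 129570 is 30 seconds short of it, which presumably is why the tasks end within 30 seconds.

```shell
#!/bin/bash
# Sketch of step 3b: overwrite the checkpoint_elapsed_time value in a
# boinc_task_state.xml file. A backup copy is kept as <file>.bak.
# $1 = path to boinc_task_state.xml, $2 = new value in seconds.
set_checkpoint_elapsed_time() {
    local state_file="$1" new_value="$2"
    cp -- "$state_file" "$state_file.bak" || return 1
    # Read from the backup and rewrite the original with the new value.
    sed "s|<checkpoint_elapsed_time>[^<]*</checkpoint_elapsed_time>|<checkpoint_elapsed_time>$new_value</checkpoint_elapsed_time>|" \
        "$state_file.bak" > "$state_file"
}
```

With the task suspended, it would be called as: set_checkpoint_elapsed_time your_boinc_data_directory/slots/N/boinc_task_state.xml 129570.000000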

P.S. I don't know if you can skip some steps of my procedure. It always worked, so I didn't modify it.
50) Message boards : Theory Application : New version 263.90 (Message 39971)
Posted 19 Sep 2019 by Luigi R.
Post:
It's not a real problem. That host has got 4 cores (8 threads) and 8GB RAM. It can successfully run 5, maybe 6, 1-CPU tasks.
So I set it to run 4 VMs with 2 CPUs each to use less RAM.
The best case: all the VMs run at 200% and it uses 8 threads and 4 cores.
The worst case: all the VMs run at 100% and it uses 4 threads and 4 cores.
It's working well.
51) Message boards : Theory Application : New version 263.90 (Message 39969)
Posted 19 Sep 2019 by Luigi R.
Post:
I guess maybe Virtualbox crashed.
52) Message boards : Theory Application : New version 263.90 (Message 39968)
Posted 19 Sep 2019 by Luigi R.
Post:
https://lhcathome.cern.ch/lhcathome/result.php?resultid=245995393

Why did this WU fail, why did the VM shut down prematurely, and why am I not getting credits for it?
53) Message boards : Number crunching : Weird Statistics (Message 39905)
Posted 11 Sep 2019 by Luigi R.
Post:
Their hosts exist.
http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=584991
http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=583486

You can delete your account if you want.


So, why should the stats be nonsense?

The rest is privacy...


Edit: do you mean avg credit is too high? If yes, maybe it is an effect of user deletion?

Anyway
https://webcache.googleusercontent.com/search?q=cache:NxSjxiHhciwJ:https://lhcathome.cern.ch/lhcathome/show_user.php%3Fuserid%3D583486+&cd=1&hl=it&ct=clnk&gl=it
54) Message boards : Theory Application : New version 263.90 (Message 39864)
Posted 8 Sep 2019 by Luigi R.
Post:
The benchmark results seem to be valid for that CPU.
[...]
OTOH BOINC's credit calculation includes some components to identify outliers or cheats.
As a result the credit reward for a single task should not be treated as stable.
It will need at least a week without any setup changes to get stable values.
Credit drop does not look benchmark-related.

Now my notebook gets 477cr/day per thread, almost exactly 1/n_threads of the previous amount.
I can't accept this when there are still hosts that get more than ~10k per thread daily, so I turned it off.
55) Message boards : Theory Application : New version 263.90 (Message 39802)
Posted 2 Sep 2019 by Luigi R.
Post:
I suppose, you are not satisfied with the credit ??
hihi, with the value 97,40 I wouldn't be satisfied either :-)
even my old old old AMD Turion Dual-Core ZM-80 (contained in a notebook) produces more than 200 points within about same processing time.

The free ride is over for me. :(
I was using my notebook only because credits were good.
Daily credits were 3700/thread independently of CPU frequency.
56) Message boards : Theory Application : New version 263.90 (Message 39801)
Posted 2 Sep 2019 by Luigi R.
Post:
Is the measured floating point speed of 0.87 billion ops/sec correct?

Run CPU benchmarks from BOINC Manager's Tools menu with the system not otherwise in use.
It's the correct value at 1.4-1.6GHz and, yes, my notebook is actually set at 1.2GHz, sometimes at 1.8GHz to speed up other tasks.

Here are my benchmark results at various frequencies (from November 2018).

freq    ( flo/  int)
 800MHz ( 452/ 3048)
1000MHz ( 565/ 3831)
1200MHz ( 679/ 4649)
1400MHz ( 792/ 5413)
1600MHz ( 905/ 6188)
1800MHz (1017/ 6942)
2000MHz (1131/ 7727)
2200MHz (1243/ 8484)

The previously measured floating point speed was 18.97 GFLOPS, which has always been impossible.
https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=122897843
So why should credits depend on that value now?
57) Message boards : Theory Application : New version 263.90 (Message 39796)
Posted 2 Sep 2019 by Luigi R.
Post:
1 CPU server-side.

https://lhcathome.cern.ch/lhcathome/workunit.php?wuid=123161593

Come on!
58) Message boards : Theory Application : New version 263.90 (Message 39784)
Posted 1 Sep 2019 by Luigi R.
Post:
Just reduce the number of CPUs on your web preferences and the rsc_memory_bound is reduced accordingly. I have set 4 CPUs on web site and Boinc reserves 3750 MB memory for each Theory task. I have set 730 MB and 1 CPU core with app_config and that is what the app is actually using.
It worked. I have set 1 CPU and got tasks with 1500MB reserved.


Anyway I have set 8-thread VMs and it's working fine compared to 1 year ago (post 36607).

BOINC Manager reports:
CPU time = 51.1h
elapsed time = 7.5h
working set size = 6.59GB

Average CPU used: 51.1 / 7.5 = 6.8
I hope BOINC credits will not drop when I run 1 task.
The ratio got worse.

Now avg = 122.1 / 24.0 = 5.1 threads.

There is only 1 running thread right now, so multithreading is still an unreliable option for me.
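
The two ratios in this post are simply CPU time divided by elapsed time. A small sketch of that calculation, with a made-up function name; the numbers are the ones quoted in the post.

```shell
#!/bin/bash
# Average number of busy threads = CPU time / elapsed time (same unit for both).
avg_threads() {
    LC_ALL=C awk -v cpu="$1" -v wall="$2" 'BEGIN { printf "%.1f\n", cpu / wall }'
}

avg_threads 51.1 7.5     # first measurement, prints 6.8
avg_threads 122.1 24.0   # later measurement, prints 5.1
```

On an 8-thread VM a value near 8 would mean all threads stay busy, so the drop from 6.8 to 5.1 matches the impression that multithreading is not used reliably.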
59) Message boards : Theory Application : New version 263.90 (Message 39781)
Posted 31 Aug 2019 by Luigi R.
Post:
they use the same amount of RAM as before, but due to this (probably unintended) change in the parameters, BOINC thinks the VM needs much more RAM and therefore does not let more than a few VMs start at a time.

Of course, this is what I meant.


Anyway I have set 8-thread VMs and it's working fine compared to 1 year ago (post 36607).

BOINC Manager reports:
CPU time = 51.1h
elapsed time = 7.5h
working set size = 6.59GB

Average CPU used: 51.1 / 7.5 = 6.8
I hope BOINC credits will not drop when I run 1 task.
60) Message boards : Theory Application : New version 263.90 (Message 39775)
Posted 31 Aug 2019 by Luigi R.
Post:
I crunch ATLAS + VLHC on another 4-core host that has got 24GB. Is it not enough?
If every task needs 6.5GB, this host would use only 3 cores too.

Could someone get this problem of the too high RAM allocation solved?
It's not an allocation problem for now because the VMs actually use the same amount of RAM as before.
It's about how BOINC calculates how many tasks to run.

7077888000*3 < 24GB
7077888000*4 > 24GB !!!

I don't know if they will use 6.5GB in the future.
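
The comparison above amounts to an integer division. A minimal sketch, assuming the whole 24GB is available to BOINC (this ignores the client's memory-usage preferences) and taking 24GB as 24 GiB; the function name is made up.

```shell
#!/bin/bash
# How many tasks fit if the client budgets bound_bytes of RAM per task.
# $1 = RAM available to BOINC in bytes, $2 = per-task memory bound in bytes.
max_tasks_by_memory() {
    local ram_bytes="$1" bound_bytes="$2"
    echo $(( ram_bytes / bound_bytes ))
}

ram=$(( 24 * 1024 * 1024 * 1024 ))   # 24 GiB = 25769803776 bytes
max_tasks_by_memory "$ram" 7077888000   # prints 3
```

7077888000 bytes is about 6.59 GiB, the same figure as the working set size mentioned for the 8-thread VM.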




©2024 CERN