21) Message boards : ATLAS application : Panda status (Message 36727)
Posted 16 Sep 2018 by m
Post:
Recent tasks here have appeared to complete successfully at this end but still show "running" at the CERN (PanDA) end after several hours.

https://lhcathome.cern.ch/lhcathome/result.php?resultid=206716615

It happened to -dev tasks, too:-

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2398589

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2398588

Not sure about this, though:-
"This is trying to run the run_atlas wrapper for the 2nd time,..."
I'm pretty certain all these tasks ran without interruption.
22) Message boards : News : CMS production pause (Message 36479)
Posted 17 Aug 2018 by m
Post:
Another one: https://lhcathome.cern.ch/lhcathome/result.php?resultid=205073544

In principle this doesn't seem to be a new problem; all the current VM projects suffer - it's just worse now.
Tasks seem to fail on restart if the wrapper doesn't "see" that a job has been completed.
Previously this information appeared to be "saved" over the shutdown, so a failure only occurred if no job had been completed before the shutdown (I've got lots of these...).
This "saving" no longer happens, or it's hidden inside the container, so tasks fail.

It's probably more complicated than this, but this is how it seems to behave here.
23) Message boards : Theory Application : New Version 263.70 (Message 35978)
Posted 20 Jul 2018 by m
Post:
the VM still sometimes fails to use the local squid.

Do you still have some VMs with errors?

Yes, in total I have details of three tasks:-
In all three cases the proxy was reported as detected, but the VM was not reported as set up to use it.
Whether the proxy was (or wasn't) actually used is judged from entries in the squid access log.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=200124140 (Proxy used)
https://lhcathome.cern.ch/lhcathome/result.php?resultid=200145344 (Proxy not used)
https://lhcathome.cern.ch/lhcathome/result.php?resultid=200221912 (Proxy used)

The last two are the same host.

The inconsistency is puzzling; I'll have to wait for some more tasks. If anyone else sees these failures it would be interesting to see their results. Hopefully I haven't misread something somewhere...
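
For anyone who wants to make the same check, this is roughly what I watch on the squid box while a VM is starting up - only a sketch, and the access log path may differ on your distro:

# watch for CVMFS requests reaching the local squid (log path is an example)
tail -f /var/log/squid/access.log | grep --line-buffered '/cvmfs/'

Lines mentioning /cvmfs/ appearing while the task runs mean the proxy really was used; none at all suggests the VM went DIRECT.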
24) Message boards : Theory Application : New Version 263.70 (Message 35935)
Posted 16 Jul 2018 by m
Post:
VMs are using the local squid again
Working OK so far.
Thanks Laurence.

Maybe I wrote too soon - the VM still sometimes fails to use the local squid.
2018-07-14 03:46:43 (2620): Guest Log: [DEBUG] Detected squid proxy http://192.168.100.137:3128

2018-07-14 03:47:59 (2620): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE

2018-07-14 03:47:59 (2620): Guest Log: 2.4.4.0 3533 1 25768 6661 3 1 183731 10240000 2 65024 0 15 93.3333 13 21 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch DIRECT 0
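
The giveaway is the PROXY column near the end of that table: DIRECT instead of http://192.168.100.137:3128. The same table can be produced by hand from the VM console - a sketch, assuming the repository is mounted:

# run inside the guest; prints the same VERSION/PID/.../PROXY/ONLINE table
cvmfs_config stat grid.cern.ch

If the PROXY column shows the squid URL, the local proxy is in use; DIRECT means it was bypassed.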
25) Message boards : Theory Application : New Version 263.70 (Message 35875)
Posted 12 Jul 2018 by m
Post:
VMs are using the local squid again

(3909): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2018-07-11 02:53:10 (3909): Guest Log: 2.4.4.0 3540 1 25728 6631 3 1 183741 10240000 2 65024 0 15 100 0 0 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch http://192.168.100.137:3128 1
2018-07-11 02:53:11 (3909): Guest Log: [INFO] Reading volunteer information

Working OK so far.
Thanks Laurence.
26) Message boards : Theory Application : New Version 263.70 (Message 35807)
Posted 7 Jul 2018 by m
Post:
The heartbeat interval is 20mins and it should beat every minute. So the VM is killed if it takes longer than 20mins to boot or has frozen for 20 minutes.


Are the times in the tasks below right? It looks like the timeout is still 10 mins and the heartbeat interval is 20 mins - surely I'm misreading this?
The actual failure is probably OK - it tried to use 2 CPUs when it shouldn't have.

Theory Simulation v263.70 (vbox64_mt_mcore)
x86_64-pc-linux-gnu

2018-07-07 01:00:20 (7559): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
.....
2018-07-07 01:00:28 (7559): Successfully started VM. (PID = '8126')
.....
2018-07-07 01:10:23 (7559): VM Heartbeat file specified, but missing.
2018-07-07 01:10:23 (7559): VM Heartbeat file specified, but missing file system status. (errno = '2')

Another host...


2018-07-07 06:26:32 (2567): Detected: Heartbeat check (file: 'heartbeat' every 1200.000000 seconds)
.....
2018-07-07 06:26:38 (2567): Successfully started VM. (PID = '3049')
.....
2018-07-07 06:36:33 (2567): VM Heartbeat file specified, but missing.
2018-07-07 06:36:33 (2567): VM Heartbeat file specified, but missing file system status. (errno = '2')
27) Message boards : Theory Application : New Version 263.70 (Message 35785)
Posted 5 Jul 2018 by m
Post:
I'm seeing an increase in heartbeat failures, possibly because the failure of the VM to use the local squid is slowing the startup process considerably. I can't be more precise since the history has been deleted.

Could the timeouts be increased to allow for slow(er) hosts, internet connections, etc., and all the stuff (AV updates and whatnot) that goes on at startup?

Most of the failures are on startup. Since tasks stop and restart at least once here, a lot of time can be wasted when tasks fail on subsequent starts.

Once upon a time... a config change was made to CMS which, as I remember, largely fixed the problem.
Then it was taken away... and never returned; no explanation.
28) Message boards : Theory Application : New Version 263.70 (Message 35781)
Posted 5 Jul 2018 by m
Post:
I'm seeing an increase in heartbeat failures, possibly because the failure of the VM to use the local squid is slowing the startup process considerably. I can't be more precise since the history has been deleted.

Could the timeouts be increased to allow for slow(er) hosts, internet connections, etc., and all the stuff (AV updates and whatnot) that goes on at startup?
29) Message boards : News : CERN network problem (Message 35112)
Posted 29 Apr 2018 by m
Post:
Mine are still trying... and not getting very far.

ACCESSED SITE                   CONNECT   BYTES   TIME      USERS
cvmfs02.grid.sinica.edu.tw      64        0       0:06:01   10
30) Message boards : Number crunching : Setting up a local squid cache for a home cluster - old comments (Message 34868)
Posted 3 Apr 2018 by m
Post:
From what I see here:
About 4 weeks ago Laurence changed the bootstrap script that is executed by every Theory, CMS and LHCb VM. This script transfers the proxy setting from your local BOINC client into your starting VM
- this works every time, but -
and configures the VM internal CVMFS to use the local proxy.
- this works roughly 2/3 of the time.
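
For comparison, on an ordinary CVMFS client that second step would end up looking roughly like this - a sketch only, since whatever the bootstrap script actually writes inside the VM is hidden from us, and the proxy address is just my local squid as an example:

# /etc/cvmfs/default.local (example values, with a DIRECT fallback)
CVMFS_HTTP_PROXY="http://192.168.100.137:3128;DIRECT"

# apply the change without remounting
cvmfs_config reload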
31) Message boards : Number crunching : Setting up a local squid cache for a home cluster - old comments (Message 34866)
Posted 3 Apr 2018 by m
Post:

This should be implemented. If the VM detects a BOINC proxy has been configured with the default squid port, it will try to use it for CVMFS.

sometimes it works...

and sometimes it doesn't...

Checking through the project database for the last few days shows:-

9 hosts ran 24 VM tasks (5 CMS, 5 LHCb and 14 Theory).

1 CMS failed with no heartbeat file.

All the remaining 23 tasks detected the proxy correctly.

1 Theory failed to connect on port 80.

Of the remaining 22 tasks, 7 (3 Theory, 3 CMS and 1 LHCb) failed to set the VM to use the proxy.

Is this what is meant by "try to use" the proxy?
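
For reference, the client-side half of this is nothing more than the normal BOINC proxy setting pointing at the squid. Something like the following should do it, though I'm quoting the boinccmd syntax from memory (check boinccmd --help first); the same thing can be set through the BOINC Manager proxy options:

# hedged example: HTTP proxy host and port first, unused SOCKS/user/password fields left empty
boinccmd --set_proxy_settings 192.168.100.137 3128 '' '' '' 0 '' '' ''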
32) Message boards : Number crunching : Setting up a local squid cache for a home cluster - old comments (Message 34810)
Posted 30 Mar 2018 by m
Post:

This should be implemented. If the VM detects a BOINC proxy has been configured with the default squid port, it will try to use it for CVMFS.

Edit: note that this will currently only work with Theory, LHCb and CMS

Sometimes it works...

2018-03-07 22:37:47 (18466): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2018-03-07 22:37:48 (18466): Guest Log: [DEBUG] Detected squid proxy http://192.168.100.137:3128
2018-03-07 22:38:52 (18466): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80

(...time passes...)

2018-03-07 22:38:55 (18466): Guest Log: [DEBUG] Probing CVMFS ...
2018-03-07 22:38:55 (18466): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2018-03-07 22:38:58 (18466): Guest Log: Probing /cvmfs/sft.cern.ch... OK
2018-03-07 22:38:58 (18466): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2018-03-07 22:38:58 (18466): Guest Log: 2.2.0.0 3429 1 20276 5613 3 1 392693 10240001 2 65024 0 15 100 0 0 http://cernvmfs.gridpp.rl.ac.uk/cvmfs/grid.cern.ch http://192.168.100.137:3128 1
2018-03-07 22:39:06 (18466): Guest Log: [INFO] Reading volunteer information
2018-03-07 22:39:06 (18466): Guest Log: [INFO] Volunteer: m (178) Host: 1422

and sometimes it doesn't...

2018-03-07 22:57:36 (19112): Guest Log: [INFO] Mounting the shared directory
2018-03-07 22:57:36 (19112): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2018-03-07 22:57:36 (19112): Guest Log: [DEBUG] Detected squid proxy http://192.168.100.137:3128
2018-03-07 22:58:51 (19112): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2018-03-07 22:58:52 (19112): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded!

(... more time goes by...)

2018-03-07 22:58:53 (19112): Guest Log: [DEBUG] Probing CVMFS ...
2018-03-07 22:58:54 (19112): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2018-03-07 22:58:58 (19112): Guest Log: Probing /cvmfs/sft.cern.ch... OK
2018-03-07 22:58:58 (19112): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2018-03-07 22:58:58 (19112): Guest Log: 2.2.0.0 3422 1 22264 5613 3 1 392693 10240001 2 65024 0 15 100 0 0 http://cernvmfs.gridpp.rl.ac.uk/cvmfs/grid.cern.ch DIRECT 1
2018-03-07 22:59:04 (19112): Guest Log: [INFO] Reading volunteer information
2018-03-07 22:59:04 (19112): Guest Log: [INFO] Volunteer: m (178) Host: 1422

Clearly, these logs are a bit old, but the problem remains. At the moment everything here is beavering away running Sixtrack, so there aren't enough VM tasks to get any idea of the success rate.
33) Message boards : Sixtrack Application : Error downloading sixtrack app for linux (Message 34797)
Posted 29 Mar 2018 by m
Post:
... Do I need to change this at my end, or should it come with the task?

It's configured server side.
As long as your clients don't show any error you may leave it as it is.

OK, thanks.
BTW:
Nice to see that the file has been delivered by your proxy.
:-)

But it still doesn't always work.
34) Message boards : Sixtrack Application : Error downloading sixtrack app for linux (Message 34795)
Posted 29 Mar 2018 by m
Post:
..... It could be that the old Sixtrack URL is in a job template or is used by some clients.

Hosts here are all attached via the new URL

LHC@home 29/03/2018 9:30:59 pm URL https://lhcathome.cern.ch/lhcathome/; Computer ID 10386609; resource share 100
and are using it to get work, but using the old URL

1522291255.026 1367 192.168.xxx.xxx TCP_MEM_HIT/200 7166095 GET http://lhcathomeclassic.cern.ch/sixtrack/download/sixtrack_lin64_4630_sse2.linux - NONE/- -
to download executables. Is this right?
Do I need to change this at my end, or should it come with the task?
35) Message boards : Number crunching : Tips for optimizing BOINC file transfers for LHC@home (Message 34700)
Posted 18 Mar 2018 by m
Post:
Excellent, anything that cuts down the "fake failures" is a good idea - many thanks.
It might also help with the limited number of simultaneous connections provided by the router, presumably limited by the size of the router's RAM. For older routers this can be quite low (and difficult to check).
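
If it's any use to anyone, the client can also be stopped from opening too many transfers at once - a hypothetical cc_config.xml fragment, values just examples:

<cc_config>
  <options>
    <max_file_xfers>4</max_file_xfers>
    <max_file_xfers_per_project>2</max_file_xfers_per_project>
  </options>
</cc_config>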
Thanks again.
36) Message boards : News : Task creation delayed - database maintenance (Message 34420)
Posted 19 Feb 2018 by m
Post:
OK here too, now. Thanks.
37) Message boards : CMS Application : Jobs draining for a WMAgent upgrade (Message 34387)
Posted 15 Feb 2018 by m
Post:
Thanks for the warning, Ivan. It's not easy for those who run different combinations of LHC jobs on different hosts to stop only CMS, so will this result in the BOINC server just not sending out CMS jobs (Laurence's fix)?

I hope so, but I'll remind him just in case.

OK, many thanks.
38) Message boards : CMS Application : Jobs draining for a WMAgent upgrade (Message 34385)
Posted 15 Feb 2018 by m
Post:
Thanks for the warning, Ivan. It's not easy for those who run different combinations of LHC jobs on different hosts to stop only CMS, so will this result in the BOINC server just not sending out CMS jobs (Laurence's fix)?
39) Message boards : News : Task creation delayed - database maintenance (Message 34344)
Posted 9 Feb 2018 by m
Post:
While our validators have been able to re-validate a majority of tasks ...

Hi Nils,

are you sure that process is finished?

The following tasks are still shown as "validation pending".

https://lhcathome.cern.ch/lhcathome/result.php?resultid=176729959
https://lhcathome.cern.ch/lhcathome/result.php?resultid=176732019

https://lhcathome.cern.ch/lhcathome/result.php?resultid=176766341
https://lhcathome.cern.ch/lhcathome/result.php?resultid=176728806

https://lhcathome.cern.ch/lhcathome/result.php?resultid=176732966
https://lhcathome.cern.ch/lhcathome/result.php?resultid=176732292

https://lhcathome.cern.ch/lhcathome/result.php?resultid=176731924
https://lhcathome.cern.ch/lhcathome/result.php?resultid=176732452

https://lhcathome.cern.ch/lhcathome/result.php?resultid=176733077

Querying the database produces "can't find workunit".
40) Message boards : News : Task creation delayed - database maintenance (Message 34329)
Posted 9 Feb 2018 by m
Post:
.... we are back in normal operation since yesterday evening....

Thanks, Nils... but...
I've still got 25 tasks waiting for validation, but the database says "can't find workunit". Have these been lost?

From a previous post: "Sorry, if any running tasks have been removed, this is by accident and we apologize for that."

This represents a bit over 146 hours of running time, and I can't be the only one. I hope that more care will be taken in future to preserve volunteers' work. Not best pleased.

