21) Message boards : ATLAS application : ATLAS native app (Message 31927)
Posted 11 Aug 2017 by Juha
Post:
The client has no knowledge of the app version until it has received at least one task using that app version.
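(For what it's worth, you can see this in client_state.xml: an <app_version> block like the sketch below only appears there after the first task using that version has been downloaded. The version number and plan class here are placeholders, not the real ones.)

<app_version>
  <app_name>ATLAS</app_name>
  <version_num>100</version_num>
  <platform>x86_64-pc-linux-gnu</platform>
  <plan_class>native_mt</plan_class>
</app_version>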
22) Message boards : News : New ATLAS app version released for Linux hosts (Message 31925)
Posted 11 Aug 2017 by Juha
Post:
What kind of hardware requirements do these tasks have (RAM, multi-core)?

Right now all I get is:

No tasks are available for ATLAS Simulation
Message from server: VirtualBox is not installed
23) Message boards : Sixtrack Application : "SixTrack Tasks NOT being distributed (Message 31810)
Posted 4 Aug 2017 by Juha
Post:
A valid task is one that the host computed successfully and returned, and that then passed validation.

A completed task is a valid task that is not a runtime outlier.

So with SixTrack it's quite normal for the completed count to be lower than the valid count. For example, if 120 of your tasks validated and 20 of them were short enough to be flagged as runtime outliers, you would show 120 valid but only 100 completed.

I wouldn't mind a better term, but I can't come up with anything that explains the difference and is still short.
24) Message boards : Sixtrack Application : Inconclusive, valid/invalid results (Message 31120)
Posted 26 Jun 2017 by Juha
Post:
Appreciate all your efforts in trying to solve this. Sorry, I'm a bit of a layman when it comes to the technical talk regarding all this... should I stop crunching SixTrack work?
I have 175 invalid WUs. Will I eventually get the credit, or am I wasting time, power etc. when I could crunch another project until this is resolved?


You don't have any invalid tasks, only valid ones and ones that are currently inconclusive. The admins can easily trigger a revalidation of all tasks as soon as they have the old validator in place.

I am sure you will get the credit, but a bit of patience is needed for the moment.
25) Message boards : Sixtrack Application : exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G) (Message 30985)
Posted 23 Jun 2017 by Juha
Post:
@Eric

If you need help I could take a look at the validator. I'm of no use with the science but I'm good at reading code and finding bugs.
26) Message boards : Sixtrack Application : exceeded elapsed time limit 30940.80 (180000000.00G/5817.56G) (Message 30357)
Posted 15 May 2017 by Juha
Post:
The average processing rate for the x86_64 sse2 version on your host is a hundred or so times larger than it should be.

You have had a couple of hundred short-running tasks. BOINC expects a task's runtime to be proportional to its FLOPS estimate, so short-running tasks like yours can make BOINC think your computer is really super fast.

Projects with tasks like these are supposed to code their validators so that unusual tasks are marked as runtime outliers. The SixTrack validator seems to have that code (some app versions for my host show "Consecutive valid tasks" higher than "Number of tasks completed"), but I think there may be a bug in it: some short-running tasks are not marked as runtime outliers and are allowed to influence the runtime estimates.

You can help yourself out of this situation by increasing the <rsc_fpops_bound> of your SixTrack tasks to 1000 times its current value, or possibly even more. Before you edit client_state.xml you must shut down the BOINC client and make sure BOINC Manager or your OS doesn't automatically restart it until you are done with the edits.
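The entries to look for are the <workunit> blocks in client_state.xml. Roughly like this; the workunit name below is a placeholder (keep the real name as it is) and the number is only illustrative, taken from the 180000000.00G limit in this thread's title:

<workunit>
  <name>a_sixtrack_workunit_name</name>
  <app_name>sixtrack</app_name>
  ...
  <rsc_fpops_bound>180000000000000000.000000</rsc_fpops_bound>   <!-- about 1.8e17; append three zeros to make it 1000 times larger -->
</workunit>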
27) Message boards : CMS Application : CMS Tasks Failing (Message 30313)
Posted 12 May 2017 by Juha
Post:
There doesn't seem to be a decrease in the number of LHCb pilots on the server I can monitor -- slightly the opposite, actually. The number of idle (i.e. queued) pilots seems to be falling slightly.


From what I can tell from the logs, the server has no trouble handing out jobs. The problem is rather that the jobs get stuck before they even start.

05/12/17 21:59:41 (pid:4088) Job 3099277.20 set to execute immediately
05/12/17 21:59:41 (pid:4088) Starting a VANILLA universe job with ID: 3099277.20
05/12/17 21:59:41 (pid:4088) IWD: /var/lib/condor/execute/dir_4088
05/12/17 21:59:41 (pid:4088) Renice expr "10" evaluated to 10
05/12/17 21:59:41 (pid:4088) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_4088/condor_exec.exe 309927720
05/12/17 21:59:41 (pid:4088) Running job as user nobody
05/12/17 21:59:41 (pid:4088) Create_Process succeeded, pid=4092


There's nothing in the running.log.

The process tree starting from pid 4092 above looks like this:

inner-wrapper
  job-wrapper
    sleep
  condor_exec.exe
    wget


The wget there is a bit surprising, considering there is hardly any network traffic going on. Is some server maybe having problems? netstat shows two connections (one from each of the two VMs) to lbvobox33.cern.ch stuck in SYN_SENT state. That server isn't responding to a web browser either.
28) Message boards : Theory Application : Theory's endless looping (Message 30294)
Posted 11 May 2017 by Juha
Post:
Looks like I have one of these.

Condor JobID:  3087024.0
MCPlots JobID: 36498619


===> [runRivet] Thu May 11 11:54:56 EEST 2017 [boinc ee zhad 200 - - sherpa 1.4.5 default 59000 890]


2.71157 pb +- ( 0.0134118 pb = 0.494614 % ) 310000 ( 365433 -> 84.9 % )
integration time:  ( 2m 14s(2m 3s) elapsed / 0s(0s) left )   
2_4__e-__e+__j__j__j__j : 2.71157 pb +- ( 0.0134118 pb = 0.494614 % )  exp. eff: 0.375051 %
  reduce max for 2_4__e-__e+__j__j__j__j to 0.768368 ( eps = 0.001 ) 
Process_Group::CalculateTotalXSec(): Calculate xs for '2_5__e-__e+__j__j__j__j__j' (Comix)
Starting the calculation. Lean back and enjoy ... .
  Exception_Handler::GenerateStackTrace(..): Generating stack trace 
  {
  }
  
  Exception_Handler::SignalHandler: Signal (6) caught. 
     Cannot continue.
  Exception_Handler::GenerateStackTrace(..): Generating stack trace 
  {
  }
  Exception_Handler::GenerateStackTrace(..): Generating stack trace 
  {
  }


Repeated multiple times.

Updating display...
Display update finished (0 histograms, 0 events).


Repeated once per minute or so.

The task isn't using any CPU any more. Guess I'll just reset the VM.
29) Message boards : LHCb Application : Incorrect rsc_memory_bound (Message 30140)
Posted 30 Apr 2017 by Juha
Post:
Normally the BOINC client tries to measure a task's memory usage by itself. But VirtualBox allocates the memory for VMs in such a way that the client can't see it, so for VM tasks the client uses rsc_memory_bound as the task's memory usage instead.

Right now LHCb tasks come with rsc_memory_bound = 477 MB but memory_size_mb = 2048 MB. A mismatch this large can cause serious problems, up to crashing the host: the client may start so many tasks that they fill the host's RAM entirely.
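For comparison, this is roughly how I'd expect the two settings to line up (a sketch only; I'm assuming the usual vboxwrapper file names and ignoring any extra margin the project may want on top):

In the vbox job description (vbox_job.xml):
  <memory_size_mb>2048</memory_size_mb>

In the workunit template (rsc_memory_bound is in bytes and should cover the whole VM):
  <rsc_memory_bound>2147483648</rsc_memory_bound>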

Please fix ASAP.
30) Message boards : Theory Application : Lost connection to shadow (Message 29892)
Posted 10 Apr 2017 by Juha
Post:
I have one task that was ostensibly running but not using any CPU time. Looking at the logs, I see this in StarterLog:

04/10/17 07:54:52 (pid:4070) About to exec Post script: /var/lib/condor/execute/dir_4070/tarOutput.sh 2016-563074-880
04/10/17 07:54:52 (pid:4070) Create_Process succeeded, pid=4974
04/10/17 07:54:52 (pid:4070) Process exited, pid=4974, status=0
04/10/17 07:54:53 (pid:4070) Connection to shadow may be lost, will test by sending whoami request.
04/10/17 07:54:53 (pid:4070) condor_write(): Socket closed when trying to write 37 bytes to <188.184.94.254:9618>, fd is 8
04/10/17 07:54:53 (pid:4070) Buf::write(): condor_write() failed
04/10/17 07:54:53 (pid:4070) i/o error result is 0, errno is 0
04/10/17 07:54:53 (pid:4070) Lost connection to shadow, waiting 86300 secs for reconnect
04/10/17 07:55:41 (pid:4070) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#6718020


And in StartLog:

04/10/17 06:51:10 Changing activity: Idle -> Busy
04/10/17 07:54:28 CCBListener: no activity from CCB server in 3798s; assuming connection is dead.
04/10/17 07:54:28 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
04/10/17 07:54:40 PERMISSION DENIED to condor@486149-10452516-20901 from host 10.0.2.15 for command 448 (GIVE_STATE), access level READ: reason: READ authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15
04/10/17 07:54:40 DC_AUTHENTICATE: Command not authorized, done!
04/10/17 07:55:29 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#6718006


BOINC had pre-empted the task for an hour during that time:

2017-04-10 06:51:12 (11184): Guest Log: [INFO] New Job Starting in slot1
2017-04-10 06:51:12 (11184): Guest Log: [INFO] Condor JobID:  2567929.0 in slot1
2017-04-10 06:51:22 (11184): Guest Log: [INFO] MCPlots JobID: 36306186 in slot1
2017-04-10 06:53:42 (11184): VM state change detected. (old = 'running', new = 'paused')
2017-04-10 07:54:16 (11184): VM state change detected. (old = 'paused', new = 'running')
2017-04-10 07:54:56 (11184): Guest Log: [INFO] Job finished in slot1 with 0.


I don't know whether the task being pre-empted is significant. I've had a similar problem at least once before, where the problem connecting to the shadow seemed to appear shortly after the task was resumed; then again, I have had lots of tasks pre-empted and resumed without any problems.

Anyway, the real killer seems to be the 86300-second reconnect wait. That's a bit much in a BOINC environment. I'd have it try to reconnect once a minute for a while; the host may have been powered down or suspended and might not have the network available again just yet. If it still can't reconnect after, say, ten minutes, just give up and power down the VM.
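If I read the HTCondor documentation right, that wait comes from the remaining job lease, so the knob may simply be the lease duration in the submit description. A sketch only -- the ten-minute figure is just my suggestion above, and I haven't tested how the rest of the setup copes with a shorter lease:

# HTCondor submit description (sketch):
# cap how long the starter waits for the shadow to come back before giving up
job_lease_duration = 600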

