Message boards : Theory Application : Very short tasks

Christoph
Joined: 2 Apr 20
Posts: 5
Credit: 5,571,567
RAC: 0
Message 42460 - Posted: 13 May 2020, 20:03:39 UTC

Hi,
On one machine I'm seeing only very short Theory tasks, with the following output:

2020-05-13 21:28:41 (25540): Guest Log: 21:28:42 CEST +02:00 2020-05-13: cranky: [INFO] Container 'runc' finished with status code 1.
2020-05-13 21:28:42 (25540): Guest Log: 21:28:42 CEST +02:00 2020-05-13: cranky: [INFO] Preparing output.
2020-05-13 21:28:42 (25540): Guest Log: [INFO] Job Finished
2020-05-13 21:28:42 (25540): Guest Log: [INFO] Shutting Down.

These tasks run for under 10 minutes and give about 1.35 points.

Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=273049517

On another machine, all Theory tasks run for at least 30 minutes, typically a couple of hours or even days.

Is there anything wrong with my configuration?
ID: 42460
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1273
Credit: 8,480,147
RAC: 2,155
Message 42468 - Posted: 14 May 2020, 7:28:32 UTC - in response to Message 42460.  

In your example result, the science application running inside the VM fails with exit code 1. That sometimes happens.

But in your task list I see a lot of "EXIT_ABORTED_BY_CLIENT" errors with the message:
Process still present 5 min after writing finish file; aborting
This often indicates an overloaded system.
ID: 42468
Christoph
Joined: 2 Apr 20
Posts: 5
Credit: 5,571,567
RAC: 0
Message 42469 - Posted: 14 May 2020, 8:05:08 UTC - in response to Message 42468.  

Hi, thanks for your answer.

Crystal Pellet wrote: "In your example result, the science application running inside the VM fails with exit code 1. That sometimes happens."

In my case this always happens on one machine: in the last 24 hours I finished a couple of hundred tasks with this result.

Crystal Pellet wrote: "This often indicates an overloaded system."

ATLAS and SixTrack tasks run without any problems. Is there anything I can do to 'relieve' the system so it can run Theory tasks? Maybe assign fewer CPU cores to the LHC project?
ID: 42469
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1273
Credit: 8,480,147
RAC: 2,155
Message 42470 - Posted: 14 May 2020, 8:45:14 UTC - in response to Message 42469.  

Christoph wrote: "ATLAS and SixTrack tasks run without any problems. Is there anything I can do to 'relieve' the system so it can run Theory tasks? Maybe assign fewer CPU cores to the LHC project?"

You could reduce the number of CPUs BOINC uses to 87.5% (2 cores free).
I suppose you also have a task running on your GPU.
Another way is to use an app_config.xml to limit the number of tasks LHC runs at once. Place this file in the LHC project folder and have BOINC re-read it via the BOINC Manager option "Read config files".

Example app_config.xml:
<app_config>
 <project_max_concurrent>8</project_max_concurrent>
 <app>
  <name>ATLAS</name>
  <max_concurrent>2</max_concurrent>
  <fraction_done_exact/>
 </app>
 <app>
  <name>CMS</name>
  <max_concurrent>1</max_concurrent>
  <fraction_done_exact/>
 </app>
 <app>
  <name>sixtrack</name>
  <max_concurrent>16</max_concurrent>
  <fraction_done_exact/>
 </app>
 <app>
  <name>sixtracktest</name>
  <max_concurrent>16</max_concurrent>
  <fraction_done_exact/>
 </app>
 <app>
  <name>Theory</name>
  <max_concurrent>6</max_concurrent>
 </app>
 <app_version>
  <app_name>ATLAS</app_name>
  <plan_class>vbox64_mt_mcore_atlas</plan_class>
  <avg_ncpus>4.000000</avg_ncpus>
  <cmdline>--memory_size_mb 6600</cmdline>
 </app_version>
 <app_version>
  <app_name>CMS</app_name>
  <plan_class>vbox64</plan_class>
  <avg_ncpus>1.000000</avg_ncpus>
  <cmdline>--memory_size_mb 2048 --nthreads 1</cmdline>
 </app_version>
 <app_version>
  <app_name>Theory</app_name>
  <plan_class>vbox64_theory</plan_class>
  <avg_ncpus>1.000000</avg_ncpus>
  <cmdline>--memory_size_mb 730 --nthreads 1</cmdline>
 </app_version>
</app_config>
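
The 87.5% figure mentioned above can be checked with a little arithmetic. This is only a sketch: the 16-thread host is an assumption inferred from "87.5% (2 cores free)", and the function name is made up for illustration.

```python
def cpu_percent_leaving_free(total_cores: int, free_cores: int) -> float:
    """Return the CPU percentage to allow BOINC so that free_cores stay idle."""
    if total_cores < 1 or not 0 <= free_cores < total_cores:
        raise ValueError("need total_cores >= 1 and 0 <= free_cores < total_cores")
    return 100.0 * (total_cores - free_cores) / total_cores


# Assumed host with 16 threads: keeping 2 free means 14 of 16 are used.
print(cpu_percent_leaving_free(16, 2))
```

On a host with a different core count, the percentage needed to keep 2 cores free changes accordingly (for example, 75% on an 8-thread machine).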
ID: 42470
Anton
Joined: 26 Nov 10
Posts: 11
Credit: 1,435,923
RAC: 0
Message 42471 - Posted: 14 May 2020, 8:54:03 UTC

Hi Christoph,
The detailed job log indicates that the job failure is due to a network connectivity problem: the machine is not able to download some of the files needed for job execution from the CVMFS network file system. I am not sure what the root cause is, but it could be a weak network connection or the firewall configuration.

FYI:
- the performance statistics for this machine:
http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=608791#10646229

- relevant part of the detailed log (IO error):
$ cat pool/failed/2390/2390-1140084-3.tgz.log
===> [runRivet] Wed May 13 19:25:52 UTC 2020 [boinc pp mb-nsd 2360 - - pythia6 6.428 392 100000 3]
...
Building rivetvm ...
make: Entering directory `/shared/rivetvm'
...
/cvmfs/sft.cern.ch/lcg/releases/fjcontrib/1.041-66c72/x86_64-slc6-gcc8-opt/lib/libfastjetcontribfragile.so: file not recognized: Input/output error
collect2: error: ld returned 1 exit status
make: *** [rivetvm.exe] Error 1
make: Leaving directory `/shared/rivetvm'
ERROR: fail to compile rivetvm
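
For anyone who has such a job log at hand, lines like the one above can be picked out automatically. The following is a minimal sketch, assuming the log is available as plain text; the function name and the regular expression are illustrative and not part of any LHC@home tool.

```python
import re
from typing import List

# Matches lines of the form
#   /cvmfs/<repository>/<path>: <message>: Input/output error
CVMFS_IO_ERROR = re.compile(r"^(/cvmfs/\S+?):.*Input/output error\s*$", re.MULTILINE)


def find_cvmfs_io_errors(log_text: str) -> List[str]:
    """Return the /cvmfs paths that were reported with an I/O error."""
    return [match.group(1) for match in CVMFS_IO_ERROR.finditer(log_text)]


sample = (
    "Building rivetvm ...\n"
    "/cvmfs/sft.cern.ch/lcg/releases/fjcontrib/1.041-66c72/x86_64-slc6-gcc8-opt/"
    "lib/libfastjetcontribfragile.so: file not recognized: Input/output error\n"
    "collect2: error: ld returned 1 exit status\n"
)
for path in find_cvmfs_io_errors(sample):
    print(path)
```

Repeated hits on /cvmfs paths would point towards the network or the CVMFS cache rather than the science application itself.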
ID: 42471
Crystal Pellet
Volunteer moderator
Volunteer tester
Joined: 14 Jan 10
Posts: 1273
Credit: 8,480,147
RAC: 2,155
Message 42472 - Posted: 14 May 2020, 9:08:14 UTC - in response to Message 42471.  

Great that you're looking into it in detail, Anton.

What's strange is that during VM startup "Probing /cvmfs/sft.cern.ch" is OK.
ID: 42472
Christoph
Joined: 2 Apr 20
Posts: 5
Credit: 5,571,567
RAC: 0
Message 42473 - Posted: 14 May 2020, 9:16:03 UTC - in response to Message 42471.  
Last modified: 14 May 2020, 9:18:45 UTC

Hi Anton,
thanks for your answer. A network problem could be possible (I will check as soon as I can).

In the meantime, is it possible to deactivate Theory tasks on this machine only? Or do I have to deactivate the Theory application in the LHC@home settings for all machines?

Setting max_concurrent to 0 in the app_config.xml does not seem to work:

 <app>
  <name>Theory</name>
  <max_concurrent>0</max_concurrent>
 </app>
ID: 42473
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1116
Credit: 49,722,983
RAC: 14,167
Message 42474 - Posted: 14 May 2020, 9:36:16 UTC - in response to Message 42473.  

https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
Christoph, you can put one of your PCs in a separate group (work, home, school) and then set that group to not get Theory tasks.

As for your connection, if you want to take a quick look you can watch a new task start in the VM Console. You will see it in the first 3 minutes of running, when the task tries to reach *runRivet*.

With a slow connection it will just end up like this:
[screenshot not preserved]

Instead of what you want, which looks like this:
[screenshot not preserved]

VirtualBox (VB) tasks here usually need download and upload speeds of 1.5 Mbps or better; otherwise they will fail most of the time.
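
To compare a measured transfer against that rule of thumb, the conversion from bytes and seconds to Mbps can be sketched as follows. The 1.5 Mbps threshold is the figure quoted in this post, not an official project requirement, and the function names are made up for illustration.

```python
MIN_MBPS = 1.5  # rule-of-thumb threshold quoted in this thread


def throughput_mbps(num_bytes: int, seconds: float) -> float:
    """Convert a transfer of num_bytes in the given time to megabits per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_bytes * 8 / 1_000_000 / seconds


def fast_enough(num_bytes: int, seconds: float, minimum: float = MIN_MBPS) -> bool:
    """Return True if the measured throughput reaches the minimum."""
    return throughput_mbps(num_bytes, seconds) >= minimum


# 1,000,000 bytes in 4 s is 2.0 Mbps; the same amount in 8 s is 1.0 Mbps.
print(fast_enough(1_000_000, 4.0))
print(fast_enough(1_000_000, 8.0))
```

The byte count and duration would come from a speed test or a timed download; the sketch itself does not touch the network.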
ID: 42474
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2401
Credit: 225,472,734
RAC: 123,765
Message 42475 - Posted: 14 May 2020, 9:41:19 UTC - in response to Message 42473.  

Some ideas to be checked.

It is unlikely, but the Theory vdi file may be corrupt for some reason. If so, shut down the BOINC client as soon as the running work allows.
Then remove the file. It will be downloaded again automatically when you restart BOINC.


Is the computer connected via Wi-Fi?
-> not recommended, as lots of data has to be transferred regularly.


What about the connection to your ISP?
Download bandwidth?
Upload bandwidth?
Typical Latency?


It might also be helpful to make your computers visible to others:
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
This avoids going via mcplots.
ID: 42475
Greger
Joined: 9 Jan 15
Posts: 151
Credit: 431,596,822
RAC: 0
Message 42476 - Posted: 14 May 2020, 10:00:38 UTC - in response to Message 42473.  


Christoph wrote: "Setting the max concurrent to 0 in the app_config.xml does not seem to work."

 <app>
  <name>Theory</name>
  <max_concurrent>0</max_concurrent>
 </app>


The value in <max_concurrent> needs to be 1 or higher. 0 is not accepted as a limit, so the client ignores that line.
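
A quick way to spot such entries before asking BOINC to re-read the file is to parse the XML yourself. This is a minimal sketch that relies on the statement above (values below 1 have no limiting effect); the function name is hypothetical, and the check is not part of BOINC.

```python
import xml.etree.ElementTree as ET
from typing import List


def apps_with_ineffective_limit(xml_text: str) -> List[str]:
    """Return names of <app> entries whose <max_concurrent> is below 1."""
    root = ET.fromstring(xml_text)
    flagged = []
    for app in root.findall("app"):
        name = app.findtext("name", default="(unnamed)")
        value = app.findtext("max_concurrent")
        if value is not None and int(value.strip()) < 1:
            flagged.append(name)
    return flagged


config = """
<app_config>
 <app>
  <name>ATLAS</name>
  <max_concurrent>2</max_concurrent>
 </app>
 <app>
  <name>Theory</name>
  <max_concurrent>0</max_concurrent>
 </app>
</app_config>
"""
print(apps_with_ineffective_limit(config))
```

To stop a single machine from receiving Theory tasks at all, the project preferences route described elsewhere in this thread (a separate group with Theory deselected) is the one that worked here.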
ID: 42476
Christoph
Joined: 2 Apr 20
Posts: 5
Credit: 5,571,567
RAC: 0
Message 42477 - Posted: 14 May 2020, 11:12:22 UTC - in response to Message 42474.  

I put the machine in a separate group and deactivated the Theory application for this group.

Indeed, the network connection is a weak point of this PC; it seems to be the source of the failures.

Thanks to all for the input :)
ID: 42477
Magic Quantum Mechanic
Joined: 24 Oct 04
Posts: 1116
Credit: 49,722,983
RAC: 14,167
Message 42482 - Posted: 14 May 2020, 13:51:33 UTC - in response to Message 42477.  

Christoph wrote:
"I put the machine in a separate group and deactivated the Theory application for this group.

Indeed, the network connection is a weak point of this PC; it seems to be the source of the failures.

Thanks to all for the input :)"


You're welcome, Christoph.

Yes, you will find that a slow internet connection can be the biggest problem with VB tasks. That can happen with any kind of ISP, especially if they throttle your speed after you use a certain amount of data under your contract. (I have run VB tasks over DSL, and now satellite, for over 9 years.)

In my case, they throttle my speed once my monthly allowance is used up, and since I run VB tasks, mine is used up in the first 6 hours. So I have to watch closely, run *speed tests*, and check the Windows 10 Task Manager to see how fast my connection is before I try to start any. Even at full speed (up to 30 Mbps), I have to be careful about starting lots of these VB tasks at once, and I start just a few at a time.

Another thing you can do is keep the VM Console open right after you start a new task, so you can catch any of the typical FAIL warnings (there is one you can just ignore).

On page 2 of the VM Console you will see a timer running: it gives the task 1 min 40 s to finish that part before it moves on to page 3 and checks CVMFS (the file system). If that doesn't complete within the 1 min 40 s, the next page is where you will see that it failed.

Another problem is that these tasks can fail to start and never reach runRivet, yet keep running for hours; they just end up as Invalid/computer error tasks. If you catch them when they start, you can abort them and not waste hours of your time.

Here is a snapshot of that part:
[screenshot not preserved]


I have this goofy *Bonus time* on my ISP account between 2am and 8am, so I had to get up early just to post this and start up some of the VB tasks I had planned to run later, after 8am. It is now 6:50am PDT where I am.

Good luck
ID: 42482


©2024 CERN