Very short tasks
Author | Message |
---|---|
Joined: 2 Apr 20 Posts: 5 Credit: 5,571,567 RAC: 0 |
Hi, on one machine I'm observing only very short Theory tasks with the following output:

    2020-05-13 21:28:41 (25540): Guest Log: 21:28:42 CEST +02:00 2020-05-13: cranky: [INFO] Container 'runc' finished with status code 1.
    2020-05-13 21:28:42 (25540): Guest Log: 21:28:42 CEST +02:00 2020-05-13: cranky: [INFO] Preparing output.
    2020-05-13 21:28:42 (25540): Guest Log: [INFO] Job Finished
    2020-05-13 21:28:42 (25540): Guest Log: [INFO] Shutting Down.

These tasks run for under 10 minutes and give about 1.35 points. Example: https://lhcathome.cern.ch/lhcathome/result.php?resultid=273049517

On another machine all Theory tasks run for at least 30 minutes, typically a couple of hours or even days. Is there anything wrong with my configuration? |
Joined: 14 Jan 10 Posts: 1273 Credit: 8,480,147 RAC: 2,155 |
In your example result, the science application running inside the VM failed with exit code 1. That sometimes happens. But in your task list I see a lot of "EXIT_ABORTED_BY_CLIENT" errors with the message:

    Process still present 5 min after writing finish file; aborting

This often indicates an overloaded system. |
Joined: 2 Apr 20 Posts: 5 Credit: 5,571,567 RAC: 0 |
Hi, thanks for your answer.

> In your example result, the science application running inside the VM failed with exit code 1. That sometimes happens.

In my case, this always happens on one machine: in the last 24 hours I finished a couple of hundred tasks with this result.

> This often indicates an overloaded system.

ATLAS and SixTrack tasks run without any problems. Is there anything I can do to 'relieve' the system so it can run Theory tasks? Maybe assign fewer CPU cores to the LHC project? |
Joined: 14 Jan 10 Posts: 1273 Credit: 8,480,147 RAC: 2,155 |
> ATLAS and SixTrack tasks run without any problems. Is there anything I can do to 'relieve' the system so it can run Theory tasks? Maybe assign fewer CPU cores to the LHC project?

You could reduce the number of CPUs in use by BOINC to 87.5% (2 cores free). I suppose you also have a task running on your GPU.

Another way is to use an app_config.xml to limit the number of tasks LHC runs concurrently. This file should be placed in the LHC project folder and re-read via the BOINC Manager option "Read config files".

Example app_config.xml:

    <app_config>
      <project_max_concurrent>8</project_max_concurrent>
      <app>
        <name>ATLAS</name>
        <max_concurrent>2</max_concurrent>
        <fraction_done_exact/>
      </app>
      <app>
        <name>CMS</name>
        <max_concurrent>1</max_concurrent>
        <fraction_done_exact/>
      </app>
      <app>
        <name>sixtrack</name>
        <max_concurrent>16</max_concurrent>
        <fraction_done_exact/>
      </app>
      <app>
        <name>sixtracktest</name>
        <max_concurrent>16</max_concurrent>
        <fraction_done_exact/>
      </app>
      <app>
        <name>Theory</name>
        <max_concurrent>6</max_concurrent>
      </app>
      <app_version>
        <app_name>ATLAS</app_name>
        <plan_class>vbox64_mt_mcore_atlas</plan_class>
        <avg_ncpus>4.000000</avg_ncpus>
        <cmdline>--memory_size_mb 6600</cmdline>
      </app_version>
      <app_version>
        <app_name>CMS</app_name>
        <plan_class>vbox64</plan_class>
        <avg_ncpus>1.000000</avg_ncpus>
        <cmdline>--memory_size_mb 2048 --nthreads 1</cmdline>
      </app_version>
      <app_version>
        <app_name>Theory</app_name>
        <plan_class>vbox64_theory</plan_class>
        <avg_ncpus>1.000000</avg_ncpus>
        <cmdline>--memory_size_mb 730 --nthreads 1</cmdline>
      </app_version>
    </app_config>

|
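As a side note on the 87.5% figure: it matches leaving 2 of 16 cores idle. The 16-core count is an assumption inferred from the percentage, not something stated in the thread, and the helper name below is hypothetical. A minimal sketch of the arithmetic for other core counts:

```python
def cpu_percent_for_free_cores(total_cores: int, free_cores: int) -> float:
    """Return the CPU usage percentage that leaves `free_cores` cores idle."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    if not 0 <= free_cores < total_cores:
        raise ValueError("free_cores must be between 0 and total_cores - 1")
    return 100.0 * (total_cores - free_cores) / total_cores


# Assumed example: a 16-core machine with 2 cores kept free.
print(cpu_percent_for_free_cores(16, 2))  # 87.5
```

The resulting value is what one would enter in the BOINC computing preferences for the share of CPUs to use.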
Joined: 26 Nov 10 Posts: 11 Credit: 1,435,923 RAC: 0 |
Hi Christoph,

The detailed job log indicates that the job failure is due to a network connectivity problem: the machine is not able to download some of the files needed for job execution from the CVMFS network file system. I am not sure what the root cause is, but it could be a weak network connection or the firewall configuration.

FYI:
- the performance statistics for this machine: http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=608791#10646229
- the relevant part of the detailed log (I/O error):

    $ cat pool/failed/2390/2390-1140084-3.tgz.log
    ===> [runRivet] Wed May 13 19:25:52 UTC 2020 [boinc pp mb-nsd 2360 - - pythia6 6.428 392 100000 3]
    ...
    Building rivetvm ...
    make: Entering directory `/shared/rivetvm'
    ...
    /cvmfs/sft.cern.ch/lcg/releases/fjcontrib/1.041-66c72/x86_64-slc6-gcc8-opt/lib/libfastjetcontribfragile.so: file not recognized: Input/output error
    collect2: error: ld returned 1 exit status
    make: *** [rivetvm.exe] Error 1
    make: Leaving directory `/shared/rivetvm'
    ERROR: fail to compile rivetvm

|
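The kind of failure shown in this log can be spotted automatically. The following is only a rough sketch: it assumes the log is available as plain text and that the error lines look like the linker message quoted above. The function name and the regular expression are illustrative and not part of any LHC@home tooling.

```python
import re
from typing import List

# Matches lines such as
# "/cvmfs/<repo>/<path>: file not recognized: Input/output error"
# and captures the /cvmfs path before the first colon.
CVMFS_IO_ERROR = re.compile(r"(/cvmfs/\S+?):.*Input/output error")


def find_cvmfs_io_errors(log_text: str) -> List[str]:
    """Return the /cvmfs paths found on lines that report an I/O error."""
    paths = []
    for line in log_text.splitlines():
        match = CVMFS_IO_ERROR.search(line)
        if match:
            paths.append(match.group(1))
    return paths


sample = """make: Entering directory `/shared/rivetvm'
/cvmfs/sft.cern.ch/lcg/releases/fjcontrib/1.041-66c72/x86_64-slc6-gcc8-opt/lib/libfastjetcontribfragile.so: file not recognized: Input/output error
collect2: error: ld returned 1 exit status
"""
print(find_cvmfs_io_errors(sample))
```

A non-empty result suggests that files under /cvmfs could not be read, which fits the network explanation given in the post above.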
Joined: 14 Jan 10 Posts: 1273 Credit: 8,480,147 RAC: 2,155 |
Great, Anton, that you're looking into it in detail. Strangely, "Probing /cvmfs/sft.cern.ch" is OK during VM startup. |
Joined: 2 Apr 20 Posts: 5 Credit: 5,571,567 RAC: 0 |
Hi Anton, thanks for your answer. A network problem is possible (I will check as soon as I can). In the meantime, is there a way to deactivate Theory tasks on this machine only? Or do I have to deactivate the Theory application in the LHC@home settings for all machines? Setting max_concurrent to 0 in app_config.xml does not seem to work:

    <app>
      <name>Theory</name>
      <max_concurrent>0</max_concurrent>
    </app>

|
Joined: 24 Oct 04 Posts: 1116 Credit: 49,722,983 RAC: 14,167 |
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project

Christoph, you can put one of your PCs in a separate group (work, home, school) and then set that group to not get Theory tasks.

As for your connection, if you want to take a quick look you can watch a new task start in the VM Console. You will see it happen in the first 3 minutes of running, where it tries to reach *runRivet*. It will just end up like this if you have a slow connection: [screenshot not preserved]. Instead of what you want, like this: [screenshot not preserved].

VB tasks here usually need a download/upload speed of 1.5 Mbps or better, or they will fail most of the time. |
Joined: 15 Jun 08 Posts: 2401 Credit: 225,472,734 RAC: 123,765 |
Some ideas to check.

It might be unlikely, but if the Theory vdi file is corrupt for some reason, shut down the BOINC client when the running work allows it, then remove the file. It will be downloaded again automatically when you restart BOINC.

Is the computer connected via Wi-Fi? That is not recommended, as lots of data has to be transferred regularly.

What about the connection to your ISP? Download bandwidth? Upload bandwidth? Typical latency?

It also might be helpful if you make your computers visible to others: https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project This avoids going via mcplots. |
Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0 |
The value in <max_concurrent>0</max_concurrent> needs to be 1 or higher. 0 is not accepted as a value, so that line is ignored. |
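A small check along these lines can flag such entries before the file is handed to BOINC. This is a sketch only: it takes the statement above (values below 1 are ignored) at face value and has not been checked against the BOINC documentation. The function name is hypothetical, and it looks only at per-app `<max_concurrent>` entries.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def invalid_max_concurrent(app_config_xml: str) -> List[Tuple[str, int]]:
    """Return (app name, value) pairs whose <max_concurrent> is below 1."""
    root = ET.fromstring(app_config_xml)
    problems = []
    for app in root.findall("app"):
        name = (app.findtext("name") or "?").strip()
        value = app.findtext("max_concurrent")
        if value is not None and int(value) < 1:
            problems.append((name, int(value)))
    return problems


example = """<app_config>
  <app><name>Theory</name><max_concurrent>0</max_concurrent></app>
  <app><name>ATLAS</name><max_concurrent>2</max_concurrent></app>
</app_config>"""
print(invalid_max_concurrent(example))  # [('Theory', 0)]
```

To stop Theory tasks on a single machine, the approach that worked in this thread was a separate preference group on the project website rather than app_config.xml.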
Joined: 2 Apr 20 Posts: 5 Credit: 5,571,567 RAC: 0 |
I put the machine in a separate group and deactivated the Theory application for this group. Indeed, the network connection is a weak point of this PC, and it seems to be the source of the failures. Thanks to all for the input :) |
Joined: 24 Oct 04 Posts: 1116 Credit: 49,722,983 RAC: 14,167 |
> I put the machine in a separate group and deactivated the Theory application for this group.

You're welcome, Christoph.

Yes, you will find that a slow internet connection can be the biggest problem with VB tasks. That can happen with any kind of ISP, especially if they throttle your speed after you use a certain amount of data under your contract. (I have used DSL and now satellite, running VB tasks for over 9 years.) In my case they throttle my speed once my monthly allowance is used, and since I run VB tasks mine is used up in the first 6 hours. So I have to watch closely, do *speed tests*, and check my Windows 10 Task Manager to see how fast my connection is before I try to start any. Even when I have full speed (up to 30 Mbps) I have to be careful about starting lots of these VB tasks, and I start a few at a time.

Another thing you can do is keep the VM Console open right after you start a new task, so you can catch any of the typical FAIL warnings (there is one you can just ignore). On page 2 of the VM Console you will see a timer that gives the task 1 min 40 secs to finish that part before it goes to page 3 and checks CVMFS (the file system). If it doesn't finish within the 1 min 40 secs, the next page is where you will see that it failed.

Another problem is that these tasks can fail to start and never get to runRivet but keep running for hours; they will just end up as Invalid/computer error tasks. If you catch them when they start, you can abort them and not waste hours of your time. Here is a snapshot of that part: [screenshot not preserved]

I have this goofy *Bonus time* on my ISP account between 2am and 8am, so I had to get up early just to post this and start some of the VB tasks I had waiting to run after 8am. It is now 6:50am PDT where I am.

Good luck |
©2024 CERN