Very short tasks

Author	Message
Crystal Pellet (Volunteer moderator, Volunteer tester) Joined: 14 Jan 10 Posts: 1450 Credit: 9,747,300 RAC: 593	Message 42468 - Posted: 14 May 2020, 7:28:32 UTC - in response to Message 42460.
In your example result, it is the science application running inside the VM that fails with exit code 1. That sometimes happens.
But in your task list I see a lot of "EXIT_ABORTED_BY_CLIENT" errors with the message
Process still present 5 min after writing finish file; aborting
This often indicates an overloaded system.

Crystal Pellet (Volunteer moderator, Volunteer tester) Joined: 14 Jan 10 Posts: 1450 Credit: 9,747,300 RAC: 593	Message 42470 - Posted: 14 May 2020, 8:45:14 UTC - in response to Message 42469.
Quoting Message 42469: "ATLAS and SixTrack tasks run without any problems. Is there anything I can do to 'relieve' the system to run Theory tasks? Maybe assign fewer CPU cores to the LHC project?"
You could reduce the number of CPUs in use by BOINC to 87.5% (2 cores free). I suppose you also have a task running on your GPU.
Another way is to use an app_config.xml to limit the number of LHC tasks running at the same time. This file should be placed in the lhc project folder and reloaded with BOINC Manager's option: Read config files.
Example app_config.xml

<app_config>
  <project_max_concurrent>8</project_max_concurrent>
  <app>
    <name>ATLAS</name>
    <max_concurrent>2</max_concurrent>
    <fraction_done_exact/>
  </app>
  <app>
    <name>CMS</name>
    <max_concurrent>1</max_concurrent>
    <fraction_done_exact/>
  </app>
  <app>
    <name>sixtrack</name>
    <max_concurrent>16</max_concurrent>
    <fraction_done_exact/>
  </app>
  <app>
    <name>sixtracktest</name>
    <max_concurrent>16</max_concurrent>
    <fraction_done_exact/>
  </app>
  <app>
    <name>Theory</name>
    <max_concurrent>6</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <avg_ncpus>4.000000</avg_ncpus>
    <cmdline>--memory_size_mb 6600</cmdline>
  </app_version>
  <app_version>
    <app_name>CMS</app_name>
    <plan_class>vbox64</plan_class>
    <avg_ncpus>1.000000</avg_ncpus>
    <cmdline>--memory_size_mb 2048 --nthreads 1</cmdline>
  </app_version>
  <app_version>
    <app_name>Theory</app_name>
    <plan_class>vbox64_theory</plan_class>
    <avg_ncpus>1.000000</avg_ncpus>
    <cmdline>--memory_size_mb 730 --nthreads 1</cmdline>
  </app_version>
</app_config>
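The 87.5% figure in the post above corresponds to leaving 2 of 16 cores free; the core count of 16 is inferred from that percentage and is not stated in the thread. A minimal sketch of the arithmetic, with the helper name `cpu_percentage` being purely illustrative:

```python
def cpu_percentage(total_cores: int, free_cores: int) -> float:
    """Return the 'use at most N% of the CPUs' value that leaves
    `free_cores` cores unused on a machine with `total_cores` cores."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    if not 0 <= free_cores < total_cores:
        raise ValueError("free_cores must be between 0 and total_cores - 1")
    return 100.0 * (total_cores - free_cores) / total_cores


if __name__ == "__main__":
    # Assumption: a 16-core machine, which is what 87.5% with 2 cores free implies.
    print(cpu_percentage(16, 2))
```

The resulting value would go into the BOINC computing preference that limits the percentage of CPUs used.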

Anton Joined: 26 Nov 10 Posts: 11 Credit: 1,435,923 RAC: 0	Message 42471 - Posted: 14 May 2020, 8:54:03 UTC
Hi Christoph,
The detailed job log indicates that the job failure is due to a network connectivity problem: the machine is not able to download some of the files needed for job execution from the CVMFS network file system. I am not sure what the root cause is, but it could be a weak network connection or the firewall configuration.
FYI:
- the performance statistics for this machine: http://mcplots-dev.cern.ch/production.php?view=user&system=3&userid=608791#10646229
- the relevant part of the detailed log (I/O error):

$ cat pool/failed/2390/2390-1140084-3.tgz.log
===> [runRivet] Wed May 13 19:25:52 UTC 2020 [boinc pp mb-nsd 2360 - - pythia6 6.428 392 100000 3]
...
Building rivetvm ...
make: Entering directory `/shared/rivetvm'
...
/cvmfs/sft.cern.ch/lcg/releases/fjcontrib/1.041-66c72/x86_64-slc6-gcc8-opt/lib/libfastjetcontribfragile.so: file not recognized: Input/output error
collect2: error: ld returned 1 exit status
make: *** [rivetvm.exe] Error 1
make: Leaving directory `/shared/rivetvm'
ERROR: fail to compile rivetvm
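The kind of failure shown in the log excerpt above can be spotted mechanically. The following is only a sketch: it assumes the log is available as plain text and that the error lines have the same shape as the one quoted in the post. The function name `find_cvmfs_io_errors` and the regular expression are illustrative and not part of any LHC@home tool.

```python
import re

# Matches lines such as
#   /cvmfs/<repo>/<path>: file not recognized: Input/output error
# and captures the CVMFS path that could not be read.
CVMFS_IO_ERROR = re.compile(r"^(?P<path>/cvmfs/\S+?):.*Input/output error")

# The relevant log lines as quoted in the post above.
SAMPLE_LOG = """\
Building rivetvm ...
/cvmfs/sft.cern.ch/lcg/releases/fjcontrib/1.041-66c72/x86_64-slc6-gcc8-opt/lib/libfastjetcontribfragile.so: file not recognized: Input/output error
collect2: error: ld returned 1 exit status
ERROR: fail to compile rivetvm
"""


def find_cvmfs_io_errors(log_text: str) -> list[str]:
    """Return the /cvmfs paths that appear in 'Input/output error' lines."""
    hits = []
    for line in log_text.splitlines():
        match = CVMFS_IO_ERROR.match(line.strip())
        if match:
            hits.append(match.group("path"))
    return hits


if __name__ == "__main__":
    for path in find_cvmfs_io_errors(SAMPLE_LOG):
        print("I/O error while reading:", path)
```

One or more hits point at the CVMFS download path (network or firewall) rather than at the science application itself, which matches the diagnosis given in the post.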

Crystal Pellet (Volunteer moderator, Volunteer tester) Joined: 14 Jan 10 Posts: 1450 Credit: 9,747,300 RAC: 593	Message 42472 - Posted: 14 May 2020, 9:08:14 UTC - in response to Message 42471.
Great, Anton, that you're looking into it in detail.
What is strange is that during VM startup "Probing /cvmfs/sft.cern.ch" is OK.

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1205 Credit: 71,488,942 RAC: 107,901	Message 42474 - Posted: 14 May 2020, 9:36:16 UTC - in response to Message 42473.
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
Christoph, you can put a particular one of your PCs in a separate group (work, home, school) and then set that group not to get Theory tasks.
As for your connection, if you want to take a quick look, you can watch a new task start running in your VM Console. You will see it happen in the first 3 minutes of running, where it tries to make it to runRivet.
It will just end up like this if you have a slow connection: [screenshot]
Instead of what you want, like this: [screenshot]
VB tasks here usually need a download/upload speed of 1.5 Mbps or better, or they will fail most of the time.

computezrmle (Volunteer moderator, Volunteer developer, Volunteer tester, Help desk expert) Joined: 15 Jun 08 Posts: 2636 Credit: 277,078,270 RAC: 144,704	Message 42475 - Posted: 14 May 2020, 9:41:19 UTC - in response to Message 42473.
Some ideas to be checked.
It might be unlikely, but if the Theory vdi file is corrupt for some reason, the BOINC client should be shut down whenever work allows. Then remove the file. It will be downloaded automatically when you restart BOINC.
Is the computer connected via Wi-Fi? -> Not recommended, as lots of data has to be transferred regularly.
What about the connection to your ISP?
- Download bandwidth?
- Upload bandwidth?
- Typical latency?
It might also be helpful if you make your computers visible to others:
https://lhcathome.cern.ch/lhcathome/prefs.php?subset=project
This avoids going via mcplots.

Greger Joined: 9 Jan 15 Posts: 151 Credit: 431,596,822 RAC: 0	Message 42476 - Posted: 14 May 2020, 10:00:38 UTC - in response to Message 42473.
Quoting Message 42473: "Setting the max concurrent to 0 in the app_config.xml does not seem to work."

<app>
  <name>Theory</name>
  <max_concurrent>0</max_concurrent>
</app>

The value in <max_concurrent>0</max_concurrent> needs to be 1 or higher. 0 is not accepted as a value, and that line would be ignored.
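Going by the rule described in the post above (a `max_concurrent` of 0 is not accepted), an app_config.xml can be checked before BOINC re-reads it. This is a sketch only: it uses Python's standard `xml.etree.ElementTree`, the function name `invalid_max_concurrent` is made up for illustration, and the "1 or higher" rule is taken from this thread rather than verified against the BOINC client.

```python
import xml.etree.ElementTree as ET


def invalid_max_concurrent(xml_text: str) -> list[tuple[str, str]]:
    """Return (app name, raw value) pairs for every <app> whose
    <max_concurrent> is not an integer of 1 or higher."""
    root = ET.fromstring(xml_text)
    problems = []
    for app in root.iter("app"):  # exact tag match, so <app_version> is skipped
        name = app.findtext("name", default="?")
        value = app.findtext("max_concurrent")
        if value is None:
            continue  # no limit configured for this app
        try:
            number = int(value.strip())
        except ValueError:
            problems.append((name, value))
            continue
        if number < 1:
            problems.append((name, value))
    return problems


if __name__ == "__main__":
    example = """
    <app_config>
      <app><name>ATLAS</name><max_concurrent>2</max_concurrent></app>
      <app><name>Theory</name><max_concurrent>0</max_concurrent></app>
    </app_config>
    """
    for app_name, raw in invalid_max_concurrent(example):
        print(f"{app_name}: max_concurrent={raw} would be ignored")
```

The sketch only inspects per-app limits; `project_max_concurrent` and the `app_version` sections are left unchecked.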

Magic Quantum Mechanic Joined: 24 Oct 04 Posts: 1205 Credit: 71,488,942 RAC: 107,901	Message 42482 - Posted: 14 May 2020, 13:51:33 UTC - in response to Message 42477.
Quoting Message 42477: "I put the machine in a separate group and deactivated the Theory application for this group. Indeed, the network connection is a weak point of this PC; it seems to be the source of the failures. Thanks to all for the input :)"
You're welcome, Christoph.
Yes, you will find that a slow internet connection can be the biggest problem with VB tasks. That can happen with any kind of ISP, especially if they throttle your speed after you have used a certain amount of data under your contract. (I have used DSL and now satellite, doing this with VB tasks for over 9 years.)
In my case they throttle my speed once my monthly total is used up, and since I run VB tasks mine is used up in the first 6 hours. So I have to watch closely, do speed tests, and check my Windows 10 Task Manager to see just how fast I am running before I try to start any. Even when I have full speed (up to 30 Mbps), I have to be careful about starting lots of these VB tasks, and I start a few at a time.
Another thing you can do is keep the VM Console open right after you start a new task, so you can catch any of the typical FAIL warnings (there is one you can just ignore). On page 2 of the VM Console you will see a timer running: it gives you 1 min 40 secs to get that part finished before it goes to page 3 and checks CVMFS (the file system). If that part is not finished within the 1 min 40 secs, the next page is where you will see that it failed.
Another problem is that these tasks can fail to start and never get to runRivet, yet keep on running for hours. They just end up as Invalid/computer error tasks, so if you catch them when they start you can abort them and not waste hours of your time.
Here is a snapshot of that part: [snapshot]
I have this goofy Bonus time on my ISP account between 2am and 8am, so I had to get up early just to post this and start some of the VB tasks I had waiting to run later, after 8am. It is now 6:50am PDT where I am.
Good luck

LHC@home