1) Message boards : Number crunching : Tips for optimizing BOINC file transfers for LHC@home (Message 46902)
Posted 16 Jun 2022 by Jesse Viviano
Post:
I should have done this earlier, but I finally found out why my old router was causing this trouble. Enabling IPv6 on it apparently disabled hardware acceleration, because its acceleration engine only handled IPv4. With IPv6 off, all incoming traffic could be handed straight to that engine; with IPv6 on, a sorting step had to classify each packet before any other processing, and since the engine could not do that, everything fell back to software. IPv6 is simple enough to route in software even at gigabit wire speed, but the NAT that a home IPv4 gateway performs as part of routing is very expensive in software, which caused these problems. Replacing the router with a newer-generation model that can accelerate both the sorting and the routing of IPv4 and IPv6 in hardware solved the problem.
2) Message boards : News : Problem writing CMS job results; please avoid CMS tasks until we find the reason (Message 38601)
Posted 22 Apr 2019 by Jesse Viviano
Post:
If the problem is related to feeding the data from the upload server into whatever processes it, could you just disable the CMS assimilator, so that the uploaded results become a backlog to be processed once the problem is fixed instead of getting lost?
3) Message boards : News : Problem writing CMS job results; please avoid CMS tasks until we find the reason (Message 38596)
Posted 21 Apr 2019 by Jesse Viviano
Post:
So what should crunchers with CMS jobs in their queues do with them? Do we abort them or just suspend them if they have not started yet?
4) Message boards : Number crunching : Postponed: VM job unmanageble, restarting later ????? (Message 36832)
Posted 23 Sep 2018 by Jesse Viviano
Post:
I have noticed that quitting BOINC, waiting for all of the tasks to cleanly shut down and get saved to disk, and then restarting BOINC fixes all of the work units affected by this problem. However, it does not prevent new work units from getting this problem.
5) Message boards : Number crunching : Tips for optimizing BOINC file transfers for LHC@home (Message 36733)
Posted 17 Sep 2018 by Jesse Viviano
Post:
One more benefit I found is that since only one work unit can download at a time, the credential server at LHC@home never gets overwhelmed with requests and will always supply credentials. If too many work units start up at the same time, the credential server can only create credentials for a few of them, causing the rest of the work units starting up at the same time to fail.
6) Message boards : Number crunching : Tips for optimizing BOINC file transfers for LHC@home (Message 35423)
Posted 4 Jun 2018 by Jesse Viviano
Post:
It will see one device, but it sees every connection separately. This allows the firewall to block incoming packets except those belonging to known connections that originated inside the firewall. It therefore has to keep track of every connection so that it knows which packets to allow and route back inside, and drops everything else.
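The bookkeeping described above can be sketched as a minimal connection-tracking table. This is a simplified illustration, not any real router's implementation; the class and field names are made up:

```python
# Minimal sketch of stateful connection tracking in a home router.
# A real router tracks far more state (TCP flags, timeouts, NAT port
# mappings); this only shows why every connection needs its own entry.

class ConnectionTracker:
    def __init__(self):
        # Each entry is a 5-tuple: (src_ip, src_port, dst_ip, dst_port, proto)
        self.table = set()

    def outbound(self, src_ip, src_port, dst_ip, dst_port, proto="tcp"):
        """A packet leaving the LAN creates a tracked connection."""
        self.table.add((src_ip, src_port, dst_ip, dst_port, proto))

    def allow_inbound(self, src_ip, src_port, dst_ip, dst_port, proto="tcp"):
        """An incoming packet is allowed only if it matches the reverse
        of a connection that originated inside the firewall."""
        return (dst_ip, dst_port, src_ip, src_port, proto) in self.table


tracker = ConnectionTracker()
tracker.outbound("192.168.1.10", 50000, "188.184.9.234", 443)

# A reply to the tracked connection is routed back inside...
print(tracker.allow_inbound("188.184.9.234", 443, "192.168.1.10", 50000))  # True
# ...while an unsolicited incoming packet is dropped.
print(tracker.allow_inbound("203.0.113.5", 80, "192.168.1.10", 50000))     # False
```

Every BOINC file transfer adds one such entry, which is why the router's per-connection work grows with the number of simultaneous transfers.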
7) Message boards : Number crunching : Tips for optimizing BOINC file transfers for LHC@home (Message 35402)
Posted 31 May 2018 by Jesse Viviano
Post:
That would only work if you could put the ISP router in bridge mode. Otherwise, the ISP router will still keep track of all connections because it will see all of the different destinations and different source ports and still perform NAT on each connection.
8) Message boards : Number crunching : Tips for optimizing BOINC file transfers for LHC@home (Message 34703)
Posted 19 Mar 2018 by Jesse Viviano
Post:
In my ISP-mandated router, I think what is going on is that it was designed for lower-speed WAN connections, which it handles fine. However, my ISP decided to offer gigabit, and handling multiple high-speed connections at once at those speeds is too much for it. It can handle one high-speed connection fine at gigabit speed, but more than that is too much for it to route at one time. My guess is that the large amount of data it juggles when handling two or more high-speed connections at once pushes the routing data out of the cache and into DRAM, which is really slow.

Some cable modems, like those using Intel and Texas Instruments cable modem chips, have a hardware routing table in their packet processing engines that can handle only a fixed maximum number of IP connections at once; exhausting it causes a denial of service, because no further connections can be created until some of the old connections are torn down. (Intel bought Texas Instruments' cable modem business.) I believe that Broadcom's cable modem chips use software processing, but they perform better than the Intel- and TI-based cable modems with their poorly designed hardware processors.
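That exhaustion failure mode can be sketched as a fixed-capacity table. The capacity here is made up; real chips differ, and this is only an illustration of the denial-of-service behavior described above:

```python
# Sketch of a fixed-size hardware connection table: once it is full,
# new connections fail until old entries are torn down.
class HardwareConnTable:
    def __init__(self, capacity):
        self.capacity = capacity  # hypothetical hardware limit
        self.entries = set()

    def open(self, conn):
        if len(self.entries) >= self.capacity:
            return False  # denial of service: no slot for the new connection
        self.entries.add(conn)
        return True

    def close(self, conn):
        self.entries.discard(conn)


table = HardwareConnTable(capacity=3)
assert all(table.open(i) for i in range(3))
assert not table.open(99)   # table full: the new connection is refused
table.close(0)              # tearing an old connection down...
assert table.open(99)       # ...frees a slot for a new one
```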

You cannot really know what is going on inside it. It could be poor software which could be fixed with a firmware update. It could be poorly designed hardware. It could be hardware which was designed for one set of requirements, but is now being pushed to handle much higher requirements than it is designed for and therefore performs poorly at those higher requirements in some cases.
9) Message boards : Number crunching : Tips for optimizing BOINC file transfers for LHC@home (Message 34702)
Posted 19 Mar 2018 by Jesse Viviano
Post:
The line "Computing errors are eliminated because the low throughput connections each of the LHC@home virtual machines are not squeezed out by two high throughput file transfers running at the same time. The router can finally route these while one high throughput file transfer is going on." should read "Computing errors are eliminated because the low throughput connections each of the LHC@home virtual machines generate are not squeezed out by two high throughput file transfers running at the same time. The router can finally route these while one high throughput file transfer is going on." However, BOINC's time limit for editing this post is expired.
10) Message boards : Number crunching : Tips for optimizing BOINC file transfers for LHC@home (Message 34694)
Posted 18 Mar 2018 by Jesse Viviano
Post:
I noticed that I used to have computing errors and times when my internet connection would fail. I then noticed that BOINC was trying to download two or more very large files at the same time from LHC@home. Until the downloads completed, I had to suspend my VirtualBox LHC@home tasks to keep them from failing with a compute error, and those downloads proceeded at a really slow rate. Those two downloads, plus the other traffic my router has to manage (e.g. providing a MoCA connection to the TV's DVR), were apparently maxing out my router's CPU or routing hardware. I found that limiting BOINC to one file transfer at a time solved all three issues: the temporary blocking of any other Internet activity, the compute errors, and the slow speed of the two large file transfers BOINC was performing.

This can be done by changing two lines in BOINC's cc_config.xml file, which is found in C:\ProgramData\BOINC on most Windows Vista and later computers. I changed the line "&lt;max_file_xfers&gt;8&lt;/max_file_xfers&gt;" to "&lt;max_file_xfers&gt;1&lt;/max_file_xfers&gt;". I then changed "&lt;max_file_xfers_per_project&gt;2&lt;/max_file_xfers_per_project&gt;" to "&lt;max_file_xfers_per_project&gt;1&lt;/max_file_xfers_per_project&gt;".
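In context, the relevant fragment of cc_config.xml looks something like this (other options omitted; your file may contain many more entries inside the same elements):

```xml
<cc_config>
  <options>
    <!-- Allow only one file transfer at a time across all projects -->
    <max_file_xfers>1</max_file_xfers>
    <!-- Allow only one file transfer at a time per project -->
    <max_file_xfers_per_project>1</max_file_xfers_per_project>
  </options>
</cc_config>
```

The changes take effect after restarting BOINC, or after telling the running client to re-read its configuration files from the BOINC Manager.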

This caused these effects:

  • Large file transfers in BOINC sped up immensely. I am guessing that having only one large transfer going on allowed the routing data for the single large high throughput file transfer to stay in my router's CPU's cache instead of constantly being pushed to DRAM, which is much slower than the cache.
  • Transfers of multiple small files slowed down, because BOINC will only attempt one file transfer per second instead of the default of two per second.
  • Computing errors are eliminated because the low throughput connections each of the LHC@home virtual machines generate are not squeezed out by two high throughput file transfers running at the same time. The router can finally route these while one high throughput file transfer is going on.
  • The temporary internet outages caused by multiple large file transfers maxing out the router's CPU ceased.



These tips are most helpful if your ISP mandates a specific router for your service and you have no option to use your own. If you can use your own, then try that first.

11) Message boards : ATLAS application : Uploads of finished tasks not possible since last night (Message 33424)
Posted 17 Dec 2017 by Jesse Viviano
Post:
I have finally been able to upload and report my ATLAS@home task.
12) Message boards : ATLAS application : Uploads of finished tasks not possible since last night (Message 33422)
Posted 16 Dec 2017 by Jesse Viviano
Post:
I think that I have seen other BOINC projects have upload handlers that either automatically start over on failed file uploads, or direct the BOINC client to start the upload at the point where the interruption occurred. Could this project be programmed to do either of these?
13) Message boards : ATLAS application : Uploads of finished tasks not possible since last night (Message 33416)
Posted 16 Dec 2017 by Jesse Viviano
Post:
I have uploaded two of three results that were stuck, but one of them is still stuck with a file locked by file_upload_handler PID=-1 error:
12/16/2017 1:43:11 PM | LHC@home | Started upload of tKSLDmWh3irnDDn7oo6G73TpABFKDmABFKDm9BLKDmABFKDmF5OqMn_1_r815870200_ATLAS_result
12/16/2017 1:43:16 PM | LHC@home | [error] Error reported by file upload server: [tKSLDmWh3irnDDn7oo6G73TpABFKDmABFKDm9BLKDmABFKDmF5OqMn_1_r815870200_ATLAS_result] locked by file_upload_handler PID=-1
12/16/2017 1:43:16 PM | LHC@home | Temporarily failed upload of tKSLDmWh3irnDDn7oo6G73TpABFKDmABFKDm9BLKDmABFKDmF5OqMn_1_r815870200_ATLAS_result: transient upload error
12/16/2017 1:43:16 PM | LHC@home | Backing off 03:39:17 on upload of tKSLDmWh3irnDDn7oo6G73TpABFKDmABFKDm9BLKDmABFKDmF5OqMn_1_r815870200_ATLAS_result
14) Message boards : ATLAS application : Download failures (Message 31732)
Posted 30 Jul 2017 by Jesse Viviano
Post:
Did someone move the ATLAS@home download server to another IP address? I noticed that my BOINC client cannot connect to the download server at all for ATLAS@home tasks, while it is able to download other tasks. If that is the case, the solution could be to wait for the old DNS entry to expire. However, if someone changed the DNS without moving the ATLAS@home server to the new IP address, then either the DNS entry for the ATLAS@home download server needs to be changed back or the ATLAS@home server needs to be moved to the new IP address.
15) Message boards : Number crunching : not sending out SixTrack (Message 28635)
Posted 23 Jan 2017 by Jesse Viviano
Post:
There are two ATLAS feeders. The one for the ATLAS@home project you attach to at http://atlasathome.cern.ch/ is up according to http://atlasathome.cern.ch/server_status.php, while the feeder for the ATLAS@home project you attach to at https://lhcathome.cern.ch/ATLAS/ is down according to https://lhcathome.cern.ch/ATLAS/server_status.php.
16) Message boards : Number crunching : Max # CPUs vs projects? (Message 28634)
Posted 23 Jan 2017 by Jesse Viviano
Post:
I would think that actually running multicore jobs could be more efficient, barring limitations like poor schedulers that leave cores idle. Because more cores are executing the same process, fewer processes fight over the last-level cache(s) and the memory controller(s). A competing process is therefore less likely to cause a cache block eviction, which means more cache hits, more work done in less time, less memory-system overhead, and less need to go out to DRAM.
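A back-of-envelope sketch of that argument, with entirely hypothetical numbers (8 MB of shared last-level cache, a 3 MB hot working set per job, 8 cores):

```python
# Hypothetical numbers for illustration only: real cache sizes and
# working sets vary widely by CPU and application.
LLC_MB = 8          # shared last-level cache size
WORKING_SET_MB = 3  # hot working set of one job
CORES = 8

# Eight independent single-core jobs: their combined demand far
# exceeds the shared cache, so blocks are constantly evicted to DRAM.
singlecore_demand = CORES * WORKING_SET_MB   # 24 MB

# One 8-core job: all threads share a single working set that fits.
multicore_demand = WORKING_SET_MB            # 3 MB

print(singlecore_demand > LLC_MB)   # True: single-core jobs spill to DRAM
print(multicore_demand <= LLC_MB)   # True: the shared set stays cache-resident
```

Whether the effect shows up in practice depends on how much data the jobs actually share between threads.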
17) Message boards : Number crunching : Broken website features. (Message 28528)
Posted 16 Jan 2017 by Jesse Viviano
Post:
The old one is deprecated and is only useful for computers with old BOINC clients that cannot use the HTTPS URL, or for checking the status of the SixTrack daemons, which include the assimilators and validators.
18) Message boards : Number crunching : General Work Shortage? (Message 28509)
Posted 15 Jan 2017 by Jesse Viviano
Post:
I have noticed that the make_work_app daemon is marked as "Not Running" on the new LHC@home server status page at https://lhcathome.cern.ch/lhcathome/server_status.php (which is a different page from the old LHC@home 1.0 server status page at http://lhcathomeclassic.cern.ch/sixtrack/server_status.php). (Both server status pages list different daemons on different servers, so both are still useful for now.) Is this related to the problems people are having?
19) Message boards : Number crunching : One my processor is desapear, why ? (Message 28499)
Posted 15 Jan 2017 by Jesse Viviano
Post:
I just realized that I made an error in my advice about letting tasks drain out if you have a malware problem. You should set the No new tasks mode, abort all tasks, report them, capture and submit the malware sample to your antivirus company if you are able to do that, and then reformat your computer if you have a malware problem.
20) Message boards : Number crunching : One my processor is desapear, why ? (Message 28496)
Posted 14 Jan 2017 by Jesse Viviano
Post:
I am not certain what you are writing about. Have you checked how many cores appear in the Task Manager? Press Control + Alt + Delete, open the Task Manager, click "More details", and then click the Performance tab. If 4 cores are shown, then something could be wrong with your BOINC preferences, or your computer might not have enough memory to run 4 tasks at once while keeping the required amount of memory free. You could also have set the preferences to use only 75% of the CPU, leaving one core free. Another possibility is that your computer is infected by malware, in which case you might have to let all of your tasks drain by putting your projects in "No new tasks" mode and then reformat your computer. Finally, your CPU is really old; could one of the cores have failed and therefore been shut down by the power-on self-test?
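The "75% of the CPUs" possibility can be checked with a quick sketch. This is a rough model of BOINC's "Use at most N% of the CPUs" preference, assuming a simple floor with a minimum of one core; BOINC's exact rounding may differ:

```python
import math
import os

def usable_cores(total_cores, use_at_most_percent):
    """Rough model of BOINC's 'Use at most N% of the CPUs' preference.
    Assumes the core budget is floored, with a minimum of one core."""
    return max(1, math.floor(total_cores * use_at_most_percent / 100))

print(usable_cores(4, 75))   # 3: one core of a quad-core CPU left idle
print(usable_cores(4, 100))  # 4: all cores usable
print(os.cpu_count())        # logical CPUs the OS reports on this machine
```

If Task Manager shows 4 cores but BOINC only ever runs 3 tasks, this preference is the first thing to check.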




©2024 CERN