Message boards : ATLAS application : hits file upload fails immediately

Ken_g6

Joined: 4 Jul 06
Posts: 7
Credit: 339,011
RAC: 36
Message 49671 - Posted: 1 Mar 2024, 8:38:37 UTC

Yes, it's the big one. No, I'm not using any proxy. Ubuntu 20.04 with BOINC 7.16.6. I've tried lots of times but it just won't upload.

I found a <file_xfer_debug> flag in cc_config.xml to show more:

Fri 01 Mar 2024 01:34:06 AM MST | LHC@home | [fxd] starting upload, upload_offset -1
Fri 01 Mar 2024 01:34:06 AM MST | LHC@home | Started upload of mhzMDmrq1z4np2BDcpmwOghnABFKDmABFKDm73LSDmi7hKDm0utSzn_0_r514301471_ATLAS_hits
Fri 01 Mar 2024 01:34:06 AM MST | LHC@home | [file_xfer] URL: http://lhcathome-upload.cern.ch/lhcathome_cgi/file_upload_handler
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [file_xfer] http op done; retval 0 (Success)
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [file_xfer] parsing upload response: <data_server_reply>    <status>0</status>    <file_size>0</file_size></data_server_reply>
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [file_xfer] parsing status: 0
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [fxd] starting upload, upload_offset 0
Fri 01 Mar 2024 01:34:09 AM MST | LHC@home | [file_xfer] http op done; retval -224 (permanent HTTP error)
Fri 01 Mar 2024 01:34:09 AM MST | LHC@home | [file_xfer] file transfer status -224 (permanent HTTP error)
Fri 01 Mar 2024 01:34:09 AM MST | LHC@home | Backing off 04:15:39 on upload of mhzMDmrq1z4np2BDcpmwOghnABFKDmABFKDm73LSDmi7hKDm0utSzn_0_r514301471_ATLAS_hits
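In case anyone wants the same output: this is roughly the cc_config.xml I used to turn the flag on (it goes in the BOINC data directory; the client picks it up after "Read config files" or a restart):

<cc_config>
  <log_flags>
    <!-- log details of file uploads and downloads -->
    <file_xfer_debug>1</file_xfer_debug>
  </log_flags>
</cc_config>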

ID: 49671
Senture

Joined: 16 May 16
Posts: 3
Credit: 13,107,359
RAC: 256
Message 49672 - Posted: 1 Mar 2024, 9:24:24 UTC

Got the same problem myself with two computers. Interestingly, the HITS files that fail to upload are much larger than the others: typical result files seem to be in the 870-890 MB range, while these are over 1.3 GB.

Had more than a dozen smaller result files upload successfully while these have been stuck, so I presume something on the server side is blocking the larger-than-usual files.
ID: 49672
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2418
Credit: 226,729,019
RAC: 130,317
Message 49673 - Posted: 1 Mar 2024, 9:34:03 UTC - in response to Message 49671.  
Last modified: 1 Mar 2024, 9:35:32 UTC

Looks like the upload server is OK.
Just uploaded 2 HITS files without issues, size > 900 MB each.

ATLAS downloads go via the same server.
Got a fresh EVNT file a few minutes ago, also without issues.

The log snippets just tell you what you already know:
HTTP connections fail.


Possibly not directly related, but faking 100 cores for an i7-6700 CPU might cause unwanted side effects:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10504663
Owner 	Ken_g6
Created 	31 Oct 2017, 17:33:02 UTC
CPU type 	GenuineIntel
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz [Family 6 Model 94 Stepping 3]
Number of processors 	100
ID: 49673
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2418
Credit: 226,729,019
RAC: 130,317
Message 49675 - Posted: 1 Mar 2024, 13:05:05 UTC - in response to Message 49672.  

Those issues are typical for ATLAS tasks configured to run too many events (1000 or even 2000).
Looks like there is a batch of them in the queue (I got a 2000-event task myself today).

There's not really a solution, other than that such a huge number of events should not be configured by the submitter.
As you can read in another thread, a couple of volunteers have been aggressively requesting those tasks, contrary to all experience and to the agreements made in the past.
ID: 49675
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 52,183,546
RAC: 142,879
Message 49676 - Posted: 1 Mar 2024, 17:19:29 UTC - in response to Message 49675.  
Last modified: 1 Mar 2024, 17:27:11 UTC


There's not really a solution, other than that such a huge number of events should not be configured by the submitter.
As you can read in another thread, a couple of volunteers have been aggressively requesting those tasks, contrary to all experience and to the agreements made in the past.

Well, we want bigger jobs, assuming the project server can handle them. If the LHC servers can't handle the upload, then of course they shouldn't issue such broken WUs. That's no different from any other project: nobody should release WUs that won't work, big or small. It's just a waste of everyone's time and resources.

If there were past agreements, does that mean whoever is submitting these batches is either new or not aware of the issues? In the meantime, is the only solution to abort the upload and thereby fail these finished tasks?
ID: 49676
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 52,183,546
RAC: 142,879
Message 49677 - Posted: 1 Mar 2024, 17:21:53 UTC

All my stuck uploads are around 1.4 GB, while the 800-900 MB ones upload without a problem at the same time. I suppose the stuck ones are the 2000-event WUs and the server is not configured to accept files over some threshold?
ID: 49677
Saturn911

Joined: 3 Nov 12
Posts: 36
Credit: 118,028,939
RAC: 128,729
Message 49678 - Posted: 1 Mar 2024, 18:03:11 UTC - in response to Message 49677.  
Last modified: 1 Mar 2024, 18:16:36 UTC

Sounds to me like the old upload malfunction of the local Squid.
Have you set the upload cache large enough for these gigabyte-sized files?

For a temporary workaround, try this:
set the no-proxy option for "lhcathome-upload.cern.ch" in the BOINC Manager.
Works here like a charm.

See here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5474&postid=47465
ID: 49678
Senture

Joined: 16 May 16
Posts: 3
Credit: 13,107,359
RAC: 256
Message 49679 - Posted: 1 Mar 2024, 18:42:22 UTC - in response to Message 49678.  

How would one go about setting this if not using a squid cache?
ID: 49679
Saturn911

Joined: 3 Nov 12
Posts: 36
Credit: 118,028,939
RAC: 128,729
Message 49680 - Posted: 1 Mar 2024, 18:55:19 UTC - in response to Message 49679.  

How would one go about setting this if not using a squid cache?


Hit and sunk
ID: 49680
Senture

Joined: 16 May 16
Posts: 3
Credit: 13,107,359
RAC: 256
Message 49681 - Posted: 1 Mar 2024, 19:04:02 UTC - in response to Message 49680.  

Hit and sunk

Is that supposed to mean there's no way of doing it?
ID: 49681
Jonathan

Joined: 25 Sep 17
Posts: 99
Credit: 3,261,384
RAC: 3,595
Message 49682 - Posted: 1 Mar 2024, 19:43:32 UTC - in response to Message 49681.  
Last modified: 1 Mar 2024, 19:44:28 UTC

If you are not using Squid, there is nothing to set.
Are you having problems with uploads?
ID: 49682
Ken_g6

Joined: 4 Jul 06
Posts: 7
Credit: 339,011
RAC: 36
Message 49683 - Posted: 1 Mar 2024, 20:03:10 UTC - in response to Message 49673.  

Possibly not directly related, but faking 100 cores for an i7-6700 CPU might cause unwanted side effects

I like to set the number of cores to use directly, instead of calculating a percentage of the CPU in my head. I almost never set it to more than 8.
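(For anyone wondering, setting a fixed core count like that is normally done with the <ncpus> option in cc_config.xml rather than the "use at most X% of processors" preference; roughly:

<cc_config>
  <options>
    <!-- report/use this many CPUs instead of the real count; -1 = use real count -->
    <ncpus>8</ncpus>
  </options>
</cc_config>
)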

Anyway, I got one task in; hopefully this other one will go eventually.
ID: 49683
Saturn911

Joined: 3 Nov 12
Posts: 36
Credit: 118,028,939
RAC: 128,729
Message 49684 - Posted: 1 Mar 2024, 20:29:31 UTC - in response to Message 49681.  

Is that supposed to mean there's no way of doing it?


That's right.
You shoot and hit, so my boat sinks and you win.
ID: 49684
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 52,183,546
RAC: 142,879
Message 49685 - Posted: 2 Mar 2024, 1:56:46 UTC - in response to Message 49678.  

I'm not using Squid, though. I do use a SOCKS5 proxy, but bypassing it didn't help either.
ID: 49685
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2418
Credit: 226,729,019
RAC: 130,317
Message 49686 - Posted: 2 Mar 2024, 9:11:30 UTC - in response to Message 49685.  

What makes the project's downloads faster is the local cache Squid provides for HTTP objects.
Its proxy function comes on top automatically.

SOCKS proxies usually don't cache anything; does yours?
If not, you may try putting a Squid between your clients and the SOCKS proxy.

Either this:
Internet <-> local Router <-> SOCKS <-> Squid <-> local clients

Or this:
Internet <-> local Router <-> Squid <-> local clients
ID: 49686
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2418
Credit: 226,729,019
RAC: 130,317
Message 49688 - Posted: 2 Mar 2024, 14:41:54 UTC

Did some tests to find out whether an upload size limit exists.
It does.
:-(


Looks like files > 1024 MB do not upload to lhcathome-upload.cern.ch.

Still unclear whether the limit is set
- at the project server
- at the client side, e.g. hardwired or an implicit libcurl limit

Since there will not be a quick solution in any case, tasks producing an upload file > 1024 MB are lost and should be cancelled.



As for the Squid workaround mentioned in other posts:
client_request_buffer_max_size xyz MB

During the tests the value xyz was set to 100.
Nonetheless, files larger than that but < 1024 MB uploaded fine.

Uploads via Squid get stuck only if the option is not set in squid.conf at all.
Looks like the option just needs to be present.

Squid version used: v6.6 on Linux
Other Squid versions (especially 5.x) may behave differently.
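For context, a minimal sketch of where the option sits in a squid.conf for a BOINC LAN proxy (network range, cache path and sizes are just examples, not the exact test setup):

# example values - adapt the network range, cache path and sizes to your LAN
http_port 3128
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all
cache_dir ufs /var/spool/squid 20000 16 256
maximum_object_size 4 GB
# the option discussed above; uploads got stuck only when it was missing entirely
client_request_buffer_max_size 100 MB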
ID: 49688
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 52,183,546
RAC: 142,879
Message 49689 - Posted: 2 Mar 2024, 17:29:36 UTC - in response to Message 49686.  

Just to elaborate on my SOCKS proxy, though I believe we've ruled it out already. You are correct that SOCKS doesn't cache. I use SOCKS not for LHC specifically, but to route all BOINC traffic through it: I need to do traffic shaping for uploads because my asymmetric cable broadband has abysmal upload speed. :-(

As for Squid, I run the native app and have CVMFS installed on each host with the Cloudflare CDN config. I used to have Squid doing transparent caching on the router too, but the hit rate dropped to pretty much nothing after I installed CVMFS locally on each host, so I removed it a long time ago; I'm pretty sure there is no Squid anywhere in my network.
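For anyone curious about the CVMFS side, a minimal /etc/cvmfs/default.local for this kind of setup looks roughly like this (repository list and quota are illustrative; the openhtc.io / Cloudflare server URLs are usually placed in /etc/cvmfs/domain.d/cern.ch.local):

# /etc/cvmfs/default.local - example values
CVMFS_REPOSITORIES="atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch"
CVMFS_QUOTA_LIMIT=4096      # local cache size in MB
CVMFS_HTTP_PROXY="DIRECT"   # no local Squid, talk to the CDN directly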
ID: 49689
Ken_g6

Joined: 4 Jul 06
Posts: 7
Credit: 339,011
RAC: 36
Message 49690 - Posted: 2 Mar 2024, 22:30:27 UTC

I did a little research. It seems it should be possible to configure Apache and PHP to accept files of up to 2 GB; beyond that it starts hitting integer overflows.
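If the upload handler really is fronted by Apache with PHP in the path (that part is my guess), the stock knobs would be something like:

# Apache: request body limit in bytes (0 = unlimited, maximum 2147483647)
LimitRequestBody 2147483647

; php.ini: only relevant if PHP actually handles the upload
upload_max_filesize = 2G
post_max_size = 2G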
ID: 49690
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 52,183,546
RAC: 142,879
Message 49698 - Posted: 5 Mar 2024, 4:52:04 UTC - in response to Message 49692.  
Last modified: 5 Mar 2024, 4:52:14 UTC

It seems that there was a preliminary misconfiguration of the BOINC jobs, and this should be fixed now.

I suppose this means the server won't be configured to accept the big uploads. If so, will the bad WUs already sent out be aborted from the server side, or should we just abort the upload after the computation finishes?
ID: 49698
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 242
Credit: 5,800,306
RAC: 0
Message 49699 - Posted: 5 Mar 2024, 7:10:35 UTC - in response to Message 49698.  
Last modified: 5 Mar 2024, 7:14:02 UTC

We are still trying to debug what is blocking these file uploads on the servers: we have not set a limit, but the httpd process crashes during these uploads. What is strange is that we have not seen these mpm prefork crashes before, probably because the files were smaller. It looks like the servers run out of memory during the upload.
ID: 49699