Message boards : ATLAS application : hits file upload fails immediately
Send message Joined: 4 Jul 06 Posts: 7 Credit: 339,475 RAC: 0 |
Yes, it's the big one. No, I'm not using any proxy. Ubuntu 20.04 with BOINC 7.16.6. I've tried lots of times but it just won't upload. I found a <file_xfer_debug> flag in cc_config.xml to show more:

Fri 01 Mar 2024 01:34:06 AM MST | LHC@home | [fxd] starting upload, upload_offset -1
Fri 01 Mar 2024 01:34:06 AM MST | LHC@home | Started upload of mhzMDmrq1z4np2BDcpmwOghnABFKDmABFKDm73LSDmi7hKDm0utSzn_0_r514301471_ATLAS_hits
Fri 01 Mar 2024 01:34:06 AM MST | LHC@home | [file_xfer] URL: http://lhcathome-upload.cern.ch/lhcathome_cgi/file_upload_handler
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [file_xfer] http op done; retval 0 (Success)
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [file_xfer] parsing upload response: <data_server_reply> <status>0</status> <file_size>0</file_size> </data_server_reply>
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [file_xfer] parsing status: 0
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [fxd] starting upload, upload_offset 0
Fri 01 Mar 2024 01:34:09 AM MST | LHC@home | [file_xfer] http op done; retval -224 (permanent HTTP error)
Fri 01 Mar 2024 01:34:09 AM MST | LHC@home | [file_xfer] file transfer status -224 (permanent HTTP error)
Fri 01 Mar 2024 01:34:09 AM MST | LHC@home | Backing off 04:15:39 on upload of mhzMDmrq1z4np2BDcpmwOghnABFKDmABFKDm73LSDmi7hKDm0utSzn_0_r514301471_ATLAS_hits
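For anyone who wants the same logging, a minimal cc_config.xml sketch that enables only this flag (keep whatever other options you already use):

    <!-- cc_config.xml in the BOINC data directory (e.g. /var/lib/boinc-client on Ubuntu).
         Minimal sketch: only the file_xfer_debug log flag is relevant here.
         Reload with "boinccmd --read_cc_config" or restart the client. -->
    <cc_config>
      <log_flags>
        <file_xfer_debug>1</file_xfer_debug>
      </log_flags>
    </cc_config>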
Send message Joined: 16 May 16 Posts: 3 Credit: 14,673,082 RAC: 7,026 |
Got the same problem myself with two computers. Interestingly, the HITS files failing to upload are much larger than the others. Typical result files seem to be in the 870-890 MB range, while these are over 1.3 GB. More than a dozen smaller result files have uploaded successfully while these have been stuck, so I presume it's something on the server side blocking the larger-than-usual files.
Send message Joined: 15 Jun 08 Posts: 2519 Credit: 251,152,564 RAC: 118,696 |
Looks like the upload server is OK. Just uploaded 2 HITS files without issues, size > 900 MB each. ATLAS downloads go via the same server; got a fresh EVNT file a few minutes ago, also without issues. The log snippets just tell you what you already know: the HTTP connections fail.

Possibly not directly related, but faking 100 cores for an i7-6700 CPU might cause unwanted side effects:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10504663

Owner: Ken_g6
Created: 31 Oct 2017, 17:33:02 UTC
CPU type: GenuineIntel Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz [Family 6 Model 94 Stepping 3]
Number of processors: 100
Send message Joined: 15 Jun 08 Posts: 2519 Credit: 251,152,564 RAC: 118,696 |
Those issues are typical for ATLAS tasks configured to run too many events (1000 or even 2000). Looks like there is a batch of them in the queue (I got a 2000-event task myself today). There's no real solution other than the submitter not configuring such huge event counts. As you can read in another thread, a couple of volunteers are aggressively requesting those tasks, contrary to all experience and the agreements made in the past.
Send message Joined: 14 Sep 08 Posts: 51 Credit: 62,963,540 RAC: 86,039 |
Well, we want bigger jobs, assuming the project servers can handle them. If the LHC servers can't handle the upload, then of course the project shouldn't issue such broken WUs. That's no different from any other project: nobody should release WUs that won't work, big or small. It's just a waste of everyone's time and resources. If there were past agreements, does that mean whoever is submitting these batches is either new or not aware of the issues? In the meantime, is the only solution to abort the upload and thus fail these finished tasks?
Send message Joined: 14 Sep 08 Posts: 51 Credit: 62,963,540 RAC: 86,039 |
All my stuck uploads are around 1.4 GB, and the 800-900 MB ones upload without a problem at the same time. I suppose those are the 2000-event WUs and the server is not configured to accept files over some threshold?
Send message Joined: 3 Nov 12 Posts: 55 Credit: 138,787,820 RAC: 105,444 |
Sounds to me like the old upload malfunction of the local Squid. Have you set the upload cache large enough for these gigabyte-sized files?

For a temporary workaround, try this: add "lhcathome-upload.cern.ch" to the no-proxy list in BOINC Manager. Works here like a charm.

See here: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5474&postid=47465
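If you prefer a config file over the Manager dialog, something along these lines may work; this is only a sketch and assumes your BOINC client version accepts a <proxy_info> block (and these exact element names) in cc_config.xml, so check the client configuration docs for your version. Otherwise, enter the same host in the proxy exclusion field of BOINC Manager's options dialog.

    <!-- Sketch only: element names and cc_config.xml support are assumptions to verify
         against the BOINC client configuration documentation for your version.
         The proxy host/port below are placeholder examples for a local Squid. -->
    <cc_config>
      <options>
        <proxy_info>
          <http_server_name>192.168.1.10</http_server_name>
          <http_server_port>3128</http_server_port>
          <no_proxy>lhcathome-upload.cern.ch</no_proxy>
        </proxy_info>
      </options>
    </cc_config>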
Send message Joined: 16 May 16 Posts: 3 Credit: 14,673,082 RAC: 7,026 |
How would one go about setting this if not using a squid cache?
Send message Joined: 3 Nov 12 Posts: 55 Credit: 138,787,820 RAC: 105,444 |
> How would one go about setting this if not using a squid cache?

Hit and sunk.
Send message Joined: 16 May 16 Posts: 3 Credit: 14,673,082 RAC: 7,026 |
> Hit and sunk

Is that supposed to mean there's no way of doing it?
Send message Joined: 25 Sep 17 Posts: 99 Credit: 3,425,566 RAC: 1 |
If you are not using Squid, there is nothing to set. Are you having problems with uploads?
Send message Joined: 4 Jul 06 Posts: 7 Credit: 339,475 RAC: 0 |
> Possibly not directly related, but faking 100 cores for an i7-6700 CPU might cause unwanted side effects

I like to set the number of cores to use instead of calculating a percentage of the CPU in my head. I almost never set it to more than 8. Anyway, I got one task in; hopefully this other one will go through eventually.
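For reference, the usual way to pin the client to a fixed core count instead of a CPU percentage is the <ncpus> option in cc_config.xml; a minimal sketch (8 is just an example value):

    <!-- cc_config.xml: make the client use a fixed number of CPUs.
         8 is an example; the default of -1 means "use the detected count". -->
    <cc_config>
      <options>
        <ncpus>8</ncpus>
      </options>
    </cc_config>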
Send message Joined: 3 Nov 12 Posts: 55 Credit: 138,787,820 RAC: 105,444 |
> Is that supposed to mean there's no way of doing it?

That's right. You shot and hit, so my boat sinks and you win.
Send message Joined: 14 Sep 08 Posts: 51 Credit: 62,963,540 RAC: 86,039 |
I'm not using Squid, though. I did use a SOCKS5 proxy, but bypassing that didn't help either.
Send message Joined: 15 Jun 08 Posts: 2519 Credit: 251,152,564 RAC: 118,696 |
What makes the project's downloads faster is the local cache Squid provides for HTTP objects; its proxying comes on top automatically. SOCKS proxies usually don't cache anything. Does yours? If not, you may try a Squid between your clients and the SOCKS proxy.

Either this:
Internet <-> local router <-> SOCKS <-> Squid <-> local clients

Or this:
Internet <-> local router <-> Squid <-> local clients
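For anyone setting this up from scratch, a minimal squid.conf sketch for the second layout. The LAN range, cache sizes and paths are assumptions to adapt; this is not the project's official recommended configuration.

    # Minimal sketch: LAN-side caching proxy in front of the BOINC clients.
    # Adapt the LAN range, cache directory and sizes to your setup.
    http_port 3128
    # assumed LAN range
    acl localnet src 192.168.0.0/16
    http_access allow localnet
    http_access deny all
    cache_mem 256 MB
    # allow caching of large downloads such as EVNT files
    maximum_object_size 2 GB
    cache_dir ufs /var/spool/squid 20000 16 256

The BOINC clients then point at this box via their HTTP proxy settings (the Squid host and port 3128 in this sketch).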
Send message Joined: 15 Jun 08 Posts: 2519 Credit: 251,152,564 RAC: 118,696 |
Did some tests to find out whether an upload size limit exists. It does. :-(

Looks like files > 1024 MB do not upload to lhcathome-upload.cern.ch. Still unclear whether the limit is set
- at the project server, or
- at the client side, e.g. hardwired or implicitly a libcurl limit.

Since there will not be a quick solution in any case, tasks producing an upload file > 1024 MB are lost and should be cancelled.

As for the Squid workaround mentioned in other posts (client_request_buffer_max_size xyz MB): during the tests the value xyz was set to 100. Nonetheless, files larger than that but < 1024 MB uploaded fine. Only if the option is not set in squid.conf at all do uploads via Squid get stuck. Looks like the option just needs to be there.

Squid version used: 6.6 on Linux. Other Squid versions (especially 5.x) may behave differently.
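For clarity, this is what that workaround looks like in squid.conf. The 100 MB value matches the tests above; per those observations the exact number seems to matter less than the directive being present at all.

    # squid.conf excerpt: cap the buffer Squid uses for a single client request body.
    # 100 MB is the value used in the tests described above; adjust if needed.
    client_request_buffer_max_size 100 MB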
Send message Joined: 14 Sep 08 Posts: 51 Credit: 62,963,540 RAC: 86,039 |
Just to elaborate on my SOCKS proxy, though I believe we've ruled it out already: you are correct that SOCKS doesn't cache. I use SOCKS not for LHC specifically, but to route all BOINC traffic through it, because I need to do traffic shaping for uploads; my stupid asymmetric cable broadband has abysmal upload speed. :-(

As for Squid, I run the tasks natively and have CVMFS installed on each host with the Cloudflare CDN config. I used to have Squid doing transparent caching on the router too, but the hit rate dropped to pretty much nothing after I installed CVMFS locally on each host, so I removed it a long time ago. I'm pretty sure there is no Squid anywhere in my network.
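For anyone curious what such a setup can look like, a sketch of an /etc/cvmfs/default.local along these lines. The repository list and the openhtc.io Stratum-1 alias are assumptions for illustration; check the project's native-CVMFS instructions for the currently recommended servers and repositories.

    # /etc/cvmfs/default.local (sketch; names and sizes are examples to verify
    # against the project's native-CVMFS instructions)
    CVMFS_REPOSITORIES="atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch"
    # no local Squid in this setup
    CVMFS_HTTP_PROXY="DIRECT"
    # local cache size in MB
    CVMFS_QUOTA_LIMIT=8000
    # Cloudflare-fronted Stratum-1 (assumed example alias)
    CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@"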
Send message Joined: 4 Jul 06 Posts: 7 Credit: 339,475 RAC: 0 |
|
Send message Joined: 14 Sep 08 Posts: 51 Credit: 62,963,540 RAC: 86,039 |
> It seems that there was a preliminary misconfiguration of the BOINC jobs, and this should be fixed now.

I suppose this means the server won't be configured to accept the big uploads. If so, will the bad WUs already sent out be aborted from the server side? Or should we just abort the upload after the computation finishes?
Send message Joined: 15 Jul 05 Posts: 247 Credit: 5,974,599 RAC: 0 |
We are still trying to debug what is blocking these file uploads on the servers: we have not set a limit, but the httpd process crashes during these uploads. What is strange is that we have not seen these mpm_prefork crashes before, probably because the files were smaller. It looks like the servers run out of memory during the upload.