Message boards : ATLAS application : hits file upload fails immediately

Ken_g6

Joined: 4 Jul 06
Posts: 7
Credit: 339,011
RAC: 36
Message 49671 - Posted: 1 Mar 2024, 8:38:37 UTC

Yes, it's the big one. No, I'm not using any proxy. Ubuntu 20.04 with BOINC 7.16.6. I've tried lots of times but it just won't upload.

I found a <file_xfer_debug> flag in cc_config.xml to show more:

Fri 01 Mar 2024 01:34:06 AM MST | LHC@home | [fxd] starting upload, upload_offset -1
Fri 01 Mar 2024 01:34:06 AM MST | LHC@home | Started upload of mhzMDmrq1z4np2BDcpmwOghnABFKDmABFKDm73LSDmi7hKDm0utSzn_0_r514301471_ATLAS_hits
Fri 01 Mar 2024 01:34:06 AM MST | LHC@home | [file_xfer] URL: http://lhcathome-upload.cern.ch/lhcathome_cgi/file_upload_handler
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [file_xfer] http op done; retval 0 (Success)
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [file_xfer] parsing upload response: <data_server_reply>    <status>0</status>    <file_size>0</file_size></data_server_reply>
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [file_xfer] parsing status: 0
Fri 01 Mar 2024 01:34:08 AM MST | LHC@home | [fxd] starting upload, upload_offset 0
Fri 01 Mar 2024 01:34:09 AM MST | LHC@home | [file_xfer] http op done; retval -224 (permanent HTTP error)
Fri 01 Mar 2024 01:34:09 AM MST | LHC@home | [file_xfer] file transfer status -224 (permanent HTTP error)
Fri 01 Mar 2024 01:34:09 AM MST | LHC@home | Backing off 04:15:39 on upload of mhzMDmrq1z4np2BDcpmwOghnABFKDmABFKDm73LSDmi7hKDm0utSzn_0_r514301471_ATLAS_hits
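In case anyone wants the same output: this is roughly the cc_config.xml I used to turn the flag on (it goes in the BOINC data directory; the client picks it up after "Read config files" or a restart):

<cc_config>
  <log_flags>
    <!-- log details of file uploads and downloads -->
    <file_xfer_debug>1</file_xfer_debug>
  </log_flags>
</cc_config>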

ID: 49671
Senture

Joined: 16 May 16
Posts: 3
Credit: 13,107,359
RAC: 256
Message 49672 - Posted: 1 Mar 2024, 9:24:24 UTC

Got the same problem myself with two computers. Interestingly, the HITS files that fail to upload are much larger than the others: typical result files seem to be in the 870-890 MB range, while these are over 1.3 GB.

Had more than a dozen smaller result files upload successfully while these have been stuck, so I presume something on the server side is blocking the larger-than-usual files.
ID: 49672
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2418
Credit: 226,729,019
RAC: 130,317
Message 49673 - Posted: 1 Mar 2024, 9:34:03 UTC - in response to Message 49671.  
Last modified: 1 Mar 2024, 9:35:32 UTC

Looks like the upload server is OK.
Just uploaded 2 HITS files without issues, size > 900 MB each.

ATLAS downloads go via the same server.
Got a fresh EVNT file a few minutes ago, also without issues.

The log snippets just tell you what you already know:
HTTP connections fail.


Possibly not directly related, but faking 100 cores for an i7-6700 CPU might cause unwanted side effects:
https://lhcathome.cern.ch/lhcathome/show_host_detail.php?hostid=10504663
Owner 	Ken_g6
Created 	31 Oct 2017, 17:33:02 UTC
CPU type 	GenuineIntel
Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz [Family 6 Model 94 Stepping 3]
Number of processors 	100
ID: 49673
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2418
Credit: 226,729,019
RAC: 130,317
Message 49675 - Posted: 1 Mar 2024, 13:05:05 UTC - in response to Message 49672.  

Those issues are typical for ATLAS tasks configured to run too many events (1000 or even 2000).
Looks like there is a batch of them in the queue (I got a 2000-event task myself today).

There's not really a solution, other than that such a huge number of events should not be configured by the submitter.
As you can read in another thread, a couple of volunteers have been aggressively requesting those tasks, contrary to all experience and to the agreements made in the past.
ID: 49675
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 52,183,546
RAC: 142,879
Message 49676 - Posted: 1 Mar 2024, 17:19:29 UTC - in response to Message 49675.  
Last modified: 1 Mar 2024, 17:27:11 UTC


There's not really a solution, other than that such a huge number of events should not be configured by the submitter.
As you can read in another thread, a couple of volunteers have been aggressively requesting those tasks, contrary to all experience and to the agreements made in the past.

Well, we want bigger jobs, assuming the project server can handle them. If the LHC servers can't handle the upload, then of course they shouldn't issue such broken WUs. That's no different from any other project: nobody should release WUs that won't work, big or small. It's just a waste of everyone's time and resources.

If there were past agreements, does that mean whoever is submitting these batches is either new or not aware of the issues? In the meantime, is the only solution to abort the upload and thereby fail these finished tasks?
ID: 49676
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 52,183,546
RAC: 142,879
Message 49677 - Posted: 1 Mar 2024, 17:21:53 UTC

All my stuck uploads are around 1.4 GB, while the 800-900 MB ones upload without a problem at the same time. I suppose the stuck ones are the 2000-event WUs and the server is not configured to accept files over some threshold?
ID: 49677
Saturn911

Joined: 3 Nov 12
Posts: 36
Credit: 118,028,939
RAC: 128,729
Message 49678 - Posted: 1 Mar 2024, 18:03:11 UTC - in response to Message 49677.  
Last modified: 1 Mar 2024, 18:16:36 UTC

Sounds to me like the old upload malfunction of the local Squid.
Have you set the upload cache large enough for these gigabyte-sized files?

For a temporary workaround, try this:
set the no-proxy option for "lhcathome-upload.cern.ch" in the BOINC Manager.
Works here like a charm.

See here:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5474&postid=47465
ID: 49678
Senture

Joined: 16 May 16
Posts: 3
Credit: 13,107,359
RAC: 256
Message 49679 - Posted: 1 Mar 2024, 18:42:22 UTC - in response to Message 49678.  

How would one go about setting this if not using a squid cache?
ID: 49679
Saturn911

Joined: 3 Nov 12
Posts: 36
Credit: 118,028,939
RAC: 128,729
Message 49680 - Posted: 1 Mar 2024, 18:55:19 UTC - in response to Message 49679.  

How would one go about setting this if not using a squid cache?


Hit and sunk
ID: 49680
Senture

Joined: 16 May 16
Posts: 3
Credit: 13,107,359
RAC: 256
Message 49681 - Posted: 1 Mar 2024, 19:04:02 UTC - in response to Message 49680.  

Hit and sunk

Is that supposed to mean there's no way of doing it?
ID: 49681
Jonathan

Joined: 25 Sep 17
Posts: 99
Credit: 3,261,384
RAC: 3,595
Message 49682 - Posted: 1 Mar 2024, 19:43:32 UTC - in response to Message 49681.  
Last modified: 1 Mar 2024, 19:44:28 UTC

If you are not using Squid, there is nothing to set.
Are you having problems with uploads?
ID: 49682
Ken_g6

Joined: 4 Jul 06
Posts: 7
Credit: 339,011
RAC: 36
Message 49683 - Posted: 1 Mar 2024, 20:03:10 UTC - in response to Message 49673.  

Possibly not directly related, but faking 100 cores for an i7-6700 CPU might cause unwanted side effects

I like to set the number of cores to use directly, instead of calculating a percentage of the CPU in my head. I almost never set it to more than 8.
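(For anyone wondering, setting a fixed core count like that is normally done with the <ncpus> option in cc_config.xml rather than the "use at most X% of processors" preference; roughly:

<cc_config>
  <options>
    <!-- report/use this many CPUs instead of the real count; -1 = use real count -->
    <ncpus>8</ncpus>
  </options>
</cc_config>
)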

Anyway, I got one task in; hopefully this other one will go eventually.
ID: 49683
Saturn911

Joined: 3 Nov 12
Posts: 36
Credit: 118,028,939
RAC: 128,729
Message 49684 - Posted: 1 Mar 2024, 20:29:31 UTC - in response to Message 49681.  

Is that supposed to mean there's no way of doing it?


That's right.
You shoot and hit, so my boat sinks and you win.
ID: 49684
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 52,183,546
RAC: 142,879
Message 49685 - Posted: 2 Mar 2024, 1:56:46 UTC - in response to Message 49678.  

I'm not using Squid, though. I do use a SOCKS5 proxy, but bypassing it didn't help either.
ID: 49685
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2418
Credit: 226,729,019
RAC: 130,317
Message 49686 - Posted: 2 Mar 2024, 9:11:30 UTC - in response to Message 49685.  

What makes the project's downloads faster is the local cache Squid provides for HTTP objects.
Its proxy function comes on top automatically.

SOCKS proxies usually don't cache anything; does yours?
If not, you may try putting a Squid between your clients and the SOCKS proxy.

Either this:
Internet <-> local Router <-> SOCKS <-> Squid <-> local clients

Or this:
Internet <-> local Router <-> Squid <-> local clients
ID: 49686
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert

Joined: 15 Jun 08
Posts: 2418
Credit: 226,729,019
RAC: 130,317
Message 49688 - Posted: 2 Mar 2024, 14:41:54 UTC

Did some tests to find out whether an upload size limit exists.
It does.
:-(


Looks like files > 1024 MB do not upload to lhcathome-upload.cern.ch.

Still unclear whether the limit is set
- at the project server
- at the client side, e.g. hardwired or an implicit libcurl limit

Since there will not be a quick solution in any case, tasks producing an upload file > 1024 MB are lost and should be cancelled.



As for the Squid workaround mentioned in other posts:
client_request_buffer_max_size xyz MB

During the tests the value xyz was set to 100.
Nonetheless, files larger than that but < 1024 MB uploaded fine.

Uploads via Squid get stuck only if the option is not set in squid.conf at all.
Looks like the option just needs to be present.

Squid version used: v6.6 on Linux
Other Squid versions (especially 5.x) may behave differently.
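For context, a minimal sketch of where the option sits in a squid.conf for a BOINC LAN proxy (network range, cache path and sizes are just examples, not the exact test setup):

# example values - adapt the network range, cache path and sizes to your LAN
http_port 3128
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all
cache_dir ufs /var/spool/squid 20000 16 256
maximum_object_size 4 GB
# the option discussed above; uploads got stuck only when it was missing entirely
client_request_buffer_max_size 100 MB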
ID: 49688
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 52,183,546
RAC: 142,879
Message 49689 - Posted: 2 Mar 2024, 17:29:36 UTC - in response to Message 49686.  

Just to elaborate on my SOCKS proxy, though I believe we've ruled it out already. You are correct that SOCKS doesn't cache. I use SOCKS not for LHC specifically, but to route all BOINC traffic through it: I need to do traffic shaping for uploads because my asymmetric cable broadband has abysmal upload speed. :-(

As for Squid, I run the native app and have CVMFS installed on each host with the Cloudflare CDN config. I used to have Squid doing transparent caching on the router too, but the hit rate dropped to pretty much nothing after I installed CVMFS locally on each host, so I removed it a long time ago; I'm pretty sure there is no Squid anywhere in my network.
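For anyone curious about the CVMFS side, a minimal /etc/cvmfs/default.local for this kind of setup looks roughly like this (repository list and quota are illustrative; the openhtc.io / Cloudflare server URLs are usually placed in /etc/cvmfs/domain.d/cern.ch.local):

# /etc/cvmfs/default.local - example values
CVMFS_REPOSITORIES="atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch"
CVMFS_QUOTA_LIMIT=4096      # local cache size in MB
CVMFS_HTTP_PROXY="DIRECT"   # no local Squid, talk to the CDN directly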
ID: 49689
Ken_g6

Joined: 4 Jul 06
Posts: 7
Credit: 339,011
RAC: 36
Message 49690 - Posted: 2 Mar 2024, 22:30:27 UTC

I did a little research. It seems it should be possible to configure Apache and PHP to accept files of up to 2 GB; beyond that it starts hitting integer overflows.
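If the upload handler really is fronted by Apache with PHP in the path (that part is my guess), the stock knobs would be something like:

# Apache: request body limit in bytes (0 = unlimited, maximum 2147483647)
LimitRequestBody 2147483647

; php.ini: only relevant if PHP actually handles the upload
upload_max_filesize = 2G
post_max_size = 2G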
ID: 49690
wujj123456

Joined: 14 Sep 08
Posts: 43
Credit: 52,183,546
RAC: 142,879
Message 49698 - Posted: 5 Mar 2024, 4:52:04 UTC - in response to Message 49692.  
Last modified: 5 Mar 2024, 4:52:14 UTC

It seems that there was a preliminary misconfiguration of the BOINC jobs, and this should be fixed now.

I suppose this means the server won't be configured to accept the big uploads. If so, will the bad WUs already sent out be aborted from the server side, or should we just abort the upload after the computation finishes?
ID: 49698
Nils Høimyr
Volunteer moderator
Project administrator
Project developer
Project tester

Joined: 15 Jul 05
Posts: 242
Credit: 5,800,306
RAC: 0
Message 49699 - Posted: 5 Mar 2024, 7:10:35 UTC - in response to Message 49698.  
Last modified: 5 Mar 2024, 7:14:02 UTC

We are still trying to debug what is blocking these file uploads on the servers: we have not set a limit, but the httpd process crashes during these uploads. What is strange is that we have not seen these mpm prefork crashes before, probably because the files were smaller. It looks like the servers run out of memory during the upload.
ID: 49699