Message boards : ATLAS application : Uploading stuck

jelle
Joined: 26 Sep 11
Posts: 37
Credit: 7,704,455
RAC: 259
Message 40994 - Posted: 18 Dec 2019, 2:01:14 UTC

For the last few days I have had upload transfers stuck. Two ATLAS tasks have been trying to upload for at least two days now. The upload file size is only 221 bytes. I have the exact same problem on another computer, so simply turning it off and on again is probably not the solution. I have no problem uploading and reporting to other BOINC projects, so it's unique to LHC@home. Any suggestions?
ID: 40994
lazlo_vii
Joined: 20 Nov 19
Posts: 21
Credit: 1,074,330
RAC: 0
Message 40995 - Posted: 18 Dec 2019, 4:01:41 UTC - in response to Message 40994.  
Last modified: 18 Dec 2019, 4:07:29 UTC

First, I would look at your /etc/boinc-client/cc_config.xml and double check the network settings. I am not saying "It isn't plugged in!" but that should be the first question you answer for yourself. If all is good in your config file I would try to manually update the project from the command line in one X terminal while watching the boinc-client messages in another. First open a terminal on (or to) the host and issue:

watch -n1 boinccmd --get_messages


That terminal will refresh the messages from the boinc-client service every second until you hit Ctrl+C to kill watch.

Open a second terminal on (or to) the same host and issue whichever of these two commands matches your configuration:

boinccmd --project https://lhcathome.cern.ch/lhcathome/ update

or
boinccmd --host localhost --passwd <your_password_for_remote_access> --project https://lhcathome.cern.ch/lhcathome/ update


Switch back to the first terminal and read what boinc-client says about updating.

If that doesn't give you useful information you can try looking at /var/log/syslog and reading man nc, man netstat, and man boinccmd for more clues. Router logs might be useful to you as well.

EDIT: The updating of the project and reading boinc-client's messages can be done easily from the GUI, but what fun is that?
ID: 40995
jelle
Joined: 26 Sep 11
Posts: 37
Credit: 7,704,455
RAC: 259
Message 40998 - Posted: 18 Dec 2019, 8:43:25 UTC - in response to Message 40995.  
Last modified: 18 Dec 2019, 8:46:04 UTC

Thank you for those suggestions. I have done another Update request, in the BOINC Manager GUI, because using terminals seemed overkill. This is recent output from the event log:

Wed 18 Dec 2019 20:44:57 NZDT | LHC@home | Started upload of gCxMDmN9izvn9Rq4apoT9bVoABFKDmABFKDmt4SaDmABFKDmY4ACQo_0_r369045851_ATLAS_hits
Wed 18 Dec 2019 20:44:57 NZDT | LHC@home | Started upload of kspKDmb7wzvnsSi4apGgGQJmABFKDmABFKDmvDwVDmABFKDmzd6Ztn_0_r115146122_ATLAS_hits
Wed 18 Dec 2019 20:46:01 NZDT | LHC@home | Backing off 03:14:20 on upload of gCxMDmN9izvn9Rq4apoT9bVoABFKDmABFKDmt4SaDmABFKDmY4ACQo_0_r369045851_ATLAS_hits
Wed 18 Dec 2019 20:46:01 NZDT | LHC@home | Backing off 04:27:48 on upload of kspKDmb7wzvnsSi4apGgGQJmABFKDmABFKDmvDwVDmABFKDmzd6Ztn_0_r115146122_ATLAS_hits
Wed 18 Dec 2019 21:20:59 NZDT | Universe@Home | Sending scheduler request: Requested by project.
Wed 18 Dec 2019 21:20:59 NZDT | Universe@Home | Requesting new tasks for CPU
Wed 18 Dec 2019 21:21:02 NZDT | Universe@Home | Scheduler request completed: got 1 new tasks
Wed 18 Dec 2019 21:21:04 NZDT | Universe@Home | Started download of universe_bh2_190723_292_448744552_20000_1-999999_745100
Wed 18 Dec 2019 21:21:08 NZDT | Universe@Home | Finished download of universe_bh2_190723_292_448744552_20000_1-999999_745100
Wed 18 Dec 2019 21:24:43 NZDT |  | Suspending GPU computation - computer is in use
Wed 18 Dec 2019 21:35:26 NZDT | LHC@home | project resumed by user
Wed 18 Dec 2019 21:35:29 NZDT | LHC@home | Sending scheduler request: Requested by project.
Wed 18 Dec 2019 21:35:29 NZDT | LHC@home | Requesting new tasks for CPU
Wed 18 Dec 2019 21:35:33 NZDT | LHC@home | update requested by user
Wed 18 Dec 2019 21:35:33 NZDT | LHC@home | Scheduler request completed: got 1 new tasks
Wed 18 Dec 2019 21:35:35 NZDT | LHC@home | Started download of workspace1_hl14_OnErrors_OnOct_NoBB_col_B1_radial_dp_0.00003__1__s__62.31_60.32__13_13.1__6__84_1_sixvf_boinc11654.zip
Wed 18 Dec 2019 21:35:38 NZDT | LHC@home | Finished download of workspace1_hl14_OnErrors_OnOct_NoBB_col_B1_radial_dp_0.00003__1__s__62.31_60.32__13_13.1__6__84_1_sixvf_boinc11654.zip


The top lines represent the failing upload attempt. As you can see, I have no trouble connecting. I even downloaded a new SixTrack task for my effort. However, those two completed ATLAS tasks are just endlessly retrying their upload. Identical problem on another laptop, so it is not machine dependent.
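
To pull just the failing transfers out of a longer event log, a small shell sketch like the following could help. It assumes the log has been saved to a text file in the same "time | project | message" layout as above. The function name list_upload_backoffs and the file name in the usage line are made up for illustration.

```shell
#!/bin/sh
# list_upload_backoffs PROJECT < event_log
# Prints "<backoff HH:MM:SS> <file name>" for every upload backoff
# logged for PROJECT. Fields are split on " | " as in the Event Log text.
list_upload_backoffs() {
    awk -F' [|] ' -v project="$1" '
        $2 == project && $3 ~ /^Backing off .* on upload of / {
            # words: Backing off HH:MM:SS on upload of FILE
            split($3, words, " ")
            print words[3], words[7]
        }
    '
}
```

Usage (eventlog.txt is a hypothetical file holding the copied log): list_upload_backoffs 'LHC@home' < eventlog.txt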

I will let it run that SixTrack task and see if that can upload.
ID: 40998
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,939,826
RAC: 137,467
Message 40999 - Posted: 18 Dec 2019, 8:57:14 UTC - in response to Message 40998.  

Just to be sure your basic network connection works.

You may try:
nc -z -v -w 5 lhcathome-upload.cern.ch 80
Run it from both of your hosts using your BOINC client's account.

Output should look like this:
Connection to lhcathome-upload.cern.ch 80 port [tcp/http] succeeded!


Post any other output here for analysis.
If the command succeeds you may try a reboot.
ID: 40999
jelle
Joined: 26 Sep 11
Posts: 37
Credit: 7,704,455
RAC: 259
Message 41000 - Posted: 18 Dec 2019, 9:58:13 UTC - in response to Message 40999.  

Just to be sure your basic network connection works.


Thank you for the suggestion. I tried that command and got your expected output, so no problem there.
I can also note that the SixTrack task I got on my last update successfully completed, uploaded and reported, so it's only the ATLAS tasks that are stuck.
ID: 41000
jelle
Joined: 26 Sep 11
Posts: 37
Credit: 7,704,455
RAC: 259
Message 41001 - Posted: 18 Dec 2019, 10:11:17 UTC - in response to Message 40999.  

For good measure I just did a reboot as well. Event log from restarting BOINC afterwards (with some irrelevant lines removed) is as follows.

Wed 18 Dec 2019 23:02:53 NZDT |  | Starting BOINC client version 7.9.3 for x86_64-pc-linux-gnu
Wed 18 Dec 2019 23:02:53 NZDT |  | log flags: file_xfer, sched_ops, task
Wed 18 Dec 2019 23:02:53 NZDT |  | Libraries: libcurl/7.58.0 OpenSSL/1.1.1 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
Wed 18 Dec 2019 23:02:53 NZDT |  | Data directory: /var/lib/boinc-client
Wed 18 Dec 2019 23:02:53 NZDT |  | CUDA: NVIDIA GPU 0: GeForce GTX 1050 (driver version 390.11, CUDA version 9.1, compute capability 6.1, 1999MB, 1744MB available, 1960 GFLOPS peak)
Wed 18 Dec 2019 23:02:53 NZDT |  | OpenCL: NVIDIA GPU 0: GeForce GTX 1050 (driver version 390.116, device version OpenCL 1.2 CUDA, 1999MB, 1744MB available, 1960 GFLOPS peak)
Wed 18 Dec 2019 23:02:53 NZDT |  | [libc detection] gathered: 2.27, Ubuntu GLIBC 2.27-3ubuntu1
Wed 18 Dec 2019 23:02:53 NZDT |  | Host name: ZARX1804
Wed 18 Dec 2019 23:02:53 NZDT |  | Processor: 4 GenuineIntel Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz [Family 6 Model 58 Stepping 9]
Wed 18 Dec 2019 23:02:53 NZDT |  | OS: Linux Ubuntu: Ubuntu 18.04.3 LTS [5.0.0-37-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1)]
Wed 18 Dec 2019 23:02:53 NZDT |  | Memory: 15.55 GB physical, 0 bytes virtual
Wed 18 Dec 2019 23:02:53 NZDT |  | Disk: 38.20 GB total, 9.35 GB free
Wed 18 Dec 2019 23:02:53 NZDT |  | Local time is UTC +13 hours
Wed 18 Dec 2019 23:02:53 NZDT |  | VirtualBox version: 6.0.8r130520
Wed 18 Dec 2019 23:02:53 NZDT |  | Config: GUI RPCs allowed from:
Wed 18 Dec 2019 23:02:53 NZDT |  | Last benchmark was 34 days 03:00:57 ago
Wed 18 Dec 2019 23:02:53 NZDT | Asteroids@home | URL http://asteroidsathome.net/boinc/; Computer ID 532806; resource share 25
Wed 18 Dec 2019 23:02:53 NZDT | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID 12751963; resource share 100
Wed 18 Dec 2019 23:02:53 NZDT | LHC@home | URL https://lhcathome.cern.ch/lhcathome/; Computer ID 10543635; resource share 300
Wed 18 Dec 2019 23:02:53 NZDT | Rosetta@home | URL http://boinc.bakerlab.org/rosetta/; Computer ID 3394589; resource share 100
Wed 18 Dec 2019 23:02:53 NZDT | Universe@Home | URL https://universeathome.pl/universe/; Computer ID 489260; resource share 100
Wed 18 Dec 2019 23:02:53 NZDT |  | Running CPU benchmarks
Wed 18 Dec 2019 23:02:53 NZDT |  | Suspending computation - CPU benchmarks in progress
Wed 18 Dec 2019 23:03:24 NZDT |  | Benchmark results:
Wed 18 Dec 2019 23:03:24 NZDT |  | Number of CPUs: 2
Wed 18 Dec 2019 23:03:24 NZDT |  | 4343 floating point MIPS (Whetstone) per CPU
Wed 18 Dec 2019 23:03:24 NZDT |  | 126625 integer MIPS (Dhrystone) per CPU
Wed 18 Dec 2019 23:03:25 NZDT |  | Suspending GPU computation - computer is in use
Wed 18 Dec 2019 23:03:40 NZDT | LHC@home | Started upload of gCxMDmN9izvn9Rq4apoT9bVoABFKDmABFKDmt4SaDmABFKDmY4ACQo_0_r369045851_ATLAS_hits
Wed 18 Dec 2019 23:03:59 NZDT | Rosetta@home | project resumed by user
Wed 18 Dec 2019 23:04:12 NZDT | Universe@Home | project resumed by user
Wed 18 Dec 2019 23:04:17 NZDT | Universe@Home | work fetch resumed by user
Wed 18 Dec 2019 23:04:17 NZDT | Rosetta@home | work fetch resumed by user
Wed 18 Dec 2019 23:04:44 NZDT | LHC@home | Backing off 04:49:32 on upload of gCxMDmN9izvn9Rq4apoT9bVoABFKDmABFKDmt4SaDmABFKDmY4ACQo_0_r369045851_ATLAS_hits
Wed 18 Dec 2019 23:05:24 NZDT | LHC@home | Started upload of kspKDmb7wzvnsSi4apGgGQJmABFKDmABFKDmvDwVDmABFKDmzd6Ztn_0_r115146122_ATLAS_hits
Wed 18 Dec 2019 23:06:26 NZDT | LHC@home | Backing off 03:30:16 on upload of kspKDmb7wzvnsSi4apGgGQJmABFKDmABFKDmvDwVDmABFKDmzd6Ztn_0_r115146122_ATLAS_hits


So the ATLAS files are still receiving a project backoff when they try to upload, while everything else uploads fine.
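
For anyone who wants to spot such transfers in a long log, here is a rough sketch that lists files with a "Started upload of" line but no later "Finished upload of" line. It assumes the same "time | project | message" layout as above, and that a successful upload is logged as "Finished upload of <file>" (mirroring the "Finished download of" lines BOINC writes). The function name stuck_uploads is hypothetical.

```shell
#!/bin/sh
# stuck_uploads < event_log
# Prints, sorted, every file that started uploading but never finished.
stuck_uploads() {
    awk -F' [|] ' '
        $3 ~ /^Started upload of / {
            name = $3
            sub(/^Started upload of /, "", name)
            pending[name] = 1      # remember the file until it finishes
        }
        $3 ~ /^Finished upload of / {
            name = $3
            sub(/^Finished upload of /, "", name)
            delete pending[name]   # upload completed, no longer stuck
        }
        END { for (name in pending) print name }
    ' | sort
}
```

Fed the log above, it should list the two ATLAS_hits files, since neither ever reaches a "Finished upload" line.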
ID: 41001
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 41003 - Posted: 18 Dec 2019, 12:27:04 UTC

It may be that the ATLAS server has dropped your tasks on the server side.
In your hosts' task lists you may see that some of the ATLAS tasks you are trying to upload have already been finished by another computer.
There is only a small window of three or four days; after that, other users get the same ATLAS tasks as well.
It is hard, but in that case the best option is to abort your tasks.
I would keep an eye on it in future, because you do have ATLAS tasks that finished correctly.
ID: 41003
PekkaH
Joined: 23 Dec 19
Posts: 15
Credit: 29,916,843
RAC: 39,679
Message 48173 - Posted: 2 Jun 2023, 7:37:04 UTC

Hi,
I have a number of hosts (~20 of them) that are experiencing stuck ATLAS uploads.
I have checked my proxy, but other workloads like SixTrack and Theory work correctly, so I suspect an issue with ATLAS itself.
Is anyone else experiencing a similar situation?

/Pekka
ID: 48173
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,939,826
RAC: 137,467
Message 48174 - Posted: 2 Jun 2023, 7:54:28 UTC - in response to Message 48173.  

Do you use Squid 5.x?
If so, add the following line to squid.conf:
client_request_buffer_max_size 512 MB


then run:
[sudo] squid -k reconfigure
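
If the reconfigure seems to change nothing, it may be worth confirming first that the directive is present as an active line (not commented out or misspelled). A minimal sketch follows. The function name show_request_buffer_limit is made up, and the path /etc/squid/squid.conf in the usage line is an assumption about a typical Linux package install.

```shell
#!/bin/sh
# show_request_buffer_limit CONF_FILE
# Prints every active (non-commented) client_request_buffer_max_size line
# in CONF_FILE. The exit status is non-zero when no such line exists.
show_request_buffer_limit() {
    grep -E '^[[:space:]]*client_request_buffer_max_size[[:space:]]+[0-9]' "$1"
}
```

Usage: show_request_buffer_limit /etc/squid/squid.conf || echo "directive not set"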
ID: 48174
PekkaH
Joined: 23 Dec 19
Posts: 15
Credit: 29,916,843
RAC: 39,679
Message 48175 - Posted: 2 Jun 2023, 9:43:16 UTC - in response to Message 48174.  

Thanks,

Yes, I have Ubuntu 22.04 and Squid 5.2.
I added the said conf option, but squid -k reconfigure has had no effect (at least not yet).
I will restart the Squid VM ....

/pekka
ID: 48175
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 48177 - Posted: 2 Jun 2023, 10:27:01 UTC - in response to Message 48175.  

Wed 18 Dec 2019 23:02:53 NZDT | | Starting BOINC client version 7.9.3 for x86_64-pc-linux-gnu
Wed 18 Dec 2019 23:02:53 NZDT | | log flags: file_xfer, sched_ops, task
Do you have timestamp problems?
Is the battery empty?
ID: 48177
PekkaH
Joined: 23 Dec 19
Posts: 15
Credit: 29,916,843
RAC: 39,679
Message 48179 - Posted: 2 Jun 2023, 11:00:40 UTC - in response to Message 48177.  

Hi,

I don't see timestamp issues in the logs of the uploads that have gone through (on the same host). A few ATLAS jobs have uploaded successfully, but many are hanging. My setup has its own NTP server, which the servers are constantly syncing with.
And by the way, the setup has been running for months without major issues. The stuck ATLAS uploads started to manifest around 30 May, 15:40 EET. No problems before that for many months.

/pekka
ID: 48179
PekkaH
Joined: 23 Dec 19
Posts: 15
Credit: 29,916,843
RAC: 39,679
Message 48183 - Posted: 2 Jun 2023, 15:36:36 UTC - in response to Message 48179.  

Hi,
on the client side, in the Transfers tab, I can see lots of project backoffs ...
/pekka
ID: 48183
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2386
Credit: 222,939,826
RAC: 137,467
Message 48184 - Posted: 2 Jun 2023, 16:05:04 UTC - in response to Message 48183.  
Last modified: 2 Jun 2023, 16:08:38 UTC

... but squid -k reconfigure has no effect...
... atlas upload stuck started to manifest itself around 30.5, 1540 eet. No problems before that for many months. ...


Did your uploads succeed via the same proxy in the past?
Did you recently upgrade something on the proxy box, on your router or your firewall?

It may or may not be the proxy that causes the trouble.
To find out whether the proxy works as expected, check some commands/logs:
"squid -k parse" (check the output for errors)
If that command doesn't print an error, run "squid -k reconfigure" again (as root).
This (usually) writes messages to /var/log/squid/cache.log => check it for recent errors.
Also check /var/log/squid/access.log for the corresponding lines when your clients contact the internet.
=> "TCP_MEM_HIT", "TCP_HIT", "TCP_REFRESH..." are fine; "...ABORTED...", "...DENIED..." are bad and need further investigation.


If none of the commands/log entries indicate a proxy error, the problems most likely have another origin.
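
The access.log check can be sketched as a small filter. It assumes Squid's default (native) log layout, where the fourth whitespace-separated field is the result code and HTTP status (for example TCP_MISS/200), the sixth is the request method and the seventh is the URL. The function name squid_bad_requests is made up for illustration.

```shell
#!/bin/sh
# squid_bad_requests < access.log
# Prints "<result/status> <method> <url>" for every request whose
# result code contains ABORTED or DENIED.
squid_bad_requests() {
    awk '$4 ~ /ABORTED|DENIED/ { print $4, $6, $7 }'
}
```

Usage (may need root to read the log): squid_bad_requests < /var/log/squid/access.log | tail -n 20
No output means no aborted or denied requests were logged, which points away from the proxy.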

<edit>
Just in case your older logs include useful hints, make your computers visible on the prefs page.
</edit>
ID: 48184
PekkaH
Joined: 23 Dec 19
Posts: 15
Credit: 29,916,843
RAC: 39,679
Message 48185 - Posted: 2 Jun 2023, 17:31:56 UTC - in response to Message 48184.  

Hi,

As some ATLAS jobs got uploaded and SixTrack, CMS and Theory work as expected, I don't suspect Squid anymore.
Instead, I think something is causing the project backoff on the CERN server side. One of my hosts managed to upload all jobs, whereas there are still 3 more with hanging uploads ...

/pekka
ID: 48185
maeax
Joined: 2 May 07
Posts: 2071
Credit: 156,128,280
RAC: 105,358
Message 48186 - Posted: 2 Jun 2023, 18:44:11 UTC - in response to Message 48185.  

If you have Squid set as the proxy in BOINC, you can disable it in the BOINC preferences and save.
If the ATLAS uploads do not start correctly, are these tasks listed under your LHC@home account?
ID: 48186
PekkaH
Joined: 23 Dec 19
Posts: 15
Credit: 29,916,843
RAC: 39,679
Message 48187 - Posted: 3 Jun 2023, 8:03:43 UTC - in response to Message 48186.  

Hi,
Thanks for the support, everyone. It seems the problem was on the CERN server side, as all hosts have now managed to upload their results. The only change I made this time was adding the Squid "client_req..." conf option, which was previously at its default. But that change didn't alter the behavior at my end; the queues started to clear themselves yesterday afternoon and now all is fine again.
No further actions needed.
/pekka
ID: 48187
kotenok2000
Joined: 21 Feb 11
Posts: 58
Credit: 539,580
RAC: 73
Message 48202 - Posted: 7 Jun 2023, 12:26:10 UTC - in response to Message 48177.  

Nice necropost.
ID: 48202
Mr P Hucker
Joined: 12 Aug 06
Posts: 418
Credit: 5,667,249
RAC: 48
Message 48634 - Posted: 23 Sep 2023, 3:33:57 UTC

If it's still relevant....
ID: 48634

©2024 CERN