On Thu, 20 Dec 2018 at 15:11, Jeff King <peff@xxxxxxxx> wrote:
>
> OK, that's about what I expected. Here we have clone's sideband-demux
> thread waiting to pull more packfile data from the remote:
>
> > (gdb) thread apply all bt
> >
> > Thread 2 (Thread 0x7faafbf1c700 (LWP 36586)):
> > #0 0x00007faafc805384 in __libc_read (fd=fd@entry=5,
> >    buf=buf@entry=0x7faafbf0ddec, nbytes=nbytes@entry=5)
> >    at ../sysdeps/unix/sysv/linux/read.c:27
> > #1 0x000055c8ca2f5b23 in read (__nbytes=5, __buf=0x7faafbf0ddec, __fd=5)
> >    at /usr/include/x86_64-linux-gnu/bits/unistd.h:44
> > [...coming from packet_read / recv_sideband / sideband_demux...]
>
> I assume fd=5 there is a pipe connected to ssh. You could double check
> with "lsof" or similar, but I don't think it would ever be reading from
> anywhere else.

I checked and in all cases git was reading from a pipe.

[snip]

> with each blocking on read() from its predecessor. So you need to find
> out why "ssh" is blocking. Unfortunately, short of a bug in ssh, the
> likely cause is either:
>
>   1. The git-upload-pack on the remote side stopped generating data for
>      some reason. You may or may not have access on the remotehost to
>      dig into that.
>
>      It's certainly possible there's a deadlock bug between the server
>      and client side of a Git conversation. But I'd find it extremely
>      unlikely to find such a deadlock bug at this point in the
>      conversation, because at this point the client side has nothing
>      left to say to the server. The server should just be streaming out
>      the packfile bytes and then closing the descriptor.

I think it's highly unlikely too given how many good runs we generally
have.

>      You mentioned "Phabricator sshd scripts" running on the server.
>      I don't know what Phabricator might be sticking in the middle of
>      the connection, but that could be the source of the stall.

I think you're right.
I set up a separate sshd on a different port on the same machine where
there were no Phabricator callouts and the problem never manifested...

>   2. It's possible the network connection dropped but ssh did not
>      notice. Maybe try turning on ssh keepalives and seeing if it
>      eventually times out?

I had already done this but the problem still manifested. I went so far
as to check whether the problem would happen over the loopback device on
the same machine (via localhost), and it kept happening, so I'm fairly
sure that rules out networking issues.

-- 
Sitsofe | http://sucs.org/~sits/