On Wed, 4 May 2011, Jim Schutt wrote:
> Hi,
>
> I'm seeing clients having trouble reconnecting after timed-out
> requests. When they get in this state, sometimes they manage
> to reconnect after several attempts; sometimes they never seem
> to be able to reconnect.

Hmm, the interesting line is

> 2011-05-04 16:00:59.710971 7f15d6948940 -- 172.17.40.30:6806/12583 >>
> 172.17.40.49:0/302440129 pipe(0x213fa000 sd=91 pgs=430 cs=1 l=1).reader bad
> tag 0

That _should_ mean the server side (osd) closes out the connection
immediately, which should generate a disconnect error on the client and
an immediate reconnect. So it's strange that you're also seeing timeouts.

Of course, we shouldn't be getting bad tags anyway, so something else is
clearly wrong and may be contributing to both problems.

How easy is this to reproduce? It's right after a fresh connection, so
the number of possibly offending code paths is pretty small, at least!

There is client-side debugging to turn on, but it's very chatty. Maybe
you can just enable a few key lines, like the connect handshake ones, and
any point where we queue/send a tag. It's a bit tedious to enable the
individual dout lines in messenger.c, sadly, but unless you have a very
fast netconsole or something that's probably the only way to go...

sage
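
One possible way to enable only a few dout lines is the kernel's dynamic debug control file. The sketch below is an illustration, not something taken from this thread. It assumes the client kernel is built with CONFIG_DYNAMIC_DEBUG, that dout expands to pr_debug in that build, and that debugfs is mounted at /sys/kernel/debug. The function names are assumed examples of connect handshake code in net/ceph/messenger.c, so check them against your own tree. The script only prints the commands; it does not write to the control file.

```python
# Sketch only: builds dynamic-debug queries and prints shell commands.
# It never touches the kernel itself.

# Assumption: debugfs is mounted at /sys/kernel/debug.
CONTROL = "/sys/kernel/debug/dynamic_debug/control"


def build_queries(source_file, funcs):
    """Return one '+p' (enable printing) query per function name."""
    return [f"file {source_file} func {name} +p" for name in funcs]


# Assumed function names for the connect handshake; verify in your tree.
HANDSHAKE_FUNCS = ["prepare_write_connect", "process_connect"]

queries = build_queries("net/ceph/messenger.c", HANDSHAKE_FUNCS)

for q in queries:
    # Each printed line can be run as root on the client to enable
    # the debug statements in that one function.
    print(f"echo -n '{q}' > {CONTROL}")
```

Replacing +p with -p in the same query turns the lines back off. Matching on "line N" instead of "func NAME" narrows this to a single dout call.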