Sage Weil <sage@xxxxxxxxxxx> writes:

> Hi,
>
> A ceph user hit a problem with the 3.5 precise kernel with symptoms
> exactly like an old poll(2) bug[1]. Basically, one end of a socket is
> blocked on sendmsg(2), and the other end is blocked on poll(2) waiting for
> data. 15 minutes later the poll(2) timeout triggers, we reset the
> connection, and ceph recovers and continues. (For this user, the visible
> ceph symptoms were stuck peering, stuck recovery, or hung requests that
> *eventually* cleared themselves up.)
>
> In this case, it doesn't look like the 3.5.0-37 kernel has the old
> problematic patch (which first appeared in 3.6-rc1 and was fixed before
> 3.6 was released), but we see the exact same behavior (blocked writer,
> blocked reader/poller, but netstat showing bytes available on the socket),
> and upgrading the kernel to the current 3.8 precise package resolved the
> problem. The 3.5 ubuntu kernel does have a few sendmsg patches[2] that
> (under the circumstances) appear suspicious.
>
> The one other detail in this case is that it seemed to only crop up
> connections involving one node in the system.
>
> I'm not sure where to go from here, since the user is happy to now have a
> working system, and I'm not sure if it is worth spending the time to
> reproduce the issue. It might be simpler to just recommend users move off
> the 3.5 kernel. In the meantime, though, I wanted to at least make
> everyone aware of the (potential) problem.
>
> sage
>
>
> [1] http://marc.info/?l=ceph-devel&m=134540224811321&w=2
> [2] https://launchpad.net/ubuntu/+source/linux-lts-quantal/3.5.0-37.58~precise1

I believe the suspicious commits you're referring to in the Quantal
kernel are:

  1be374a net: Block MSG_CMSG_COMPAT in send(m)msg and recv(m)msg
  a7526eb net: Unbreak compat_sys_{send,recv}msg

Both of these commits came in through upstream stable updates and are
clean cherry-picks. All the upstream stable kernels seem to contain them.
[ Note, however, that most of the stable kernels have squashed these 2
commits into a single commit. ]

This means that, if you're correct, the Raring kernel is likely to have
this issue as well: the 3.8.0-27.40 Raring kernel also contains these 2
commits.

Could you please confirm that the user who reported this issue is running
this kernel (or later)?

Cheers,
--
Luis