Hi,

A ceph user hit a problem with the 3.5 precise kernel with symptoms
exactly like an old poll(2) bug[1]. Basically, one end of a socket is
blocked on sendmsg(2), and the other end is blocked on poll(2) waiting
for data. 15 minutes later the poll(2) timeout triggers, we reset the
connection, and ceph recovers and continues. (For this user, the visible
ceph symptoms were stuck peering, stuck recovery, or hung requests that
*eventually* cleared themselves up.)

In this case, it doesn't look like the 3.5.0-37 kernel has the old
problematic patch (which first appeared in 3.6-rc1 and was fixed before
3.6 was released), but we see the exact same behavior (blocked writer,
blocked reader/poller, but netstat showing bytes available on the
socket), and upgrading the kernel to the current 3.8 precise package
resolved the problem. The 3.5 Ubuntu kernel does have a few sendmsg
patches[2] that (under the circumstances) appear suspicious.

The one other detail in this case is that it seemed to crop up only on
connections involving one node in the system.

I'm not sure where to go from here, since the user is happy to now have
a working system, and I'm not sure if it is worth spending the time to
reproduce the issue. It might be simpler to just recommend that users
move off the 3.5 kernel. In the meantime, though, I wanted to at least
make everyone aware of the (potential) problem.

sage

[1] http://marc.info/?l=ceph-devel&m=134540224811321&w=2
[2] https://launchpad.net/ubuntu/+source/linux-lts-quantal/3.5.0-37.58~precise1