On Mon, Dec 7, 2015 at 9:56 PM, Matt Conner <matt.conner@xxxxxxxxxxxxxx> wrote:
> Hi,
>
> We have a Ceph cluster in which we have been having issues with RBD
> clients hanging when an OSD failure occurs. We are using a NAS gateway
> server which maps RBD images to filesystems and serves the filesystems
> out via NFS. The gateway server has close to 180 NFS clients, and
> almost every time even one OSD goes down during heavy load, the NFS
> exports lock up and the clients are unable to access the NAS share via
> NFS. When the OSD fails, Ceph recovers without issue, but the kernel
> RBD module on the gateway appears to get stuck waiting on the now
> failed OSD. Note that everything works correctly under lighter loads.
>
> From what we have been able to determine, the NFS server daemon hangs
> waiting for I/O from the OSD that went down and never recovers.
> Similarly, attempting to access files from the exported filesystem
> locally on the gateway server results in a similar hang. We also
> noticed that "ceph health detail" continues to report blocked I/O on
> the now-down OSD until either the OSD recovers or the gateway server
> is rebooted. Based on a few kernel logs from NFS and PVS, we were able
> to trace the problem to the RBD kernel module.
>
> Unfortunately, the only way we have been able to recover our gateway
> is by hard rebooting the server.
>
> Has anyone else encountered this issue and/or have a possible solution?
> Are there suggestions for getting more detailed debugging information
> from the RBD kernel module?

Dumping osdc, osdmap and ceph status when it gets stuck would be a start:

# cat /sys/kernel/debug/ceph/*/osdmap
# cat /sys/kernel/debug/ceph/*/osdc

$ ceph status

Thanks,

                Ilya
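
For reference, a minimal sketch of a collection script built around those
commands might look like the one below. It assumes debugfs is mounted at
/sys/kernel/debug and that the ceph CLI is configured on the gateway; the
output directory name and file names are arbitrary, not anything the thread
prescribes.

#!/bin/sh
# Sketch: gather kernel client and cluster state for a suspected RBD hang.
# Assumes debugfs is mounted at /sys/kernel/debug and "ceph" works on this
# host; the output location below is only an example.
out=/tmp/rbd-hang-$(date +%Y%m%d-%H%M%S)
mkdir -p "$out"

# Per-client osdmap and in-flight OSD requests from the kernel client.
for d in /sys/kernel/debug/ceph/*/; do
    name=$(basename "$d")
    cat "${d}osdmap" > "$out/$name.osdmap" 2>&1
    cat "${d}osdc"   > "$out/$name.osdc"   2>&1
done

# Cluster-side view, including any blocked/slow request warnings.
ceph status        > "$out/ceph-status.txt"        2>&1
ceph health detail > "$out/ceph-health-detail.txt" 2>&1

echo "collected debug state in $out"

Capturing these at the moment the NFS exports lock up, rather than after a
reboot, is what makes the osdc output useful: it shows which requests the
kernel client still thinks are outstanding and against which OSD.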