On 10/22/2015 10:57 PM, John-Paul Robinson wrote:
> Hi,
>
> Has anyone else experienced a problem with RBD-to-NFS gateways blocking
> nfsd server requests when their ceph cluster has a placement group that
> is not servicing I/O for some reason, e.g. too few replicas or an OSD
> with slow request warnings?
>
> We have an RBD-NFS gateway that stops responding to NFS clients
> (interaction with RBD-backed NFS shares hangs on the NFS client)
> whenever some part of our ceph cluster is in an I/O-blocked condition.
> This issue only affects the ability of the nfsd processes to serve
> requests to their clients. I can look at and access the underlying
> mounted RBD containers without issue, although they appear hung from
> the NFS client side. The gateway node load numbers spike to a value
> that reflects the number of nfsd processes, but the system is otherwise
> untaxed (unlike a normal high-load situation, i.e. I can type and run
> commands with normal responsiveness).
>

Well, that is normal, I think. Certain objects become unresponsive if a
PG is not serving I/O. With a simple 'ls' or 'df -h' you might not be
touching those objects, so for you it seems like everything is
functioning. The nfsd process, however, might be hung on a blocking I/O
call. That is completely normal and to be expected.

That it hangs the complete NFS server might just be a side effect of
how nfsd was written. It might be that Ganesha works better for you:
http://blog.widodh.nl/2014/12/nfs-ganesha-with-libcephfs-on-ubuntu-14-04/

> The behavior comes across like there is some nfsd global lock that an
> nfsd sets before requesting I/O from a backend device. In the case
> above, the I/O request hangs on one RBD image affected by the I/O block
> caused by the problematic PG or OSD. The nfsd request blocks on the
> Ceph I/O and, because it has set a global lock, all other nfsd
> processes are prevented from servicing requests to their clients. The
> nfsd processes are now all in the wait queue, causing the load number
> on the gateway system to spike. Once the Ceph I/O issue is resolved,
> the nfsd I/O request completes and all service returns to normal. The
> load on the gateway drops to normal immediately and all NFS clients can
> again interact with the nfsd processes. Throughout this time,
> unaffected Ceph objects remain available to other clients, e.g.
> OpenStack volumes.
>
> Our RBD-NFS gateway is running Ubuntu 12.04.5 with kernel
> 3.11.0-15-generic. The Ceph version installed on this client is 0.72.2,
> though I assume only the kernel-resident RBD module matters.
>
> Any thoughts or pointers appreciated.
>
> ~jpr
>

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
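
As an aside, it is straightforward to confirm from the gateway itself
whether an NFS hang coincides with a PG that is not serving I/O. Below
is a minimal sketch, not taken from the thread above: it assumes
/etc/ceph/ceph.conf and a usable keyring are readable on the gateway,
and that the installed python-rados is new enough to expose
mon_command(). The 'overall_status' field and the 'ceph pg dump_stuck
inactive' CLI mentioned in the comments are what pre-Luminous releases
provide; treat them as assumptions for other versions.

    #!/usr/bin/env python
    # Hypothetical helper (not part of the thread above): poll cluster
    # health from the RBD-NFS gateway so an NFS hang can be correlated
    # with a PG that is not serving I/O.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for the overall health; blocked/slow requests and
    # inactive or incomplete PGs are the conditions discussed above.
    ret, out, errs = cluster.mon_command(
        json.dumps({'prefix': 'health', 'format': 'json'}), b'')
    if ret == 0:
        health = json.loads(out)
        # Pre-Luminous health JSON carries 'overall_status'; fall back to
        # dumping everything if the field is absent.
        print(health.get('overall_status', health))
    else:
        print('mon_command failed: %s' % errs)

    # From the shell, 'ceph health detail' and 'ceph pg dump_stuck inactive'
    # list the PGs whose objects will leave nfsd blocked in I/O wait until
    # they are serving requests again.
    cluster.shutdown()

Running something like this from cron or a monitoring check on the
gateway makes it easier to tell a cluster-side I/O block (the case
described above) apart from a genuine problem on the gateway itself.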