On 10/22/2015 10:57 PM, John-Paul Robinson wrote:
> Hi,
>
> Has anyone else experienced a problem with RBD-to-NFS gateways blocking
> nfsd server requests when their ceph cluster has a placement group that
> is not servicing I/O for some reason, e.g. too few replicas or an OSD
> with slow request warnings?
>
> We have an RBD-NFS gateway that stops responding to NFS clients
> (interaction with RBD-backed NFS shares hangs on the NFS client)
> whenever some part of our ceph cluster is in an I/O-blocked condition.
> This issue only affects the ability of the nfsd processes to serve
> requests to their clients. I can look at and access the underlying
> mounted RBD containers without issue, although they appear hung from
> the NFS client side. The gateway node load numbers spike to a value
> that reflects the number of nfsd processes, but the system is otherwise
> untaxed (unlike a normal high-load situation, i.e. I can type and run
> commands with normal responsiveness).
>

Well, that is normal, I think. Certain objects become unresponsive if a
PG is not serving I/O. With a simple 'ls' or 'df -h' you might not be
touching those objects, so for you it seems like everything is
functioning. The nfsd process, however, might be hung on a blocking I/O
call. That is completely normal and to be expected.

That it hangs the complete NFS server might just be a side effect of
how nfsd was written. It might be that Ganesha works better for you:
http://blog.widodh.nl/2014/12/nfs-ganesha-with-libcephfs-on-ubuntu-14-04/

> The behavior comes across like there is some nfsd global lock that an
> nfsd sets before requesting I/O from a backend device. In the case
> above, the I/O request hangs on one RBD image affected by the I/O block
> caused by the problematic PG or OSD. The nfsd request blocks on the
> Ceph I/O and, because it has set a global lock, all other nfsd
> processes are prevented from servicing requests to their clients. The
> nfsd processes are now all in the wait queue, causing the load number
> on the gateway system to spike. Once the Ceph I/O issue is resolved,
> the nfsd I/O request completes and all service returns to normal. The
> load on the gateway drops to normal immediately and all NFS clients can
> again interact with the nfsd processes. Throughout this time,
> unaffected Ceph objects remain available to other clients, e.g.
> OpenStack volumes.
>
> Our RBD-NFS gateway is running Ubuntu 12.04.5 with kernel
> 3.11.0-15-generic. The Ceph version installed on this client is 0.72.2,
> though I assume only the kernel-resident RBD module matters.
>
> Any thoughts or pointers appreciated.
>
> ~jpr
>

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
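
As an aside, it is straightforward to confirm from the gateway itself
whether an NFS hang coincides with a PG that is not serving I/O. Below
is a minimal sketch, not taken from the thread above: it assumes
/etc/ceph/ceph.conf and a usable keyring are readable on the gateway,
and that the installed python-rados is new enough to expose
mon_command(). The 'overall_status' field and the 'ceph pg dump_stuck
inactive' CLI mentioned in the comments are what pre-Luminous releases
provide; treat them as assumptions for other versions.

    #!/usr/bin/env python
    # Hypothetical helper (not part of the thread above): poll cluster
    # health from the RBD-NFS gateway so an NFS hang can be correlated
    # with a PG that is not serving I/O.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Ask the monitors for the overall health; blocked/slow requests and
    # inactive or incomplete PGs are the conditions discussed above.
    ret, out, errs = cluster.mon_command(
        json.dumps({'prefix': 'health', 'format': 'json'}), b'')
    if ret == 0:
        health = json.loads(out)
        # Pre-Luminous health JSON carries 'overall_status'; fall back to
        # dumping everything if the field is absent.
        print(health.get('overall_status', health))
    else:
        print('mon_command failed: %s' % errs)

    # From the shell, 'ceph health detail' and 'ceph pg dump_stuck inactive'
    # list the PGs whose objects will leave nfsd blocked in I/O wait until
    # they are serving requests again.
    cluster.shutdown()

Running something like this from cron or a monitoring check on the
gateway makes it easier to tell a cluster-side I/O block (the case
described above) apart from a genuine problem on the gateway itself.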