Re: hanging nfsd requests on an RBD to NFS gateway

Ryan Tokarek <tokarek@xxxxxxxxxxx> · Thu, 22 Oct 2015 23:15:10 -0500

> On Oct 22, 2015, at 10:19 PM, John-Paul Robinson <jpr@xxxxxxx> wrote:
> 
> A few clarifications on our experience:
> 
> * We have 200+ rbd images mounted on our RBD-NFS gateway.  (There's
> nothing easier for a user to understand than "your disk is full".)

Same here, and agreed. It sounds like our situations are similar, except that I also see writes blocking on an apparently healthy cluster.

> * I'd expect more contention potential with a single shared RBD back
> end, but with many distinct and presumably isolated backend RBD images,
> I've always been surprised that *all* the nfsd tasks hang.  This leads me
> to think it's an nfsd issue rather than an rbd issue.  (I realize this
> is an rbd list, looking for shared experience. ;) )

It's definitely possible. I've experienced exactly the behavior you're seeing. My guess is that when an nfsd thread blocks and goes dark, the affected clients (even if it's only one) assume a network problem and retransmit their requests, and each retransmit ties up another nfsd until all the server threads are stuck (that could be hogwash, but it fits the behavior). Or perhaps enough individual clients are writing to the affected NFS volume that they consume all the available nfsd threads (I don't know your ratio of clients to file systems to nfsd threads, but that is plausible in my situation). I think some testing with xfs_freeze on a non-critical NFS server and its clients is called for.
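To make the thread-exhaustion guess concrete, here is a toy model in Python. It is only a sketch under stated assumptions: every request that touches the frozen export pins one server thread forever (whether it is a retransmit or a new request from another client), and everything else is served instantly. The names `simulate`, `vol_a`, and `vol_b` are made up for illustration, and none of this reflects how the kernel nfsd actually schedules work.

```python
def simulate(num_threads, arrivals, frozen_export):
    """Toy model of a fixed-size server thread pool.

    arrivals is a list of export names, one request per tick.
    ASSUMPTION: a request to frozen_export blocks its thread forever;
    a request to any other export completes within the same tick.
    Returns (number of blocked threads, per-request log).
    """
    blocked = 0
    log = []
    for tick, export in enumerate(arrivals):
        if blocked == num_threads:
            # Every thread is stuck, so even healthy exports get no service.
            log.append((tick, export, "queued"))
        elif export == frozen_export:
            blocked += 1
            log.append((tick, export, "blocked"))
        else:
            log.append((tick, export, "served"))
    return blocked, log


if __name__ == "__main__":
    # Four threads, requests alternating between a frozen and a healthy export.
    blocked, log = simulate(4, ["vol_a", "vol_b"] * 5, frozen_export="vol_a")
    for tick, export, outcome in log:
        print(tick, export, outcome)
    print("blocked threads:", blocked)
```

In this model the healthy export is served only until the fourth request to the frozen one arrives, after which every request queues. That matches the "everything hangs" symptom, but it does not tell us which of the two guesses (retransmits or many distinct writers) is the real cause.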

I don't think this part is related to ceph except that it happens to be providing the underlying storage. I'm fairly certain that my problem with an apparently healthy cluster blocking writes is a ceph problem, but I haven't figured out its source.

> * I haven't seen any difference between reads and writes.  Any access to
> any backing RBD store from the NFS client hangs.

All NFS clients are hung, but in my situation, it's usually only 1-3 local file systems that stop accepting writes. NFS is completely unresponsive, but local and remote-samba operations on the unaffected file systems are totally happy. 

I don't have a solution to the NFS issue, but I've seen it all too often. I wonder whether setting a huge number of nfsd threads and/or playing with client retransmit times would help, but I suspect this problem is just intrinsic to Linux NFS servers.
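Before raising the thread count, it may help to see how many nfsd threads are actually stuck when the hang occurs. The following is a rough Python sketch, not a tested tool: it assumes a Linux server where the kernel nfsd threads show up under /proc with the command name "nfsd", and it counts those in uninterruptible sleep (state "D"). The helper names `parse_stat` and `nfsd_thread_states` are made up for illustration.

```python
import os


def parse_stat(text):
    """Split the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' to find where it ends.
    left, right = text.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def nfsd_thread_states(proc="/proc"):
    """Return [(pid, state)] for every task whose command name is 'nfsd'."""
    if not os.path.isdir(proc):
        return []  # not a Linux-style /proc; nothing to report
    states = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # task exited while we were reading, or odd format
        if comm == "nfsd":  # ASSUMPTION: server threads are named "nfsd"
            states.append((pid, state))
    return states


if __name__ == "__main__":
    states = nfsd_thread_states()
    stuck = [pid for pid, state in states if state == "D"]
    print(f"{len(states)} nfsd threads, {len(stuck)} in uninterruptible sleep (D)")
```

If every nfsd thread is in D state during a hang, adding threads would probably only delay the stall. If some remain idle, the bottleneck is somewhere else.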

Ryan