On Wed, Jan 11, 2017 at 11:58:29AM -0700, Paul Allen wrote:
> I'm running into an issue where the gluster nfs service keeps dying on
> a new cluster I have set up recently. We've been using Gluster on
> several other clusters for about a year now and I have never seen this
> issue before, nor have I been able to find anything remotely similar
> to it while searching on-line. I initially was using the latest
> version in the Gluster Debian repository for Jessie, 3.9.0-1, and then
> I tried the next one down, 3.8.7-1. Both behave the same for me.
>
> What I was seeing was that after a while the nfs service on the NAS
> server would suddenly die after a number of processes had run on the
> app server I had connected to the new NAS servers for testing (we're
> upgrading the NAS servers for this cluster to newer hardware and
> expanded storage; the current production NAS servers use
> nfs-kernel-server with no clustering of the data). I checked the logs,
> but all they showed me was something that looked like a stack trace in
> the nfs.log, and the glustershd.log showed the nfs service
> disconnecting. I turned on debugging, but it didn't give me a whole
> lot more, and certainly nothing that helps me identify the source of
> my issue. It is pretty consistent in dying shortly after I mount the
> file system on the servers and start testing, usually within 15-30
> minutes. But if I have nothing using the file system, mounted or not,
> the service stays running for days. I tried mounting it using the
> gluster client, and it works fine, but I can't use that due to the
> performance penalty; it slows the websites down by a few seconds at a
> minimum.

This seems to be related to the NLM protocol that Gluster/NFS provides.
Earlier this week one of our Red Hat quality engineers also reported
this (or a very similar) problem:

https://bugzilla.redhat.com/show_bug.cgi?id=1411344

At the moment I suspect that this is related to re-connects of some
kind, but I have not been able to identify the cause sufficiently to be
sure. This definitely is a coding problem in Gluster/NFS, and the more
I look at the NLM implementation, the more potential issues I see with
it.

If the workload does not require locking operations, you may be able to
work around the problem by mounting with "-o nolock". Depending on the
application, this can be safe, or it can cause data corruption...

Another alternative is to use NFS-Ganesha instead of Gluster/NFS.
Ganesha is more mature than Gluster/NFS and is more actively developed;
Gluster/NFS is being deprecated in favour of NFS-Ganesha.
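For the "nolock" route, something along these lines should do it
(untested; the server name "nas1", volume "gv0" and the mount point are
placeholders for whatever your setup uses, and note that Gluster/NFS
only serves NFSv3 over TCP):

    # "nolock" keeps all file locking local to the client, bypassing NLM
    mount -t nfs -o vers=3,proto=tcp,nolock nas1:/gv0 /mnt/gv0

That is only safe for as long as no other client needs to see those
locks.

And if you do try NFS-Ganesha, the built-in NFS server is typically
switched off on the volume first so Ganesha can take over port 2049
(again assuming a volume named "gv0"):

    gluster volume set gv0 nfs.disable on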
HTH,
Niels

> Here is the output from the logs one of the times it died:
>
> glustershd.log:
>
> [2017-01-10 19:06:20.265918] W [socket.c:588:__socket_rwv] 0-nfs:
> readv on /var/run/gluster/a921bec34928e8380280358a30865cee.socket
> failed (No data available)
> [2017-01-10 19:06:20.265964] I [MSGID: 106006]
> [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management:
> nfs has disconnected from glusterd.
>
> nfs.log:
>
> [2017-01-10 19:06:20.135430] D [name.c:168:client_fill_address_family]
> 0-NLM-client: address-family not specified, marking it as unspec for
> getaddrinfo to resolve from (remote-host: 10.20.5.13)
> [2017-01-10 19:06:20.135531] D [MSGID: 0]
> [common-utils.c:335:gf_resolve_ip6] 0-resolver: returning
> ip-10.20.5.13 (port-48963) for hostname: 10.20.5.13 and port: 48963
> [2017-01-10 19:06:20.136569] D [logging.c:1764:gf_log_flush_extra_msgs]
> 0-logging-infra: Log buffer size reduced. About to flush 5 extra log
> messages
> [2017-01-10 19:06:20.136630] D [logging.c:1767:gf_log_flush_extra_msgs]
> 0-logging-infra: Just flushed 5 extra log messages
>
> pending frames:
> frame : type(0) op(0)
> patchset: git://git.gluster.com/glusterfs.git
> signal received: 11
> time of crash:
> 2017-01-10 19:06:20
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.9.0
> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xac)[0x7f891f0846ac]
> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x324)[0x7f891f08dcc4]
> /lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f891db870e0]
> /lib/x86_64-linux-gnu/libc.so.6(+0x91d8a)[0x7f891dbe3d8a]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3a352)[0x7f8918682352]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3cc15)[0x7f8918684c15]
> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x2aa)[0x7f891ee4e4da]
> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f891ee4a7e3]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/rpc-transport/socket.so(+0x4b33)[0x7f8919eadb33]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/rpc-transport/socket.so(+0x8f07)[0x7f8919eb1f07]
> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7e836)[0x7f891f0d9836]
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f891e3010a4]
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f891dc3a62d]
>
> The IP showing in the nfs.log is actually for a web server I was also
> testing with, not the app server, but it doesn't appear to me that
> would be the cause of the nfs service dying. I'm at a loss as to what
> is going on, and I need to try and get this fixed pretty quickly; I
> was hoping to have this in production last Friday. If anyone has any
> ideas I'd be very grateful.
>
> --
>
> Paul Allen
>
> Inetz System Administrator
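In case it helps with the debugging: the per-library offsets in a
backtrace like the one above (the "+0x..." values) can usually be
turned into function names and source lines with addr2line, provided
debug symbols matching this exact glusterfs 3.9.0 build are installed
(e.g. a -dbg/-dbgsym package, if one is available). For the first
nfs/server.so frame, for example:

    addr2line -f -e /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so 0x3a352

If the library on disk is not the same build that crashed, the result
will be meaningless.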
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users