On Wed, Jan 11, 2017 at 11:58:29AM -0700, Paul Allen wrote:
> I'm running into an issue where the gluster nfs service keeps dying on
> a new cluster I have set up recently. We've been using Gluster on
> several other clusters for about a year now and I have never seen this
> issue before, nor have I been able to find anything remotely similar
> to it while searching on-line. I initially was using the latest
> version in the Gluster Debian repository for Jessie, 3.9.0-1, and then
> I tried the next one down, 3.8.7-1. Both behave the same for me.
>
> What I was seeing was that after a while the nfs service on the NAS
> server would suddenly die after a number of processes had run on the
> app server I had connected to the new NAS servers for testing (we're
> upgrading the NAS servers for this cluster to newer hardware and
> expanded storage; the current production NAS servers use
> nfs-kernel-server with no clustering of the data). I checked the logs,
> but all they showed me was something that looked like a stack trace in
> the nfs.log, and the glustershd.log showed the nfs service
> disconnecting. I turned on debugging, but it didn't give me a whole
> lot more, and certainly nothing that helps me identify the source of
> my issue. It is pretty consistent in dying shortly after I mount the
> file system on the servers and start testing, usually within 15-30
> minutes. But if I have nothing using the file system, mounted or not,
> the service stays running for days. I tried mounting it using the
> gluster client, and it works fine, but I can't use that due to the
> performance penalty; it slows the websites down by a few seconds at a
> minimum.

This seems to be related to the NLM protocol that Gluster/NFS provides.
Earlier this week one of our Red Hat quality engineers also reported
this (or a very similar) problem:

https://bugzilla.redhat.com/show_bug.cgi?id=1411344

At the moment I suspect that this is related to re-connects of some
kind, but I have not been able to identify the cause sufficiently to be
sure. This definitely is a coding problem in Gluster/NFS, and the more
I look at the NLM implementation, the more potential issues I see with
it.

If the workload does not require locking operations, you may be able to
work around the problem by mounting with "-o nolock". Depending on the
application, this can be safe, or it can cause data corruption...

Another alternative is to use NFS-Ganesha instead of Gluster/NFS.
Ganesha is more mature than Gluster/NFS and is more actively developed;
Gluster/NFS is being deprecated in favour of NFS-Ganesha.
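For the "nolock" route, something along these lines should do it
(untested; the server name "nas1", volume "gv0" and the mount point are
placeholders for whatever your setup uses, and note that Gluster/NFS
only serves NFSv3 over TCP):

    # "nolock" keeps all file locking local to the client, bypassing NLM
    mount -t nfs -o vers=3,proto=tcp,nolock nas1:/gv0 /mnt/gv0

That is only safe for as long as no other client needs to see those
locks.

And if you do try NFS-Ganesha, the built-in NFS server is typically
switched off on the volume first so Ganesha can take over port 2049
(again assuming a volume named "gv0"):

    gluster volume set gv0 nfs.disable on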
HTH,
Niels

> Here is the output from the logs one of the times it died:
>
> glustershd.log:
>
> [2017-01-10 19:06:20.265918] W [socket.c:588:__socket_rwv] 0-nfs:
> readv on /var/run/gluster/a921bec34928e8380280358a30865cee.socket
> failed (No data available)
> [2017-01-10 19:06:20.265964] I [MSGID: 106006]
> [glusterd-svc-mgmt.c:327:glusterd_svc_common_rpc_notify] 0-management:
> nfs has disconnected from glusterd.
>
> nfs.log:
>
> [2017-01-10 19:06:20.135430] D [name.c:168:client_fill_address_family]
> 0-NLM-client: address-family not specified, marking it as unspec for
> getaddrinfo to resolve from (remote-host: 10.20.5.13)
> [2017-01-10 19:06:20.135531] D [MSGID: 0]
> [common-utils.c:335:gf_resolve_ip6] 0-resolver: returning
> ip-10.20.5.13 (port-48963) for hostname: 10.20.5.13 and port: 48963
> [2017-01-10 19:06:20.136569] D [logging.c:1764:gf_log_flush_extra_msgs]
> 0-logging-infra: Log buffer size reduced. About to flush 5 extra log
> messages
> [2017-01-10 19:06:20.136630] D [logging.c:1767:gf_log_flush_extra_msgs]
> 0-logging-infra: Just flushed 5 extra log messages
>
> pending frames:
> frame : type(0) op(0)
> patchset: git://git.gluster.com/glusterfs.git
> signal received: 11
> time of crash:
> 2017-01-10 19:06:20
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.9.0
> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_msg_backtrace_nomem+0xac)[0x7f891f0846ac]
> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(gf_print_trace+0x324)[0x7f891f08dcc4]
> /lib/x86_64-linux-gnu/libc.so.6(+0x350e0)[0x7f891db870e0]
> /lib/x86_64-linux-gnu/libc.so.6(+0x91d8a)[0x7f891dbe3d8a]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3a352)[0x7f8918682352]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so(+0x3cc15)[0x7f8918684c15]
> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_notify+0x2aa)[0x7f891ee4e4da]
> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f891ee4a7e3]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/rpc-transport/socket.so(+0x4b33)[0x7f8919eadb33]
> /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/rpc-transport/socket.so(+0x8f07)[0x7f8919eb1f07]
> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0(+0x7e836)[0x7f891f0d9836]
> /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4)[0x7f891e3010a4]
> /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f891dc3a62d]
>
> The IP showing in the nfs.log is actually for a web server I was also
> testing with, not the app server, but it doesn't appear to me that
> would be the cause of the nfs service dying. I'm at a loss as to what
> is going on, and I need to try and get this fixed pretty quickly; I
> was hoping to have this in production last Friday. If anyone has any
> ideas I'd be very grateful.
>
> --
>
> Paul Allen
>
> Inetz System Administrator
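In case it helps with the debugging: the per-library offsets in a
backtrace like the one above (the "+0x..." values) can usually be
turned into function names and source lines with addr2line, provided
debug symbols matching this exact glusterfs 3.9.0 build are installed
(e.g. a -dbg/-dbgsym package, if one is available). For the first
nfs/server.so frame, for example:

    addr2line -f -e /usr/lib/x86_64-linux-gnu/glusterfs/3.9.0/xlator/nfs/server.so 0x3a352

If the library on disk is not the same build that crashed, the result
will be meaningless.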
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users