Just had an NFS crash on my test system running 3.5. Load of messages like this:

[2014-05-19 06:24:59.347147] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2014-05-19 06:24:59.347240] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2014-05-19 06:24:59.347340] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
[2014-05-19 06:24:59.347408] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates

followed by:

....
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.com/glusterfs.git
signal received: 6
time of crash: 2014-05-19 06:25:13
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.5.0
/lib64/libc.so.6(+0x329a0)[0x7f0e3d9249a0]
/lib64/libc.so.6(gsignal+0x35)[0x7f0e3d924925]
/lib64/libc.so.6(abort+0x175)[0x7f0e3d926105]
/lib64/libc.so.6(+0x70837)[0x7f0e3d962837]
/lib64/libc.so.6(+0x76166)[0x7f0e3d968166]
/usr/lib64/libgfrpc.so.0(+0x10e0f)[0x7f0e3f0e2e0f]
/usr/lib64/libglusterfs.so.0(rb_destroy+0x51)[0x7f0e3f331bc1]
/usr/lib64/libgfrpc.so.0(+0x10b5f)[0x7f0e3f0e2b5f]
/usr/lib64/libgfrpc.so.0(rpcsvc_drc_notify+0xe8)[0x7f0e3f0e2c98]
/usr/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x105)[0x7f0e3f0d9d35]
/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x1a0)[0x7f0e3f0db880]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f0e3f0dcf98]
/usr/lib64/glusterfs/3.5.0/rpc-transport/socket.so(+0xa9a1)[0x7f0e3a93c9a1]
/usr/lib64/libglusterfs.so.0(+0x672f7)[0x7f0e3f3512f7]
/usr/sbin/glusterfs(main+0x564)[0x4075e4]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0e3d910d1d]
/usr/sbin/glusterfs[0x404679]

Volume Name: data2
Type: Distribute
Volume ID: d958423f-bd25-49f1-81f8-f12e4edc6823
Status: Started
Number of Bricks: 8
Transport-type: tcp
Bricks:
Brick1: nas5-10g:/data17/gvol
Brick2: nas5-10g:/data18/gvol
Brick3: nas5-10g:/data19/gvol
Brick4: nas5-10g:/data20/gvol
Brick5: nas6-10g:/data21/gvol
Brick6: nas6-10g:/data22/gvol
Brick7: nas6-10g:/data23/gvol
Brick8: nas6-10g:/data24/gvol
Options Reconfigured:
cluster.min-free-disk: 5%
network.frame-timeout: 10800
cluster.readdir-optimize: on
nfs.disable: off
nfs.export-volumes: on
performance.readdir-ahead: off
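Looking at the backtrace, the abort is happening inside the NFS duplicate request cache (rpc-drc), so as a stopgap I'm tempted to just turn the DRC off for this volume and see if the crashes stop - assuming nfs.drc is the right option name in this 3.5.0 build and that disabling it actually avoids this code path:

    # untested; assumes the nfs.drc volume option is available in this build
    gluster volume set data2 nfs.drc off

No idea yet whether that helps, it's purely based on where the backtrace points.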
On Thu, 2014-05-01 at 09:55 +0800, Franco Broi wrote:
> Installed 3.4.3 exactly 2 weeks ago on all our brick servers and I'm
> happy to report that we've not had a crash since.
>
> Thanks for all the good work.
>
> On Tue, 2014-04-15 at 14:22 +0800, Franco Broi wrote:
> > The whole system came to a grinding halt today and no amount of
> > restarting daemons would make it work again. What was really odd was
> > that gluster vol status said everything was fine and yet all the
> > client mount points had hung.
> >
> > On the node that was exporting Gluster NFS I had zombie processes so
> > I decided to reboot. It took a while for the ZFS JBODs to sort
> > themselves out but I was relieved when it all came back up - except
> > that the df size on the clients was wrong...
> >
> > gluster vol info and gluster vol status said everything was fine but
> > it was obvious that 2 of my bricks were missing. I restarted
> > everything, and still 2 missing bricks. I remounted the fuse clients
> > and still no good.
> >
> > Just out of sheer desperation and for no good reason I disabled the
> > Gluster NFS export and magically the missing 2 bricks reappeared and
> > the filesystem was back to its normal size. I turned NFS exports
> > back on and everything stayed working.
> >
> > I'm not trying to belittle all the good work done by the Gluster
> > developers but this really doesn't look like a viable big data
> > filesystem at the moment. We've currently got 800TB and are about to
> > add another 400TB but quite honestly the prospect terrifies me.
> >
> > On Tue, 2014-04-15 at 08:35 +0800, Franco Broi wrote:
> > > On Mon, 2014-04-14 at 17:29 -0700, Harshavardhana wrote:
> > > >
> > > > > Just distributed.
> > > >
> > > > Pure distributed setup you have to take a downtime, since the
> > > > data isn't replicated.
> > >
> > > If I shut down the server processes, won't the clients just wait
> > > for it to come back up, i.e. like NFS hard mounts? I don't mind an
> > > interruption, I just want to avoid killing all jobs that are
> > > currently accessing the filesystem if at all possible; our users
> > > have suffered a lot recently with filesystem outages.
> > >
> > > By the way, how does one shut down the glusterfs processes without
> > > stopping a volume? It would be nice to have a quiesce or freeze
> > > option that just stalls all access while maintenance takes place.
> > >
> > > > >> > 3.4.1 to 3.4.3-3 shouldn't cause problems with existing
> > > > >> > clients and other servers, right?
> > > > >>
> > > > >> You mean 3.4.1 and 3.4.3 co-existent within a cluster?
> > > > >
> > > > > Yes, at least for the duration of the upgrade.
> > > >
> > > > Yeah, the 3.4.x series is backward compatible with each other in
> > > > any case.

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users