Traceback from core file (16GB):

#0  0x00007f0e3d924925 in raise () from /lib64/libc.so.6
#1  0x00007f0e3d926105 in abort () from /lib64/libc.so.6
#2  0x00007f0e3d962837 in __libc_message () from /lib64/libc.so.6
#3  0x00007f0e3d968166 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f0e3f0e2e0f in rpcsvc_drc_op_destroy (drc=0x21e2780, reply=0x7f0e302ee470) at rpc-drc.c:47
#5  0x00007f0e3f331bc1 in rb_destroy (tree=0x7f0e302ee02c, destroy=0x7f0e3f0e2e80 <rpcsvc_drc_rb_op_destroy>) at ../../contrib/rbtree/rb.c:876
#6  0x00007f0e3f0e2b5f in rpcsvc_remove_drc_client (drc=0x21e2780, client=0x230e540) at rpc-drc.c:84
#7  rpcsvc_drc_client_unref (drc=0x21e2780, client=0x230e540) at rpc-drc.c:316
#8  0x00007f0e3f0e2c98 in rpcsvc_drc_notify (svc=<value optimized out>, xl=<value optimized out>, event=<value optimized out>, data=0x230f670) at rpc-drc.c:683
#9  0x00007f0e3f0d9d35 in rpcsvc_handle_disconnect (svc=0x21a6990, trans=0x230f670) at rpcsvc.c:682
#10 0x00007f0e3f0db880 in rpcsvc_notify (trans=0x230f670, mydata=<value optimized out>, event=<value optimized out>, data=0x230f670) at rpcsvc.c:720
#11 0x00007f0e3f0dcf98 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#12 0x00007f0e3a93c9a1 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, data=0x230f670, poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1071
#13 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x230f670, poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2239
#14 0x00007f0e3f3512f7 in event_dispatch_epoll_handler (event_pool=0x2186ef0) at event-epoll.c:384
#15 event_dispatch_epoll (event_pool=0x2186ef0) at event-epoll.c:445
#16 0x00000000004075e4 in main (argc=11, argv=0x7fffabef9e38) at glusterfsd.c:1983
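For anyone who wants to poke at the core themselves, a trace like the one
above can be produced with a plain gdb session along these lines (the core
file path here is just an example; substitute the actual core file name):

    gdb /usr/sbin/glusterfs /path/to/core.<pid>   # load the core against the gluster NFS server binary
    (gdb) bt                                      # print the stack trace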

On Mon, 2014-05-19 at 14:39 +0800, Franco Broi wrote:
> Just had an NFS crash on my test system running 3.5.
> 
> Load of messages like this:
> 
> [2014-05-19 06:24:59.347147] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
> [2014-05-19 06:24:59.347240] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
> [2014-05-19 06:24:59.347340] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
> [2014-05-19 06:24:59.347408] E [rpc-drc.c:499:rpcsvc_add_op_to_cache] 0-rpc-service: DRC failed to detect duplicates
> 
> followed by:
> 
> ....
> frame : type(0) op(0)
> frame : type(0) op(0)
> frame : type(0) op(0)
> 
> patchset: git://git.gluster.com/glusterfs.git
> signal received: 6
> time of crash: 2014-05-19 06:25:13
> configuration details:
> argp 1
> backtrace 1
> dlfcn 1
> fdatasync 1
> libpthread 1
> llistxattr 1
> setfsid 1
> spinlock 1
> epoll.h 1
> xattr.h 1
> st_atim.tv_nsec 1
> package-string: glusterfs 3.5.0
> /lib64/libc.so.6(+0x329a0)[0x7f0e3d9249a0]
> /lib64/libc.so.6(gsignal+0x35)[0x7f0e3d924925]
> /lib64/libc.so.6(abort+0x175)[0x7f0e3d926105]
> /lib64/libc.so.6(+0x70837)[0x7f0e3d962837]
> /lib64/libc.so.6(+0x76166)[0x7f0e3d968166]
> /usr/lib64/libgfrpc.so.0(+0x10e0f)[0x7f0e3f0e2e0f]
> /usr/lib64/libglusterfs.so.0(rb_destroy+0x51)[0x7f0e3f331bc1]
> /usr/lib64/libgfrpc.so.0(+0x10b5f)[0x7f0e3f0e2b5f]
> /usr/lib64/libgfrpc.so.0(rpcsvc_drc_notify+0xe8)[0x7f0e3f0e2c98]
> /usr/lib64/libgfrpc.so.0(rpcsvc_handle_disconnect+0x105)[0x7f0e3f0d9d35]
> /usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x1a0)[0x7f0e3f0db880]
> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x28)[0x7f0e3f0dcf98]
> /usr/lib64/glusterfs/3.5.0/rpc-transport/socket.so(+0xa9a1)[0x7f0e3a93c9a1]
> /usr/lib64/libglusterfs.so.0(+0x672f7)[0x7f0e3f3512f7]
> /usr/sbin/glusterfs(main+0x564)[0x4075e4]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7f0e3d910d1d]
> /usr/sbin/glusterfs[0x404679]
> 
> Volume Name: data2
> Type: Distribute
> Volume ID: d958423f-bd25-49f1-81f8-f12e4edc6823
> Status: Started
> Number of Bricks: 8
> Transport-type: tcp
> Bricks:
> Brick1: nas5-10g:/data17/gvol
> Brick2: nas5-10g:/data18/gvol
> Brick3: nas5-10g:/data19/gvol
> Brick4: nas5-10g:/data20/gvol
> Brick5: nas6-10g:/data21/gvol
> Brick6: nas6-10g:/data22/gvol
> Brick7: nas6-10g:/data23/gvol
> Brick8: nas6-10g:/data24/gvol
> Options Reconfigured:
> cluster.min-free-disk: 5%
> network.frame-timeout: 10800
> cluster.readdir-optimize: on
> nfs.disable: off
> nfs.export-volumes: on
> performance.readdir-ahead: off
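
The repeated "DRC failed to detect duplicates" errors and the abort inside
rpc-drc.c both point at the NFS duplicate request cache tearing down a
client's cached replies on disconnect. Until the underlying bug is fixed, a
workaround that may be worth testing (a sketch only; I'm assuming the
nfs.drc volume option is settable in this 3.5.0 build, and I haven't
verified that it avoids the crash) is to turn the DRC off for the volume:

    gluster volume set data2 nfs.drc off   # disable the NFS duplicate request cache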

> On Thu, 2014-05-01 at 09:55 +0800, Franco Broi wrote:
> > Installed 3.4.3 exactly 2 weeks ago on all our brick servers and I'm
> > happy to report that we've not had a crash since.
> > 
> > Thanks for all the good work.
> > 
> > On Tue, 2014-04-15 at 14:22 +0800, Franco Broi wrote:
> > > The whole system came to a grinding halt today and no amount of
> > > restarting daemons would make it work again. What was really odd was
> > > that gluster vol status said everything was fine, and yet all the
> > > client mount points had hung.
> > > 
> > > On the node that was exporting Gluster NFS I had zombie processes, so
> > > I decided to reboot. It took a while for the ZFS JBODs to sort
> > > themselves out, but I was relieved when it all came back up - except
> > > that the df size on the clients was wrong...
> > > 
> > > gluster vol info and gluster vol status said everything was fine, but
> > > it was obvious that 2 of my bricks were missing. I restarted
> > > everything, and still 2 bricks were missing. I remounted the fuse
> > > clients and still no good.
> > > 
> > > Just out of sheer desperation and for no good reason I disabled the
> > > Gluster NFS export, and magically the 2 missing bricks reappeared and
> > > the filesystem was back to its normal size. I turned NFS exports back
> > > on and everything stayed working.
> > > 
> > > I'm not trying to belittle all the good work done by the Gluster
> > > developers, but this really doesn't look like a viable big-data
> > > filesystem at the moment. We've currently got 800TB and are about to
> > > add another 400TB, but quite honestly the prospect terrifies me.
> > > 
> > > On Tue, 2014-04-15 at 08:35 +0800, Franco Broi wrote:
> > > > On Mon, 2014-04-14 at 17:29 -0700, Harshavardhana wrote:
> > > > > > 
> > > > > > Just distributed.
> > > > > > 
> > > > > 
> > > > > With a pure distributed setup you have to take downtime, since
> > > > > the data isn't replicated.
> > > > 
> > > > If I shut down the server processes, won't the clients just wait
> > > > for them to come back up, i.e. like NFS hard mounts? I don't mind
> > > > an interruption; I just want to avoid killing all jobs that are
> > > > currently accessing the filesystem if at all possible. Our users
> > > > have suffered a lot recently with filesystem outages.
> > > > 
> > > > By the way, how does one shut down the glusterfs processes without
> > > > stopping a volume? It would be nice to have a quiesce or freeze
> > > > option that just stalls all access while maintenance takes place.
> > > > 
> > > > > >> > 3.4.1 to 3.4.3-3 shouldn't cause problems with existing
> > > > > >> > clients and other servers, right?
> > > > > >> 
> > > > > >> You mean 3.4.1 and 3.4.3 co-existing within a cluster?
> > > > > > 
> > > > > > Yes, at least for the duration of the upgrade.
> > > > > 
> > > > > Yeah, the 3.4.x releases are backward compatible with each other
> > > > > in any case.
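
Coming back to the question above about shutting down the glusterfs
processes without stopping the volume: the approach usually suggested (a
sketch only, assuming a stock RPM install with init scripts; double-check
the process names on your own systems before relying on this) is to stop
the daemons directly on each brick server:

    service glusterd stop   # stop the management daemon first
    pkill glusterfs         # the pattern also matches glusterfsd, so this stops
                            # the brick processes plus the gluster NFS server and
                            # self-heal daemon - and any FUSE mounts on this host

Note that this is not quite the same as an NFS hard mount: by default FUSE
clients start returning errors for operations against a down brick once
network.ping-timeout (42 seconds by default) expires, rather than blocking
indefinitely.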