Self Heal/Recovery Problem

Kamil Srot <kamil.srot@xxxxxxxxx> · Mon, 15 Oct 2007 18:36:34 +0200

Dear Gluster developers and fans,

First of all, I want to say a big THANK YOU for the work you do. Of
the systems I have looked at and tried (OCFS2, GFS2, CODA, NFS), yours
is the first one I like and whose logic I at least partly understand.
The others seem too complex, hard to understand and possibly hard to
reconfigure if something goes wrong.
I worked for about 2 months with my OCFS test setup (the simplest of
the "other" clustering filesystem solutions) and I don't have as good a
feeling about it as I do after a few days with GlusterFS...

Well, it wouldn't be a good post to the devel group without questions,
so I'm composing a few questions about performance/tuning of my setup
in another window, but I recently ran into an issue.

I have quite a simple setup: two servers mirroring data with afr *:2,
plus unify and io-threads...
The setup worked fine for several days of stress testing, but recently
I found an article recommending a certain format parameter for the
underlying XFS filesystem...
So I stopped glfs and glfsd on one of the servers and reformatted the
device... I created the exported directories and started glfsd & glfs
again... then I tried to kick-start the self-heal to re-mirror the test
data with find <mountpoint> -type f ... oops, glfsd segfaults after a
few seconds. In the log, I have:

(The glfs version is mainline--2.5--patch-518.)

---------
got signal (11), printing backtrace
---------
[0xb7f7f420]
/cluster/lib/glusterfs/1.3.5/xlator/protocol/client.so[0xb7604432]
/cluster/lib/glusterfs/1.3.5/xlator/protocol/client.so[0xb7606a4b]
/cluster/lib/glusterfs/1.3.5/xlator/protocol/client.so(notify+0xe5)[0xb7607666]
/cluster/lib/libglusterfs.so.0(transport_notify+0x62)[0xb7f70a92]
/cluster/lib/libglusterfs.so.0[0xb7f712fc]
/cluster/lib/libglusterfs.so.0(sys_epoll_iteration+0x16b)[0xb7f71642]
/cluster/lib/libglusterfs.so.0(poll_iteration+0x3b)[0xb7f70dce]
[glusterfsd](main+0x4e3)[0x804991d]
/lib/tls/libc.so.6(__libc_start_main+0xc8)[0xb7e27ea8]
[glusterfsd][0x8048e51]
---------

There is also a core file in the root directory; its backtrace is:
#0  0xb75574f8 in afr_sync_ownership_permission ()
  from /cluster/lib/glusterfs/1.3.5/xlator/cluster/afr.so
#1  0xb7576432 in client_closedir_cbk ()
  from /cluster/lib/glusterfs/1.3.5/xlator/protocol/client.so
#2  0xb7578a4b in client_protocol_interpret ()
  from /cluster/lib/glusterfs/1.3.5/xlator/protocol/client.so
#3  0xb7579666 in notify ()
   from /cluster/lib/glusterfs/1.3.5/xlator/protocol/client.so
#4  0xb7edfa92 in transport_notify (this=0x8053848, event=1) at transport.c:154
#5  0xb7ee02fc in epoll_notify (eevent=1, data=0x8053848) at epoll.c:53
#6  0xb7ee0642 in sys_epoll_iteration (ctx=0xbfb026d4) at epoll.c:155
#7  0xb7edfdce in poll_iteration (ctx=0xbfb026d4) at transport.c:300
#8  0x0804991d in main ()

Could this be a problem with permissions?

Any hints or help would be greatly appreciated!

*glusterfs-server.vol*
volume mailspool-ds
   type storage/posix
   option directory /data/mailspool-ds
end-volume

volume mailspool-ns
   type storage/posix
   option directory /data/mailspool-ns
end-volume

volume mailspool-san1-ds
   type protocol/client
   option transport-type tcp/client
   option remote-host 10.0.0.110
   option remote-subvolume mailspool-ds
end-volume

volume mailspool-san1-ns
   type protocol/client
   option transport-type tcp/client
   option remote-host 10.0.0.110
   option remote-subvolume mailspool-ns
end-volume

volume mailspool-ns-afr
   type cluster/afr
   subvolumes mailspool-ns mailspool-san1-ns
   option replicate *:2
end-volume

volume mailspool-ds-afr
   type cluster/afr
   subvolumes mailspool-ds mailspool-san1-ds
   option replicate *:2
end-volume

volume mailspool-unify
   type cluster/unify
   subvolumes mailspool-ds-afr
   option namespace mailspool-ns-afr
   option scheduler random
end-volume

volume mailspool
   type performance/io-threads
   option thread-count 8
   option cache-size 64MB
   subvolumes mailspool-unify
end-volume

volume server
   type protocol/server
   option transport-type tcp/server
   subvolumes mailspool
   option auth.ip.mailspool-ds.allow 10.0.0.*,127.0.0.1
   option auth.ip.mailspool-ns.allow 10.0.0.*,127.0.0.1
   option auth.ip.mailspool.allow *
end-volume

*glusterfs-client.vol*
volume client
   type protocol/client
   option transport-type tcp/client
   option remote-host 127.0.0.1
   option remote-subvolume mailspool
end-volume

volume writebehind
   type performance/write-behind
   option aggregate-size 131072 # aggregate block size in bytes
   subvolumes client
end-volume

volume readahead
   type performance/read-ahead
   option page-size 131072
   option page-count 2
   subvolumes writebehind
end-volume

volume iothreads    #iothreads can give performance a boost
   type performance/io-threads
   option thread-count 8
   option cache-size 64MB
   subvolumes readahead
end-volume

Best Regards,
--
Kamil