Hello all,

I have been running a four-node (two servers, two clients) server-based AFR cluster for some time now. The architecture is described fairly accurately by the following wiki page:

http://www.gluster.org/docs/index.php/High-availability_storage_using_server-side_AFR

In summary, there are two servers and two clients; the clients are configured to connect to a single hostname, which is a round-robin DNS entry for both of the servers.

Last night, glusterfsd on one of the servers crashed (with a coredump), and instead of the remaining server being used automatically, the entire cluster became unusable. The logs on both the remaining functional server and the clients are littered with tens of thousands of error messages, and the mounted shares were not accessible.

It is (was?) my understanding that Gluster is tolerant of faults wherein one of the nodes becomes inaccessible. Is this or is this not the case?

Particulars...

Both servers:

[root@server glusterfs]# uname -s -r -o -i
Linux 2.6.25.10-86.fc9.i686 i386 GNU/Linux
[root@server glusterfs]# cat /etc/redhat-release
Fedora release 9 (Sulphur)

GLUSTER CONFIG: http://glusterfs.pastebin.com/m45feb982

Both clients:

[root@client glusterfs]# uname -s -r -o -i
Linux 2.6.24.4 x86_64 GNU/Linux
[root@client glusterfs]# cat /etc/redhat-release
Fedora release 8 (Werewolf)

GLUSTER CONFIG: http://glusterfs.pastebin.com/m48b7dd28

LOGS FROM THE INCIDENT: http://glusterfs.pastebin.com/m72cbc8f5
(excerpts from all four machines)

(Note the following backtrace from the server that crashed...)

[0x110400]
/usr/lib/libglusterfs.so.0(dict_del+0x2d)[0x808e7d]
/usr/lib/glusterfs/1.3.12/xlator/protocol/client.so(notify+0x21b)[0x126a4b]
/usr/lib/libglusterfs.so.0(transport_notify+0x3d)[0x81374d]
/usr/lib/libglusterfs.so.0(sys_epoll_iteration+0xf9)[0x814779]
/usr/lib/libglusterfs.so.0(poll_iteration+0xa0)[0x8138f0]
[glusterfs](main+0x786)[0x804a156]
/lib/libc.so.6(__libc_start_main+0xe6)[0xb655d6]
[glusterfs][0x8049431]
---------

What could have caused Gluster to crash? Should the cluster have continued to function, or not? What, if anything, can be done to prevent this from happening in the future?

Thank you, all.

--
Daniel Maher <dma+gluster AT witbe DOT net>
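
P.S. In case the pastebin links above go stale: the setup follows the usual server-side AFR pattern from the wiki page. The sketch below shows only the general shape, in 1.3.x-style volume-spec syntax; the host names, export directory, and volume names are illustrative placeholders, not copied from my actual configs.

Server-side spec (each server replicates its local brick with the brick exported by the other server, then exports both the raw brick and the replicated volume):

  # local storage brick (directory is a placeholder)
  volume brick
    type storage/posix
    option directory /data/export
  end-volume

  # connection to the raw brick exported by the other server (placeholder hostname)
  volume remote-brick
    type protocol/client
    option transport-type tcp/client
    option remote-host server2.example.com
    option remote-subvolume brick
  end-volume

  # mirror the local and remote bricks
  volume afr
    type cluster/afr
    subvolumes brick remote-brick
  end-volume

  # export the raw brick (for the peer server) and the mirror (for the clients)
  volume server
    type protocol/server
    option transport-type tcp/server
    option auth.ip.brick.allow *
    option auth.ip.afr.allow *
    subvolumes brick afr
  end-volume

Client-side spec (a single protocol/client volume pointing at the round-robin hostname, so either server can answer the mount):

  volume mirror
    type protocol/client
    option transport-type tcp/client
    option remote-host storage.example.com   # round-robin DNS entry, placeholder
    option remote-subvolume afr
  end-volume

The point of the round-robin DNS entry is only to spread the initial connections across the two servers; the replication itself happens between the servers, which is why I expected a client to keep working when one server died.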