Greetings. This is my first post to this list, so please bear with me while I try to flesh out the segfault I saw yesterday.

Call me brave, call me stupid: without enough equipment on which to test things, I have plunged glusterfs 1.3.12 straight into production on a small Opteron-based cluster. The 14 clients are either 2- or 4-way Opteron machines (44 cores all up) running amd64 Gentoo with a 2.6.20 kernel and the Gluster 2.7.3 fuse module. The two servers run the same Gentoo as the clients and are 4-way Opterons, dual-homed (GigE), with one glusterfsd per network connection, each daemon sharing out 250G.

Yesterday the glusterfs process on one of the 2-way clients went to 100% CPU. Attaching an strace to it showed it repeatedly calling nanosleep. Since the machine needed to be back online quickly (oh for the budget of LANL!) I tried to ctrl-c the strace, then sent a sigterm, then had to sigkill it (roughly the sequence sketched below). The sigterm must have got through to the glusterfs process, because the log on the client contains:

2009-01-14 14:01:53 W [glusterfs.c:416:glusterfs_cleanup_and_exit] glusterfs: shutting down server

There were no log entries made while it was running at 100%. The problem on the client was first noticed when a user tried to tab-complete a directory listing of the gluster-mounted file system. The gluster client was restarted.

It was only a couple of hours later, when some of the users reported issues, that I noticed one of the glusterfsd processes had died on a server. The glusterfsd segfault on the server coincides with killing the glusterfs on the client. I haven't compiled gluster with debug, so following are the entries from the server log, the client config, and a backtrace of the core dump (which unfortunately just mirrors what's in the log).

Side note: in an earlier 1.3.12 config we were running stripe across two glusterfsd backends. It proved to be quite unstable (specifically, directories sometimes did not sync across the backends) compared to the unify+namespace config. Otherwise glusterfs seems all-round easier to install and use than my first cluster filesystem attempt with PVFS.
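For the record, this is roughly the sequence I went through on the stuck client (the PIDs are placeholders, not the real ones, so treat this as a sketch of what happened rather than an exact transcript):

  strace -p <glusterfs-pid>    # attach to the spinning glusterfs; showed a tight nanosleep loop
  # ctrl-c would not stop the trace, so from another shell:
  kill -TERM <strace-pid>      # no visible effect at the time
  kill -KILL <strace-pid>      # had to resort to this to get the machine back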
Contents of /var/log/glusterfsd.log:
====================================
2009-01-14 14:01:53 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (172.17.231.162:1016)
2009-01-14 14:01:53 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (172.17.231.162:1017)
TLA Repo Revision : glusterfs--mainline--2.5--patch-797
Time : 2009-01-14 14:01:53
Signal Number : 11

glusterfsd -f /etc/glusterfs/glusterfs-server-shareda.vol -l /var/log/glusterfs/glusterfsd.log -L WARNING

volume server
  type protocol/server
  option auth.ip.nsbricka.allow *
  option auth.ip.hans.allow *
  option auth.ip.data.allow *
  option bind-address 172.17.231.170
  option transport-type tcp/server
  subvolumes data hans nsbricka
end-volume

volume data
  type performance/io-threads
  option cache-size 128M
  option thread-count 4
  subvolumes databrick
end-volume

volume databrick
  type storage/posix
  option directory /var/local/shareda
end-volume

volume hans
  type cluster/afr
  subvolumes nsbricka nsbrickb
end-volume

volume nsbrickb
  type protocol/client
  option remote-subvolume nsbricka
  option remote-host maelstroma9
  option transport-type tcp/client
end-volume

volume nsbricka
  type storage/posix
  option directory /var/local/namespace
end-volume

frame : type(0) op(0)
frame : type(0) op(0)
2009-01-14 14:01:53 E [protocol.c:271:gf_block_unserialize_transport] server: EOF from peer (172.17.231.162:1015)
/lib/libc.so.6[0x2af3d0e0f940]
/usr/lib64/glusterfs/1.3.12/xlator/cluster/afr.so(afr_close+0x140)[0x2aaaaacd37d0]
/usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(server_protocol_cleanup+0x1af)[0x2aaaaaef80cf]
/usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(notify+0x6e)[0x2aaaaaef853e]
/usr/lib/libglusterfs.so.0(transport_unref+0x64)[0x2af3d0ab32b4]
/usr/lib64/glusterfs/1.3.12/transport/tcp/client.so(tcp_disconnect+0x7d)[0x2aaaaaffdcfd]
/usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so(notify+0x61)[0x2aaaaaef8531]
/usr/lib/libglusterfs.so.0(sys_epoll_iteration+0xbb)[0x2af3d0ab3c4b]
/usr/lib/libglusterfs.so.0(poll_iteration+0x78)[0x2af3d0ab3008]
[glusterfs](main+0x67c)[0x40288c]
/lib/libc.so.6(__libc_start_main+0xf4)[0x2af3d0dfd374]
[glusterfs][0x401d59]
---------
====================================
end of glusterfsd.log

/etc/glusterfs/glusterfs-client.vol:
====================================
volume brick1
  type protocol/client
  option transport-type tcp/client   # for TCP/IP transport
  option remote-host maelstroma0
  option transport-timeout 120
  option remote-subvolume data       # name of the remote volume
end-volume

volume brick2
  type protocol/client
  option transport-type tcp/client   # for TCP/IP transport
  option remote-host maelstroma0a
  option transport-timeout 120
  option remote-subvolume data       # name of the remote volume
end-volume

volume brick3
  type protocol/client
  option transport-type tcp/client   # for TCP/IP transport
  option remote-host maelstroma9
  option transport-timeout 120
  option remote-subvolume data       # name of the remote volume
end-volume

volume brick4
  type protocol/client
  option transport-type tcp/client   # for TCP/IP transport
  option remote-host maelstroma9a
  option transport-timeout 120
  option remote-subvolume data       # name of the remote volume
end-volume

volume ns
  type protocol/client
  option transport-type tcp/client
  option remote-host gluster
  option transport-timeout 120
  option remote-subvolume hans
end-volume

volume unify
  type cluster/unify
  option scheduler rr
  option rr.limits.min-free-disk 5
  option namespace ns
  subvolumes brick1 brick2 brick3 brick4
end-volume

volume iothreads
  type performance/io-threads
  #option thread-count 8
  option thread-count 4
  option cache-size 64M
  subvolumes unify
end-volume

volume readahead
  type performance/read-ahead
  option page-size 1024kb
  option page-count 10
  subvolumes iothreads
end-volume

volume iocache
  type performance/io-cache
  option cache-size 64MB   #default 32M
  option page-size 1MB     #default 128kb
  subvolumes readahead
end-volume

volume writebehind
  type performance/write-behind
  option aggregate-size 1MB
  option flush-behind off
  subvolumes iocache
end-volume
====================================
end of glusterfs-client.vol

gdb backtrace:
====================================
gdb /usr/sbin/glusterfsd /core.28935
GNU gdb 6.7.1
Copyright (C) 2007 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib/libthread_db.so.1".
Reading symbols from /usr/lib64/libglusterfs.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libglusterfs.so.0
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /usr/lib64/glusterfs/1.3.12/xlator/storage/posix.so...done.
Loaded symbols for /usr/lib64/glusterfs/1.3.12/xlator/storage/posix.so
Reading symbols from /usr/lib64/glusterfs/1.3.12/xlator/protocol/client.so...done.
Loaded symbols for /usr/lib64/glusterfs/1.3.12/xlator/protocol/client.so
Reading symbols from /usr/lib64/glusterfs/1.3.12/xlator/cluster/afr.so...done.
Loaded symbols for /usr/lib64/glusterfs/1.3.12/xlator/cluster/afr.so
Reading symbols from /usr/lib64/glusterfs/1.3.12/xlator/performance/io-threads.so...done.
Loaded symbols for /usr/lib64/glusterfs/1.3.12/xlator/performance/io-threads.so
Reading symbols from /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so...done.
Loaded symbols for /usr/lib64/glusterfs/1.3.12/xlator/protocol/server.so
Reading symbols from /usr/lib64/glusterfs/1.3.12/transport/tcp/client.so...done.
Loaded symbols for /usr/lib64/glusterfs/1.3.12/transport/tcp/client.so
Reading symbols from /usr/lib64/glusterfs/1.3.12/transport/tcp/server.so...done.
Loaded symbols for /usr/lib64/glusterfs/1.3.12/transport/tcp/server.so
Reading symbols from /usr/lib64/glusterfs/1.3.12/auth/ip.so...done.
Loaded symbols for /usr/lib64/glusterfs/1.3.12/auth/ip.so
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib64/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
Reading symbols from /lib64/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib/libgcc_s.so.1
Core was generated by `[glusterfs] '.
Program terminated with signal 11, Segmentation fault.
#0  0x00002aaaaacd37d0 in afr_close () from /usr/lib64/glusterfs/1.3.12/xlator/cluster/afr.so
(gdb) q
====================================
end of backtrace

Thanks for glusterfs.

Regards,

Matt McCowan
sysadmin
RPS MetOcean
Perth, Western Australia
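P.S. If a backtrace with symbols would help, I'm happy to rebuild. I'm assuming the usual autotools build with debugging flags is the way to do it on this box; I haven't actually tried this against the 1.3.12 tree yet, so treat the following as a guess rather than a recipe:

  # rebuild with debug info and no optimisation (assumed, untested here)
  CFLAGS="-g -O0" ./configure --prefix=/usr
  make && make install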