Hi, I have two machines running a simple replicate volume to provide highly available storage for KVM virtual machines. As soon as auto-healing starts, GlusterFS blocks the VM's storage access (writes appear to be what triggers it), leaving the whole virtual machine hanging. I can reproduce this bug on both ext3 and ext4 filesystems, on physical machines as well as on VMs. Any help would be appreciated; we have to run the VMs without GlusterFS at the moment because of this problem :-(

More on my config:

* Ubuntu 10.04 Server 64bit
* Kernel 2.6.32-21-server
* FUSE 2.8.1
* GlusterFS v3.0.2

How to reproduce (a scripted version of these steps is sketched at the end of this mail):

* Run GlusterFS replicate across 2 nodes
* Start a KVM virtual machine with its disk file on GlusterFS
* Stop glusterfsd on one node
* Make changes to the disk file
* Bring glusterfsd back online; auto-healing starts (log message: "replicate: no missing files - /image.raw. proceeding to metadata check")
* As soon as the VM starts writing data, it is blocked until auto-healing finishes, making it completely unresponsive

Message from the kernel (printed several times while healing):

INFO: task kvm:7774 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kvm           D 00000000ffffffff     0  7774      1 0x00000000
 ffff8801adcd9e48 0000000000000082 0000000000015bc0 0000000000015bc0
 ffff880308d9df80 ffff8801adcd9fd8 0000000000015bc0 ffff880308d9dbc0
 0000000000015bc0 ffff8801adcd9fd8 0000000000015bc0 ffff880308d9df80
Call Trace:
 [<ffffffff8153f867>] __mutex_lock_slowpath+0xe7/0x170
 [<ffffffff8153f75b>] mutex_lock+0x2b/0x50
 [<ffffffff8123a1d1>] fuse_file_llseek+0x41/0xe0
 [<ffffffff8114238a>] vfs_llseek+0x3a/0x40
 [<ffffffff81142fd6>] sys_lseek+0x66/0x80
 [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b

Gluster configuration:

### glusterfsd.vol ###
volume posix
  type storage/posix
  option directory /data/export
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 16
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option transport.socket.nodelay on
  option transport.socket.bind-address 192.168.158.141
  option auth.addr.brick.allow 192.168.158.*
  subvolumes brick
end-volume

### glusterfs.vol ###
volume gluster1
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.158.141
  option remote-subvolume brick
end-volume

volume gluster2
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.158.142
  option remote-subvolume brick
end-volume

volume replicate
  type cluster/replicate
  subvolumes gluster1 gluster2
end-volume

### fstab ###
/etc/glusterfs/glusterfs.vol /mnt/glusterfs glusterfs log-level=DEBUG,direct-io-mode=disable 0 0

(The mount cycle I use on the client is also shown at the end of this mail.)

I read that you wanted users to kill -11 the glusterfs process for more debug info, so here it is:

pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)

patchset: v3.0.2
signal received: 11
time of crash: 2010-09-28 11:14:31
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.2
/lib/libc.so.6(+0x33af0)[0x7f0c6bf0eaf0]
/lib/libc.so.6(epoll_wait+0x33)[0x7f0c6bfc1c93]
/usr/lib/libglusterfs.so.0(+0x2e261)[0x7f0c6c6ac261]
glusterfs(main+0x852)[0x4044f2]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7f0c6bef9c4d]
glusterfs[0x402ab9]
---------
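
Appendix: the shell steps I use to reproduce, as promised above. This is a minimal sketch of my test cycle; the image path /mnt/glusterfs/image.raw and the dd sizes are just what I happen to use, and I stop the brick with killall and restart glusterfsd directly with -f rather than through an init script, so adjust for your setup:

  # on node2 (192.168.158.142): take one replica offline
  killall glusterfsd

  # on node1, while node2 is down: dirty the disk file
  # (conv=notrunc writes in place without truncating the image)
  dd if=/dev/zero of=/mnt/glusterfs/image.raw bs=1M count=100 conv=notrunc

  # on node2: bring the brick back with its server volfile
  glusterfsd -f /etc/glusterfs/glusterfsd.vol

  # on node1: the first lookup on the file kicks off self-heal; from here on,
  # any write from inside the VM hangs until healing completes
  stat /mnt/glusterfs/image.raw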
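
And the client mount cycle mentioned above, which just re-reads the fstab entry shown earlier (so log-level=DEBUG and direct-io-mode=disable are picked up from there):

  umount /mnt/glusterfs
  mount /mnt/glusterfs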