Hi, I have two machines running a simple replicate volume to provide highly available storage for KVM virtual machines. As soon as auto-healing starts, GlusterFS blocks the VM's storage access (writes appear to be what triggers it), leaving the whole virtual machine hanging. I can reproduce this bug on both ext3 and ext4 filesystems, on physical machines as well as on VMs. Any help would be appreciated; we have to run the VMs without GlusterFS at the moment because of this problem :-(

More on my config:

* Ubuntu 10.04 Server 64bit
* Kernel 2.6.32-21-server
* FUSE 2.8.1
* GlusterFS v3.0.2

How to reproduce (a scripted version of these steps is sketched at the end of this mail):

* Run GlusterFS replicate across 2 nodes
* Start a KVM virtual machine with its disk file on GlusterFS
* Stop glusterfsd on one node
* Make changes to the disk file
* Bring glusterfsd back online; auto-healing starts (log message: "replicate: no missing files - /image.raw. proceeding to metadata check")
* As soon as the VM starts writing data, it is blocked until auto-healing finishes, making it completely unresponsive

Message from the kernel (printed several times while healing):

INFO: task kvm:7774 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kvm           D 00000000ffffffff     0  7774      1 0x00000000
 ffff8801adcd9e48 0000000000000082 0000000000015bc0 0000000000015bc0
 ffff880308d9df80 ffff8801adcd9fd8 0000000000015bc0 ffff880308d9dbc0
 0000000000015bc0 ffff8801adcd9fd8 0000000000015bc0 ffff880308d9df80
Call Trace:
 [<ffffffff8153f867>] __mutex_lock_slowpath+0xe7/0x170
 [<ffffffff8153f75b>] mutex_lock+0x2b/0x50
 [<ffffffff8123a1d1>] fuse_file_llseek+0x41/0xe0
 [<ffffffff8114238a>] vfs_llseek+0x3a/0x40
 [<ffffffff81142fd6>] sys_lseek+0x66/0x80
 [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b

Gluster configuration:

### glusterfsd.vol ###
volume posix
  type storage/posix
  option directory /data/export
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 16
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option transport.socket.nodelay on
  option transport.socket.bind-address 192.168.158.141
  option auth.addr.brick.allow 192.168.158.*
  subvolumes brick
end-volume

### glusterfs.vol ###
volume gluster1
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.158.141
  option remote-subvolume brick
end-volume

volume gluster2
  type protocol/client
  option transport-type tcp
  option remote-host 192.168.158.142
  option remote-subvolume brick
end-volume

volume replicate
  type cluster/replicate
  subvolumes gluster1 gluster2
end-volume

### fstab ###
/etc/glusterfs/glusterfs.vol /mnt/glusterfs glusterfs log-level=DEBUG,direct-io-mode=disable 0 0

(The mount cycle I use on the client is also shown at the end of this mail.)

I read that you wanted users to kill -11 the glusterfs process for more debug info, so here it is:

pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)

patchset: v3.0.2
signal received: 11
time of crash: 2010-09-28 11:14:31
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.2
/lib/libc.so.6(+0x33af0)[0x7f0c6bf0eaf0]
/lib/libc.so.6(epoll_wait+0x33)[0x7f0c6bfc1c93]
/usr/lib/libglusterfs.so.0(+0x2e261)[0x7f0c6c6ac261]
glusterfs(main+0x852)[0x4044f2]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7f0c6bef9c4d]
glusterfs[0x402ab9]
---------
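
Appendix: the shell steps I use to reproduce, as promised above. This is a minimal sketch of my test cycle; the image path /mnt/glusterfs/image.raw and the dd sizes are just what I happen to use, and I stop the brick with killall and restart glusterfsd directly with -f rather than through an init script, so adjust for your setup:

  # on node2 (192.168.158.142): take one replica offline
  killall glusterfsd

  # on node1, while node2 is down: dirty the disk file
  # (conv=notrunc writes in place without truncating the image)
  dd if=/dev/zero of=/mnt/glusterfs/image.raw bs=1M count=100 conv=notrunc

  # on node2: bring the brick back with its server volfile
  glusterfsd -f /etc/glusterfs/glusterfsd.vol

  # on node1: the first lookup on the file kicks off self-heal; from here on,
  # any write from inside the VM hangs until healing completes
  stat /mnt/glusterfs/image.raw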
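
And the client mount cycle mentioned above, which just re-reads the fstab entry shown earlier (so log-level=DEBUG and direct-io-mode=disable are picked up from there):

  umount /mnt/glusterfs
  mount /mnt/glusterfs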