What distribution, kernel, and GlusterFS versions are you using? I've had
problems with some kernels between 2.6.31 and 2.6.32-r10 (-r10 is a Gentoo
release number) with GlusterFS 3.0.4 inside a KVM guest with KVM versions
< 0.12.4. Kernel 2.6.30 or the latest 2.6.32-r11 together with KVM 0.12.4
seem to work fine now.

We have also had some NFSv4 problems with the kernels mentioned above;
2.6.30 and 2.6.32-r11 don't have this problem. The cause was a race
condition in KVM's virtio-net driver. Maybe you have a problem with your
network (driver) too.

A few weeks ago we moved a directory used by a Drupal cache module from
GlusterFS to NFSv4, since we'd had a lot of problems with it. There was
also heavy writing and reading in this directory. By now I'm not sure
anymore whether that was a kernel or a GlusterFS problem; I've never
tested it again with GlusterFS. Some other directories that are mostly
read from are working fine.
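For illustration, the move itself was just an ordinary NFSv4 mount over
that directory, something like the following (the server name and both
paths here are made-up placeholders, not our actual setup):

  # hypothetical NFSv4 mount over the old cache directory;
  # nfs01.domain, /drupal-cache and /var/www/cache are placeholders
  mount -t nfs4 nfs01.domain:/drupal-cache /var/www/cache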
- Robert

On 06/22/10 17:40, Hugo Cisneiros (Eitch) wrote:
> Hi,
>
> We have a web server installation that is using GlusterFS. There are 3
> GlusterFS servers and 4 GlusterFS clients accessing one share. We're
> having a problem where, randomly, the httpd processes begin to hang
> while waiting for I/O. This starts with the following kernel call
> trace:
>
> kernel: INFO: task httpd.worker:18974 blocked for more than 120 seconds.
> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kernel: httpd.worker D ffffffff80150462 0 18974 4818 18975 18972 (NOTLB)
> kernel: ffff8108c0ee7c78 0000000000000086 ffff8101f3b2b000 ffff810116d0f200
> kernel: ffff81037392c6d8 000000000000000a ffff8106b73d57a0 ffff810116dd4100
> kernel: 00004f0baea8a21e 00000000000026f5 ffff8106b73d5988 0000000300000000
> kernel: Call Trace:
> kernel: [<ffffffff88443cda>] :fuse:fuse_dentry_revalidate+0x13e/0x1c1
> kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
> kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
> kernel: [<ffffffff8000cec2>] do_lookup+0x90/0x1e6
> kernel: [<ffffffff80009bb2>] __link_path_walk+0x3a6/0xf42
> kernel: [<ffffffff8000e9e2>] link_path_walk+0x42/0xb2
> kernel: [<ffffffff8000ccb2>] do_path_lookup+0x275/0x2f1
> kernel: [<ffffffff8001280e>] getname+0x15b/0x1c2
> kernel: [<ffffffff80023876>] __user_walk_fd+0x37/0x4c
> kernel: [<ffffffff80028846>] vfs_stat_fd+0x1b/0x4a
> kernel: [<ffffffff8000c5fe>] _atomic_dec_and_lock+0x39/0x57
> kernel: [<ffffffff800235a8>] sys_newstat+0x19/0x31
> kernel: [<ffffffff80012b54>] __fput+0x191/0x1bd
> kernel: [<ffffffff8002c9e4>] mntput_no_expire+0x19/0x89
> kernel: [<ffffffff80023b84>] filp_close+0x5c/0x64
> kernel: [<ffffffff8001dfa6>] sys_close+0x88/0xbd
> kernel: [<ffffffff8005d116>] system_call+0x7e/0x83
>
> Looking at the GlusterFS client logs, the following entries begin to
> show up before this call trace appears:
>
> [fuse-bridge.c:585:fuse_lookup] glusterfs-fuse: 89023783: LOOKUP 46912563008752/2010 (fuse_loc_fill() failed)
> [fuse-bridge.c:793:fuse_getattr] glusterfs-fuse: 89024196: GETATTR 46912563008752 (fuse_loc_fill() failed)
> [fuse-bridge.c:793:fuse_getattr] glusterfs-fuse: 89024203: GETATTR 46912563008752 (fuse_loc_fill() failed)
> [fuse-bridge.c:793:fuse_getattr] glusterfs-fuse: 89024207: GETATTR 46912563008752 (fuse_loc_fill() failed)
> [fuse-bridge.c:793:fuse_getattr] glusterfs-fuse: 89024210: GETATTR 46912563008752 (fuse_loc_fill() failed)
> [fuse-bridge.c:793:fuse_getattr] glusterfs-fuse: 89024213: GETATTR 46912563008752 (fuse_loc_fill() failed)
> [fuse-bridge.c:793:fuse_getattr] glusterfs-fuse: 89024218: GETATTR 46912563008752 (fuse_loc_fill() failed)
>
> The same message repeats 1116 times. Talking with a friend, we noticed
> that the 4 web servers are reading *and* writing very concurrently,
> and because of this, many lookups, reads, and unlinks return "no such
> file or directory" in the log, like these:
>
> [fuse-bridge.c:859:fuse_fd_cbk] glusterfs-fuse: 88969079: OPEN() /cache/70ea5f0d4df01648be4f70d9813f24b7.meta => -1 (No such file or directory)
> [fuse-bridge.c:1215:fuse_unlink_cbk] glusterfs-fuse: 88969080: UNLINK() /cache/70ea5f0d4df01648be4f70d9813f24b7.meta => -1 (No such file or directory)
> [fuse-bridge.c:491:fuse_entry_cbk] glusterfs-fuse: LOOKUP(/cache/index.html) inode (ptr=0x2aaab313f9f0, ino=1528908, gen=5480759143103581058) found conflict (ptr=0x2aaadebc8690, ino=1528908, gen=5480759143103581058)
>
> After the kernel call trace appears, the FUSE filesystem becomes very
> unresponsive and unstable. Any process that tries to read the mounted
> directory randomly gets stuck in I/O wait and hangs forever. We know
> that GlusterFS is maybe not made for this kind of highly concurrent
> writing and reading, so we're working on an alternative solution for
> this workload (it's a disk cache; we're moving it to something like
> memcached). But maybe this error helps to find something unexpected
> going on... :)
>
> Here's the client configuration:
>
> volume vol01
>   type protocol/client
>   option transport-type tcp
>   option remote-host server01.domain
>   option remote-subvolume vol
> end-volume
>
> volume vol02
>   type protocol/client
>   option transport-type tcp
>   option remote-host server02.domain
>   option remote-subvolume vol
> end-volume
>
> volume vol03
>   type protocol/client
>   option transport-type tcp
>   option remote-host server03.domain
>   option remote-subvolume vol
> end-volume
>
> volume vol-mirror
>   type cluster/replicate
>   subvolumes vol01 vol02 vol03
> end-volume
>
> volume vol-iocache
>   type performance/io-cache
>   option cache-size 128MB
>   option cache-timeout 2
>   subvolumes vol-mirror
> end-volume
>
> volume vol-quickread
>   type performance/quick-read
>   option cache-timeout 1
>   option max-file-size 64kB
>   subvolumes vol-iocache
> end-volume
>
> volume vol-writebehind
>   type performance/write-behind
>   option cache-size 4MB
>   subvolumes vol-quickread
> end-volume
>
> volume vol-statprefetch
>   type performance/stat-prefetch
>   subvolumes vol-writebehind
> end-volume
>
> Thanks,
>
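Since only the client volfile is quoted, here is, for comparison, a
minimal sketch of what a matching server-side volfile on server01.domain
through server03.domain could look like for GlusterFS 3.0.x. Only the
exported volume name "vol" is taken from the client configuration above;
the export directory, thread count, and auth rule are assumptions:

  # storage/posix backend; /data/export is an assumed path
  volume posix
    type storage/posix
    option directory /data/export
  end-volume

  # POSIX locking, needed under cluster/replicate
  volume locks
    type features/locks
    subvolumes posix
  end-volume

  # io-threads so concurrent clients don't serialize; "vol" is the
  # name the clients request via "option remote-subvolume vol"
  volume vol
    type performance/io-threads
    option thread-count 8
    subvolumes locks
  end-volume

  volume server
    type protocol/server
    option transport-type tcp
    option auth.addr.vol.allow *
    subvolumes vol
  end-volume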
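On the concurrent read/write/unlink pattern: the failed OPEN()/UNLINK()
calls above are what you would expect when one client removes a cache
entry while another client is just opening it. A hypothetical two-client
loop like this (assuming the share is mounted at /mnt/gluster) should
produce the same kind of log noise:

  # client A: keep rewriting and removing one cache entry
  while true; do
      echo data > /mnt/gluster/cache/test.meta
      rm -f /mnt/gluster/cache/test.meta
  done

  # client B: keep opening the same entry; the failed opens show up
  # in the client log as OPEN() ... (No such file or directory)
  while true; do
      cat /mnt/gluster/cache/test.meta > /dev/null 2>&1
  done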
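And once the mount gets into the hung state, it can help to list every
process stuck in uninterruptible sleep (D state) together with the
kernel function it is blocked in; httpd workers sitting in a fuse_*
function point at the GlusterFS mount. A small sketch with standard
procps options (the wchan column width is an arbitrary choice):

  # list D-state (uninterruptible sleep) processes and the kernel
  # function each one is currently blocked in
  ps axo pid,stat,wchan:30,comm | awk 'NR == 1 || $2 ~ /^D/'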