Hi,

We're observing a coherence issue with GlusterFS 2.0.6. One client opens a file, takes a write lock, truncates it, and writes. Another client waiting on a read lock may see a zero-length file after the read lock is granted. If both nodes read/write in a loop, this tends to happen within a few hundred tries. The same code runs for 10000 loops without a problem if both programs run against GlusterFS on the same node, or against a local ext3 file system on the same node.

Node1 does the following (strace):

2206 1252031615.509555 open("testfile", O_RDWR|O_CREAT|O_LARGEFILE, 0644) = 3
2206 1252031615.514886 fcntl64(3, F_SETLKW64, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0}, 0xbfcaee78) = 0
2206 1252031615.517742 select(0, NULL, NULL, NULL, {0, 0}) = 0 (Timeout)
2206 1252031615.517788 _llseek(3, 0, [0], SEEK_SET) = 0
2206 1252031615.517829 ftruncate64(3, 0) = 0
2206 1252031615.520632 write(3, "01234567890123456789012345678901"..., 900) = 900
2206 1252031615.599782 close(3) = 0
2206 1252031615.604731 open("testfile", O_RDONLY|O_CREAT|O_LARGEFILE, 0644) = 3
2206 1252031615.615158 fcntl64(3, F_SETLKW64, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}, 0xbfcaee78) = 0
2206 1252031615.624680 fstat64(3, {st_dev=makedev(0, 13), st_ino=182932, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=16, st_size=900, st_atime=2009/09/03-19:33:35, st_mtime=2009/09/03-19:33:35, st_ctime=2009/09/03-19:33:35}) = 0
2206 1252031615.624787 _llseek(3, 0, [0], SEEK_SET) = 0
2206 1252031615.624851 read(3, "01234567890123456789012345678901"..., 4096) = 900
2206 1252031615.625126 close(3) = 0

Node2 does the following (strace):

2126 1252031615.504350 open("testfile", O_RDONLY|O_CREAT|O_LARGEFILE, 0644) = 3
2126 1252031615.509004 fcntl64(3, F_SETLKW64, {type=F_RDLCK, whence=SEEK_SET, start=0, len=0}, 0xbfc05dc8) = 0
2126 1252031615.587697 fstat64(3, {st_dev=makedev(0, 13), st_ino=182932, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=0, st_atime=2009/09/03-19:33:35, st_mtime=2009/09/03-19:33:35, st_ctime=2009/09/03-19:33:35}) = 0
2126 1252031615.588027 _llseek(3, 0, [0], SEEK_SET) = 0
2126 1252031615.588089 read(3, "", 4096) = 0
2126 1252031615.588228 close(3) = 0

Both node clocks are NTP-disciplined. As these are virtual machines there is higher clock dispersion, but I believe you can round to the nearest 0.1 s for time correlation. Node2 waits for the write lock to clear before getting its read lock. Node1 then reads the file back and agrees with Node2 on every stat field except st_size; Node2 reads the file and gets no data.

This is on 32-bit CentOS 5 with a 2.6.27 kernel and fuse 2.7.4, running on VMware. The same behaviour is also observed on Amazon EC2 with their 2.6.21 fc8xen kernel.

I can make the problem unreproducible in 10000 tries by changing the select() on Node1 to time out after 0.1 seconds. The problem reproduces in under 5000 tries if the select() timeout is set to 0.01 seconds. This happens whether or not gluster is run with --disable-direct-io-mode. The volume is mirrored between four servers.
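For reference, here is a minimal sketch in C of what the two test programs boil down to. The file name, open flags, locking calls, select() pause, and 900-byte payload are taken from the straces above; the loop count, the writer/reader command-line switch, and the omitted error handling are simplifications, not the exact test code.

/* locktest.c - rough sketch of the test.
 * Run "./locktest write" on Node1 and "./locktest" on Node2.
 * Loop structure and (lack of) error handling are simplified. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/stat.h>

static void lock_whole_file(int fd, short type)
{
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = type;             /* F_WRLCK or F_RDLCK             */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                 /* 0 = lock the whole file        */
    fcntl(fd, F_SETLKW, &fl);     /* blocking lock, as in the strace */
}

/* Node1: take a write lock, truncate, rewrite 900 bytes. */
static void write_pass(const char *path)
{
    char buf[900];
    int i, fd = open(path, O_RDWR | O_CREAT, 0644);
    struct timeval tv = {0, 0};   /* this timeout governs how fast the
                                     problem reproduces (see above)  */

    for (i = 0; i < (int)sizeof buf; i++)
        buf[i] = '0' + i % 10;

    lock_whole_file(fd, F_WRLCK);
    select(0, NULL, NULL, NULL, &tv);
    lseek(fd, 0, SEEK_SET);
    ftruncate(fd, 0);
    write(fd, buf, sizeof buf);
    close(fd);                    /* close releases the lock         */
}

/* Node2 (and Node1's read-back): take a read lock and read the file. */
static ssize_t read_pass(const char *path)
{
    char buf[4096];
    ssize_t n;
    struct stat st;
    int fd = open(path, O_RDONLY | O_CREAT, 0644);

    lock_whole_file(fd, F_RDLCK);
    fstat(fd, &st);               /* Node2 sees st_size == 0 here    */
    lseek(fd, 0, SEEK_SET);
    n = read(fd, buf, sizeof buf);
    close(fd);
    return n;                     /* 0 instead of 900 on failure     */
}

int main(int argc, char **argv)
{
    int i, writer = (argc > 1 && strcmp(argv[1], "write") == 0);

    for (i = 0; i < 10000; i++) {
        if (writer) {
            write_pass("testfile");
            read_pass("testfile");            /* Node1 reads back 900 bytes */
        } else if (read_pass("testfile") == 0) {
            printf("zero-length read on iteration %d\n", i);
            return 1;
        }
    }
    return 0;
}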
Below is the server configuration. The export directory is on ext3.

volume posix
  type storage/posix
  option directory /var/data/export
end-volume

volume locks
  type features/locks
  option mandatory-locks on
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 8
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow *
  subvolumes brick
end-volume

And the client configuration:

volume remote1
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.145
  option remote-subvolume brick
end-volume

volume remote2
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.130
  option remote-subvolume brick
end-volume

volume remote3
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.221
  option remote-subvolume brick
end-volume

volume remote4
  type protocol/client
  option transport-type tcp
  option remote-host 10.10.10.104
  option remote-subvolume brick
end-volume

volume replicated
  type cluster/replicate
  subvolumes remote1 remote2 remote3 remote4
end-volume

volume writebehind
  type performance/write-behind
  subvolumes replicated
end-volume

volume cache
  type performance/io-cache
  subvolumes writebehind
end-volume

The problem persists with those configurations, and also if any or all of the following tweaks are made:

1. Remove the replicated volume and just use remote1.
2. Get rid of the io-threads on the server.
3. Get rid of io-cache and write-behind on the clients.
4. Use mandatory locking on the test file.

Please let me know if there's any more information needed to debug this further, or any guidance on how to avoid it. Thank you!

Cheers,
Rob