Re: Parallel readdir from NFS clients causes incorrect data

I'm not quite keen on trying HEAD on these servers yet, but I did grab the source package from http://repos.fedorapeople.org/repos/kkeithle/glusterfs/epel-6Server/SRPMS/ and apply the patch manually.

Much better! Looks like that did the trick.

M.

On 13-04-03 07:57 PM, Anand Avati wrote:
Here's a patch on top of today's git HEAD, if you can try - http://review.gluster.org/4774/

Thanks!
Avati

On Wed, Apr 3, 2013 at 4:35 PM, Anand Avati <anand.avati@xxxxxxxxx> wrote:
Hmm, I was tempted to suggest that you were bitten by the gluster/ext4 readdir d_off incompatibility issue (which was recently fixed: http://review.gluster.org/4711/). But you say it works fine when you do ls one at a time, sequentially.

I just realized after reading your email that, in glusterfs, because we use the same anonymous fd for readdir queries from multiple clients/applications, we have a race in the posix translator: two threads attempt to push/pull the same backend cursor in a chaotic way, resulting in duplicate/lost entries. This might be the issue you are seeing, but I'm just guessing.
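
The race described above can be illustrated with a small, deterministic Python simulation. This is only a sketch under stated assumptions: `DirCursor` and `simulate` are hypothetical names, not glusterfs code, and the `atomic` mode is just one conceivable way to avoid the bad interleaving, not necessarily what any actual patch does. Two clients page through the same directory via a single shared cursor, and each request is a seek to the client's last offset followed by a read.

```python
class DirCursor:
    """Hypothetical stand-in for one backend directory handle with one position."""

    def __init__(self, entries):
        self.entries = list(entries)
        self.pos = 0

    def seek(self, offset):
        self.pos = offset

    def read(self, count):
        chunk = self.entries[self.pos:self.pos + count]
        self.pos += len(chunk)
        return chunk


def simulate(entries, chunk=2, atomic=False):
    """Let clients A and B list a directory through ONE shared cursor.

    atomic=False: both clients seek first, then both read (the racy order).
    atomic=True:  each client's seek+read pair runs without interruption.
    """
    cursor = DirCursor(entries)
    end = len(entries)
    seen = {"A": [], "B": []}
    offset = {"A": 0, "B": 0}

    def fetch(client):
        got = cursor.read(chunk)
        seen[client].extend(got)
        # The client resumes from wherever the cursor ended up.
        offset[client] = cursor.pos

    def seek_and_read(client):
        cursor.seek(offset[client])
        fetch(client)

    # Assumption for the demo: A starts slightly earlier than B.
    seek_and_read("A")

    while offset["A"] < end or offset["B"] < end:
        active = [c for c in ("A", "B") if offset[c] < end]
        if atomic:
            for c in active:
                seek_and_read(c)
        else:
            for c in active:
                cursor.seek(offset[c])  # the last seek wins
            for c in active:
                fetch(c)                # reads start from the wrong place
    return seen


names = [f"f{i}" for i in range(8)]
racy = simulate(names)
safe = simulate(names, atomic=True)
print("racy A:", racy["A"])
print("racy B:", racy["B"])
print("safe A:", safe["A"])
print("safe B:", safe["B"])
```

With eight entries and the interleaving hard-coded here, client A sees f0 and f1 twice and never sees f2 or f3, while client B misses half of the directory. When each seek+read pair runs uninterrupted, both clients get the complete listing.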

Would you be willing to try out a source code patch on top of the git HEAD, rebuild your glusterfs, and verify whether it fixes the issue? I'd really appreciate it!

Thanks,
Avati

On Wed, Apr 3, 2013 at 2:37 PM, Michael Brown <michael@xxxxxxxxxxxx> wrote:
I'm seeing a problem on my fairly fresh RHEL gluster install. Smells to me like a parallelism problem on the server.

If I mount a gluster volume via NFS (using glusterd's internal NFS server, not nfs-kernel-server) and read a directory from multiple clients *in parallel*, I get inconsistent results across servers. Some files are missing from the directory listing, and some may be present twice!

Exactly which files (or directories!) are missing/duplicated varies each time. But I can very consistently reproduce the behaviour.

You can see a screenshot here: http://imgur.com/JU8AFrt

The steps to reproduce are:
* clusterssh to each NFS client
* unmount /gv0 (to clear cache)
* mount /gv0 [1]
* ls -al /gv0/common/apache-jmeter-2.9/bin (which is where I first noticed this)

Here's the rub: if, instead of doing the 'ls' in parallel, I do it in series, it works just fine (consistent correct results everywhere). But hitting the gluster server from multiple clients at the same time causes problems.
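
To quantify the difference between the serial and parallel runs, a listing captured during a parallel run can be compared against one from a serial run. The following is a minimal sketch, assuming the listings have already been collected as lists of names (for example from `ls` output on each client); `compare_listing` is a hypothetical helper, not part of any gluster tooling.

```python
from collections import Counter


def compare_listing(reference, listing):
    """Return (missing, extra) names in `listing` relative to `reference`.

    Counter subtraction keeps multiplicity, so a name that appears twice
    in `listing` but once in `reference` is reported once in `extra`.
    """
    ref = Counter(reference)
    got = Counter(listing)
    missing = sorted((ref - got).elements())
    extra = sorted((got - ref).elements())
    return missing, extra


# Illustrative data only: "b" is lost and "a" is duplicated.
missing, extra = compare_listing(["a", "b", "c"], ["a", "a", "c"])
print("missing:", missing)
print("extra:", extra)
```

Running this against each client's output after a parallel `ls` would show which names were dropped or duplicated on that client.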

I can still stat() and open() the files missing from the directory listing; they just don't show up in an enumeration.
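
A check along these lines can confirm that symptom: given a list of names known to exist, report those that can be stat()ed but are absent from the enumeration. This is a rough sketch; `stat_only_entries` is a hypothetical helper, and the list of expected names is assumed to come from a known-good serial listing.

```python
import os


def stat_only_entries(directory, expected_names):
    """Names that exist on lookup but are absent from the directory listing."""
    listed = set(os.listdir(directory))
    return sorted(
        name
        for name in expected_names
        if name not in listed
        and os.path.lexists(os.path.join(directory, name))
    )
```

On a healthy mount the result should be empty; a non-empty result means the entries exist but the enumeration skipped them.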

Mounting gv0 as a gluster client filesystem works just fine.

Details of my setup:
2 × gluster servers: 2×E5-2670, 128GB RAM, RHEL 6.4 64-bit, glusterfs-server-3.3.1-1.el6.x86_64 (from EPEL)
4 × NFS clients: 2×E5-2660, 128GB RAM, RHEL 5.7 64-bit, glusterfs-3.3.1-11.el5 (from kkeithley's repo, only used for testing)
gv0 volume information is below
bricks are 400GB SSDs with ext4[2]
common network is 10GbE, replication between servers happens over direct 10GbE link.

I will be testing on xfs/btrfs/zfs eventually, but for now I'm on ext4.

Also attached is my chatlog from asking about this in #gluster

[1]: fstab line is: fearless1:/gv0 /gv0 nfs defaults,sync,tcp,wsize=8192,rsize=8192 0 0
[2]: yes, I've turned off dir_index to avoid That Bug. I've run the d_off test, results are here: http://pastebin.com/zQt5gZnZ

----
gluster> volume info gv0
 
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 20117b48-7f88-4f16-9490-a0349afacf71
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: fearless1:/export/bricks/500117310007a6d8/glusterdata
Brick2: fearless2:/export/bricks/500117310007a674/glusterdata
Brick3: fearless1:/export/bricks/500117310007a714/glusterdata
Brick4: fearless2:/export/bricks/500117310007a684/glusterdata
Brick5: fearless1:/export/bricks/500117310007a7dc/glusterdata
Brick6: fearless2:/export/bricks/500117310007a694/glusterdata
Brick7: fearless1:/export/bricks/500117310007a7e4/glusterdata
Brick8: fearless2:/export/bricks/500117310007a720/glusterdata
Brick9: fearless1:/export/bricks/500117310007a7ec/glusterdata
Brick10: fearless2:/export/bricks/500117310007a74c/glusterdata
Brick11: fearless1:/export/bricks/500117310007a838/glusterdata
Brick12: fearless2:/export/bricks/500117310007a814/glusterdata
Brick13: fearless1:/export/bricks/500117310007a850/glusterdata
Brick14: fearless2:/export/bricks/500117310007a84c/glusterdata
Brick15: fearless1:/export/bricks/500117310007a858/glusterdata
Brick16: fearless2:/export/bricks/500117310007a8f8/glusterdata
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.disable: off
----

-- 
Michael Brown               | `One of the main causes of the fall of
Systems Consultant          | the Roman Empire was that, lacking zero,
Net Direct Inc.             | they had no way to indicate successful
☎: +1 519 883 1172 x5106    | termination of their C programs.' - Firth

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
https://lists.nongnu.org/mailman/listinfo/gluster-devel




