Re: Parallel readdir from NFS clients causes incorrect data

Anand Avati <anand.avati@xxxxxxxxx> · Wed, 3 Apr 2013 16:57:00 -0700

Here's a patch on top of today's git HEAD, if you can try it: http://review.gluster.org/4774/
Thanks!
Avati

On Wed, Apr 3, 2013 at 4:35 PM, Anand Avati <anand.avati@xxxxxxxxx> wrote:

Hmm, I was tempted to suggest that you were bitten by the gluster/ext4 readdir d_off incompatibility issue (which was recently fixed: http://review.gluster.org/4711/). But you say it works fine when you run ls on one client at a time, sequentially.

I just realized after reading your email that, because glusterfs uses the same anonymous fd for readdir queries from multiple clients/applications, we have a race in the posix translator: two threads seek and read the same backend cursor in an uncoordinated way, resulting in duplicate or lost entries. This might be the issue you are seeing, though I am just guessing.
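
To make the suspected race concrete, here is a minimal Python sketch. It is a toy model, not the posix translator code: SharedDirCursor, racy_reader, atomic_reader and run_interleaved are hypothetical names, and the interleaving is simulated with generators rather than real threads so the outcome is repeatable. It assumes each reader resumes from the offset returned with its last chunk, and it only illustrates why doing seek and read on a shared cursor as two separate steps can lose entries. The actual fix may well take a different approach.

```python
from threading import Lock


class SharedDirCursor:
    """Toy stand-in for one backend directory stream shared by all readers."""

    def __init__(self, entries):
        self.entries = list(entries)
        self.pos = 0
        self.lock = Lock()

    def seek(self, offset):
        self.pos = offset

    def read(self, count):
        # Return up to `count` entries plus the offset to resume from.
        chunk = self.entries[self.pos:self.pos + count]
        self.pos += len(chunk)
        return chunk, self.pos

    def seek_and_read(self, offset, count):
        # Seek and read as one unit, so no other reader can move the
        # cursor in between (the lock matters once real threads are used).
        with self.lock:
            self.seek(offset)
            return self.read(count)


def racy_reader(cursor, count, seen):
    offset = 0
    while True:
        cursor.seek(offset)
        yield  # another reader may move the shared cursor here
        chunk, offset = cursor.read(count)
        if not chunk:
            return
        seen.extend(chunk)
        yield


def atomic_reader(cursor, count, seen):
    offset = 0
    while True:
        chunk, offset = cursor.seek_and_read(offset, count)
        if not chunk:
            return
        seen.extend(chunk)
        yield


def run_interleaved(make_reader, entries, count=2, readers=2):
    """Step every reader round-robin against a single shared cursor."""
    cursor = SharedDirCursor(entries)
    results = [[] for _ in range(readers)]
    active = [make_reader(cursor, count, seen) for seen in results]
    while active:
        for gen in list(active):
            try:
                next(gen)
            except StopIteration:
                active.remove(gen)
    return results
```

Calling run_interleaved(racy_reader, ...) on a six-entry directory leaves each reader with an incomplete listing, while run_interleaved(atomic_reader, ...) gives every reader the full listing. Other interleavings of the racy version (one reader seeking backwards just before another reads) would produce duplicates instead of gaps.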

Would you be willing to try out a source code patch on top of the git HEAD, rebuild your glusterfs, and verify whether it fixes the issue? I would really appreciate it!

Thanks,

Avati

On Wed, Apr 3, 2013 at 2:37 PM, Michael Brown <michael@xxxxxxxxxxxx> wrote:

    I'm seeing a problem on my fairly fresh RHEL gluster install. Smells
    to me like a parallelism problem on the server.

    If I mount a gluster volume via NFS (using glusterd's internal NFS
    server, not nfs-kernel-server) and read a directory from multiple
    clients *in parallel*, I get inconsistent results across servers.
    Some files are missing from the directory listing, some may be
    present twice!

    Exactly which files (or directories!) are missing/duplicated varies
    each time. But I can very consistently reproduce the behaviour.

    You can see a screenshot here: http://imgur.com/JU8AFrt

    The reproduction steps are:

    * clusterssh to each NFS client

    * unmount /gv0 (to clear cache)

    * mount /gv0 [1]

    * ls -al /gv0/common/apache-jmeter-2.9/bin
    (which is where I first noticed this)

    Here's the rub: if, instead of doing the 'ls' in parallel, I do it
    in series, it works just fine (consistent correct results
    everywhere). But hitting the gluster server from multiple clients at
    the same time causes problems.

    I can still stat() and open() the files missing from the directory
    listing, they just don't show up in an enumeration.
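
The symptom described above (entries missing from an enumeration, or present twice) can be checked mechanically instead of by eye. The following is a small Python sketch; check_listing is a hypothetical helper, and the "expected" names are assumed to come from a listing known to be good, for example one taken with the serial ls that works.

```python
from collections import Counter


def check_listing(listed, expected):
    """Compare one directory enumeration against a known-good name list.

    Returns (duplicated, missing): names that appear more than once in
    `listed`, and names in `expected` that `listed` lacks.
    """
    counts = Counter(listed)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    missing = sorted(set(expected) - set(counts))
    return duplicated, missing
```

On a client, `listed` could be the result of os.listdir("/gv0/common/apache-jmeter-2.9/bin") taken during the parallel run, with `expected` saved from a serial run on the same mount.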

    Mounting gv0 as a gluster client filesystem works just fine.

    Details of my setup:

    2 × gluster servers: 2×E5-2670, 128GB RAM, RHEL 6.4 64-bit,
    glusterfs-server-3.3.1-1.el6.x86_64 (from EPEL)

    4 × NFS clients: 2×E5-2660, 128GB RAM, RHEL 5.7 64-bit,
    glusterfs-3.3.1-11.el5 (from kkeithley's repo, only used for
    testing)

    gv0 volume information is below

    bricks are 400GB SSDs with ext4[2]

    common network is 10GbE, replication between servers happens over
    direct 10GbE link.

    I will be testing on xfs/btrfs/zfs eventually, but for now I'm on
    ext4. 

    Also attached is my chatlog from asking about this in #gluster

    [1]: fstab line is: fearless1:/gv0 /gv0 nfs
      defaults,sync,tcp,wsize=8192,rsize=8192 0 0

    [2]: yes, I've turned off dir_index to avoid That Bug. I've run the
    d_off test, results are here: http://pastebin.com/zQt5gZnZ

    ----

    gluster> volume info gv0

    Volume Name: gv0
    Type: Distributed-Replicate
    Volume ID: 20117b48-7f88-4f16-9490-a0349afacf71
    Status: Started
    Number of Bricks: 8 x 2 = 16
    Transport-type: tcp
    Bricks:
    Brick1: fearless1:/export/bricks/500117310007a6d8/glusterdata
    Brick2: fearless2:/export/bricks/500117310007a674/glusterdata
    Brick3: fearless1:/export/bricks/500117310007a714/glusterdata
    Brick4: fearless2:/export/bricks/500117310007a684/glusterdata
    Brick5: fearless1:/export/bricks/500117310007a7dc/glusterdata
    Brick6: fearless2:/export/bricks/500117310007a694/glusterdata
    Brick7: fearless1:/export/bricks/500117310007a7e4/glusterdata
    Brick8: fearless2:/export/bricks/500117310007a720/glusterdata
    Brick9: fearless1:/export/bricks/500117310007a7ec/glusterdata
    Brick10: fearless2:/export/bricks/500117310007a74c/glusterdata
    Brick11: fearless1:/export/bricks/500117310007a838/glusterdata
    Brick12: fearless2:/export/bricks/500117310007a814/glusterdata
    Brick13: fearless1:/export/bricks/500117310007a850/glusterdata
    Brick14: fearless2:/export/bricks/500117310007a84c/glusterdata
    Brick15: fearless1:/export/bricks/500117310007a858/glusterdata
    Brick16: fearless2:/export/bricks/500117310007a8f8/glusterdata
    Options Reconfigured:
    diagnostics.count-fop-hits: on
    diagnostics.latency-measurement: on
    nfs.disable: off

    ----

    -- 
Michael Brown               | `One of the main causes of the fall of
Systems Consultant          | the Roman Empire was that, lacking zero,
Net Direct Inc.             | they had no way to indicate successful
☎: +1 519 883 1172 x5106    | termination of their C programs.' - Firth
