Re: Parallel readdir from NFS clients causes incorrect data

Michael Brown <michael@xxxxxxxxxxxx> · Thu, 04 Apr 2013 12:31:49 -0400



    I'm not quite keen on trying HEAD on
      these servers yet, but I did grab the source package from
      http://repos.fedorapeople.org/repos/kkeithle/glusterfs/epel-6Server/SRPMS/
      and apply the patch manually.

      
      Much better! Looks like that did the trick.

      
      M.

      
      On 13-04-03 07:57 PM, Anand Avati wrote:

    
    Here's a patch on top of today's git HEAD, if you can
      try - http://review.gluster.org/4774/
      

      Thanks!
      Avati

        
        On Wed, Apr 3, 2013 at 4:35 PM, Anand
          Avati <anand.avati@xxxxxxxxx>
          wrote:

          Hmm, I would
            be tempted to suggest that you were bitten by the
            gluster/ext4 readdir's d_off incompatibility issue (which
            got recently fixed http://review.gluster.org/4711/).
            But you say it works fine when you do ls one at a time
            sequentially.
            
              
            I just realized after reading your email that, in
              glusterfs, because we use the same anonymous fd for
              multiple client/application's readdir query, we have a
              race in the posix translator where two threads attempt to
              push/pull the same backend cursor in a chaotic way
              resulting in duplicate/lost entries. This might be the
              issue you are seeing, just guessing.
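
A minimal sketch of the kind of race described above, modelled in Python rather than taken from the glusterfs posix translator. The class `Backend`, its methods, and the hand-written schedule are all hypothetical. The locked variant is only one possible way to serialize access to the cursor and is not necessarily what the eventual patch does.

```python
import threading


class Backend:
    """Toy directory stream with one cursor, standing in for a shared fd."""

    def __init__(self, names):
        self.names = list(names)
        self.pos = 0                      # the single shared cursor
        self.lock = threading.Lock()

    def seek(self, offset):
        self.pos = offset

    def read(self, count):
        chunk = self.names[self.pos:self.pos + count]
        self.pos += len(chunk)
        return chunk

    def read_at(self, offset, count):
        # Serialized variant: seek and read form one critical section.
        with self.lock:
            self.seek(offset)
            return self.read(count)


def racy_interleaving(backend):
    """Replay one bad schedule by hand: A seeks, B seeks, A reads, B reads."""
    backend.seek(0)            # request A wants entries from offset 0
    backend.seek(2)            # request B wants entries from offset 2
    a = backend.read(2)        # A reads from B's position instead of its own
    b = backend.read(2)        # B reads from wherever A left the cursor
    return a, b


def locked_requests(backend):
    """The same two requests, each positioned and read under the lock."""
    return backend.read_at(0, 2), backend.read_at(2, 2)


names = ["a", "b", "c", "d", "e", "f"]
print(racy_interleaving(Backend(names)))   # A gets B's range, B skips ahead
print(locked_requests(Backend(names)))     # each request gets its own range
```

The schedule is replayed by hand instead of with real threads so that the bad interleaving is deterministic. With real threads the same outcome would appear only some of the time, which would match results that vary from run to run.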
            

            Would you be willing to try out a source code patch on
              top of the git HEAD to rebuild your glusterfs and verify
              if it fixes the issue? Will really appreciate it!
            

            Thanks,
            
              Avati

              
                  On Wed, Apr 3, 2013 at 2:37 PM,
                    Michael Brown <michael@xxxxxxxxxxxx>
                    wrote:

                  
                       I'm seeing
                        a problem on my fairly fresh RHEL gluster
                        install. Smells to me like a parallelism problem
                        on the server.

                        
                        If I mount a gluster volume via NFS (using
                        glusterd's internal NFS server, not
                        nfs-kernel-server) and read a directory from
                        multiple clients *in parallel*, I get
                        inconsistent results across servers. Some files
                        are missing from the directory listing, some may
                        be present twice!

                        
                        Exactly which files (or directories!) are
                        missing/duplicated varies each time. But I can
                        very consistently reproduce the behaviour.

                        
                        You can see a screenshot here: http://imgur.com/JU8AFrt

                        
                        The steps to reproduce are:

                        * clusterssh to each NFS client

                        * unmount /gv0 (to clear cache)

                        * mount /gv0 [1]

                        * ls -al /gv0/common/apache-jmeter-2.9/bin
                        (which is where I first noticed this)

                        
                        Here's the rub: if, instead of doing the 'ls' in
                        parallel, I do it in series, it works just fine
                        (consistent correct results everywhere). But
                        hitting the gluster server from multiple clients
                        at the same time causes problems.
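
One way to compare a parallel run against a serial one, as a rough sketch. It assumes each client's listing has already been captured as a list of names (for example from `ls -a`). The function `diff_listing` and the file names below are hypothetical placeholders, not data from this report.

```python
from collections import Counter


def diff_listing(reference, observed):
    """Return (missing, duplicated) names in `observed` relative to `reference`."""
    counts = Counter(observed)
    missing = sorted(set(reference) - set(counts))
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return missing, duplicated


# Hypothetical data: one serial listing used as the reference, plus what
# two clients returned during a parallel run.
reference = ["file1", "file2", "file3", "file4"]
parallel = {
    "client1": ["file1", "file2", "file2", "file4"],
    "client2": ["file1", "file2", "file3", "file4"],
}

for client, observed in sorted(parallel.items()):
    missing, duplicated = diff_listing(reference, observed)
    print(client, "missing:", missing, "duplicated:", duplicated)
```

Using a known-good serial listing as the reference keeps the check independent of which entries happen to go missing on a given run.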

                        
                        I can still stat() and open() the files missing
                        from the directory listing; they just don't show
                        up in an enumeration.
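
The "reachable but not enumerated" check can be sketched like this. The helper `hidden_entries` is hypothetical, and it needs a list of names that are expected to exist, taken from some trusted source such as a listing made on the brick or over a native gluster mount.

```python
import os
import tempfile


def hidden_entries(directory, expected_names):
    """Names that lstat() can reach but that os.listdir() did not return."""
    listed = set(os.listdir(directory))
    hidden = []
    for name in expected_names:
        if name in listed:
            continue
        try:
            os.lstat(os.path.join(directory, name))
        except FileNotFoundError:
            continue          # really absent, not merely unlisted
        hidden.append(name)
    return hidden


# Self-check on a local temporary directory, where nothing should be hidden.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "present"), "w").close()
    result = hidden_entries(d, ["present", "absent"])
    print(result)
```

On a healthy filesystem the result is an empty list. A non-empty list on the NFS mount would correspond to the behaviour described here, where lookup by name succeeds and enumeration does not.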

                        
                        Mounting gv0 as a gluster client filesystem
                        works just fine.

                        
                        Details of my setup:

                        2 × gluster servers: 2×E5-2670, 128GB RAM, RHEL
                        6.4 64-bit, glusterfs-server-3.3.1-1.el6.x86_64
                        (from EPEL)

                        4 × NFS clients: 2×E5-2660, 128GB RAM, RHEL 5.7
                        64-bit, glusterfs-3.3.1-11.el5 (from kkeithley's
                        repo, only used for testing)

                        gv0 volume information is below

                        bricks are 400GB SSDs with ext4[2]

                        common network is 10GbE, replication between
                        servers happens over direct 10GbE link.

                        
                        I will be testing on xfs/btrfs/zfs eventually,
                        but for now I'm on ext4. 

                        
                        Also attached is my chatlog from asking about
                        this in #gluster

                        
                        [1]: fstab line is: fearless1:/gv0 /gv0 nfs
                          defaults,sync,tcp,wsize=8192,rsize=8192 0 0

                        [2]: yes, I've turned off dir_index to avoid
                        That Bug. I've run the d_off test, results are
                        here: http://pastebin.com/zQt5gZnZ

                        
                        ----

                        gluster> volume info gv0

                        Volume Name: gv0
                        Type: Distributed-Replicate
                        Volume ID: 20117b48-7f88-4f16-9490-a0349afacf71
                        Status: Started
                        Number of Bricks: 8 x 2 = 16
                        Transport-type: tcp
                        Bricks:
                        Brick1: fearless1:/export/bricks/500117310007a6d8/glusterdata
                        Brick2: fearless2:/export/bricks/500117310007a674/glusterdata
                        Brick3: fearless1:/export/bricks/500117310007a714/glusterdata
                        Brick4: fearless2:/export/bricks/500117310007a684/glusterdata
                        Brick5: fearless1:/export/bricks/500117310007a7dc/glusterdata
                        Brick6: fearless2:/export/bricks/500117310007a694/glusterdata
                        Brick7: fearless1:/export/bricks/500117310007a7e4/glusterdata
                        Brick8: fearless2:/export/bricks/500117310007a720/glusterdata
                        Brick9: fearless1:/export/bricks/500117310007a7ec/glusterdata
                        Brick10: fearless2:/export/bricks/500117310007a74c/glusterdata
                        Brick11: fearless1:/export/bricks/500117310007a838/glusterdata
                        Brick12: fearless2:/export/bricks/500117310007a814/glusterdata
                        Brick13: fearless1:/export/bricks/500117310007a850/glusterdata
                        Brick14: fearless2:/export/bricks/500117310007a84c/glusterdata
                        Brick15: fearless1:/export/bricks/500117310007a858/glusterdata
                        Brick16: fearless2:/export/bricks/500117310007a8f8/glusterdata
                        Options Reconfigured:
                        diagnostics.count-fop-hits: on
                        diagnostics.latency-measurement: on
                        nfs.disable: off

                        ----
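
For reading the "8 x 2 = 16" line: a small sketch that groups the brick list into replica sets. It assumes, as is usual for replicated gluster volumes, that consecutive bricks in the listed order form one replica set. The function name `replica_sets` and the shortened `host:id` brick labels are only for illustration.

```python
def replica_sets(bricks, replica_count):
    """Group an ordered brick list into consecutive replica sets."""
    if len(bricks) % replica_count:
        raise ValueError("brick count is not a multiple of the replica count")
    return [bricks[i:i + replica_count]
            for i in range(0, len(bricks), replica_count)]


# First four bricks from the volume info, shortened to host:id for brevity.
bricks = [
    "fearless1:500117310007a6d8",
    "fearless2:500117310007a674",
    "fearless1:500117310007a714",
    "fearless2:500117310007a684",
]

for pair in replica_sets(bricks, 2):
    print(pair)
```

Under that assumption each pair holds one brick on fearless1 and one on fearless2, so the full 16-brick volume would be 8 distribute subvolumes of 2 replicas each.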

                            
                            -- 
Michael Brown               | `One of the main causes of the fall of
Systems Consultant          | the Roman Empire was that, lacking zero,
Net Direct Inc.             | they had no way to indicate successful
☎: +1 519 883 1172 x5106    | termination of their C programs.' - Firth

                          

                  