Re: Eager-lock and nfs graph generation

Pranith Kumar K <pkarampu@xxxxxxxxxx> · Wed, 20 Feb 2013 07:41:40 +0530



    On 02/20/2013 07:03 AM, Anand Avati
      wrote:

    
      On Tue, Feb 19, 2013 at 5:12 PM, Anand
        Avati <anand.avati@xxxxxxxxx>
        wrote:

        
              On Tue, Feb 19, 2013 at 3:59 AM, Pranith
                Kumar K <pkarampu@xxxxxxxxxx>
                wrote:

                
                        On 02/19/2013 11:26 AM, Anand Avati wrote:

                        
                          Thinking over this, looks like there is a
                            problem!
                          Write-behind guarantees: That a second
                            write request arriving after the
                            acknowledgement of a first overlapping
                            request (whether written-behind or
                            otherwise) will be guaranteed to be
                            fulfilled in the backend in the same order
                            (i.e., the second overlapping request will be
                            "serialized" behind the first one in the
                            fulfillment process)
                          Eager-lock requirement: That write-behind
                            will send no two write requests on an
                            overlapping region at the same time.
                          The requirement-set and guarantee-set have
                            a big overlap, but the requirement-set is
                            not a subset.
                          This is because of O_SYNC writes.
                            write-behind performs write-serialization at
                            fulfillment only for written behind requests
                            (which get covered under the conflict
                            detection code during liability
                            fulfillment). However, if two threads (or
                            apps) issue overlapping O_SYNC writes to the
                            same region at approximately the same time, then
                            write-behind will let both of them go by
                            without any kind of serialization, into
                            eager lock, violating the assumptions!
                          I'm wondering if it is a safer idea to
                            implement overlap checks within eager-lock
                            code itself rather than depend on
                            write-behind :|
                          Avati
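[Editorial sketch of the overlap-check idea above. This is a toy model, not
eager-lock or AFR code: WriteRange, InFlightWrites, submit() and complete() are
hypothetical names invented for illustration. It assumes the check only has to
ensure that no two overlapping writes are in flight at once, and that a write
which overlaps anything in flight or already queued must wait its turn.]

```python
from dataclasses import dataclass


@dataclass(frozen=True, eq=False)  # eq=False: each request is compared by identity
class WriteRange:
    offset: int
    length: int

    @property
    def end(self) -> int:
        return self.offset + self.length  # exclusive end


def ranges_overlap(a: WriteRange, b: WriteRange) -> bool:
    # Half-open intervals [offset, end) overlap iff each starts before the other ends.
    return a.offset < b.end and b.offset < a.end


class InFlightWrites:
    """Hypothetical tracker: admit a write only if it overlaps nothing ahead of it."""

    def __init__(self):
        self._in_flight = []
        self._waiting = []

    def submit(self, w: WriteRange) -> bool:
        # Checking the waiting list too keeps overlapping writes in arrival order.
        if any(ranges_overlap(w, o) for o in self._in_flight + self._waiting):
            self._waiting.append(w)
            return False
        self._in_flight.append(w)
        return True

    def complete(self, w: WriteRange):
        # Retire a finished write and release waiters that are no longer blocked.
        self._in_flight.remove(w)
        released, still_waiting = [], []
        for cand in self._waiting:
            blockers = self._in_flight + released + still_waiting
            if any(ranges_overlap(cand, o) for o in blockers):
                still_waiting.append(cand)
            else:
                released.append(cand)
        self._in_flight.extend(released)
        self._waiting = still_waiting
        return released


tracker = InFlightWrites()
first = WriteRange(0, 4096)
second = WriteRange(0, 4096)      # same region, e.g. a concurrent O_SYNC write
print(tracker.submit(first))      # admitted
print(tracker.submit(second))     # held back until `first` completes
print(tracker.complete(first) == [second])
```

[With a check of this shape the serialization would no longer depend on
write-behind being loaded or on the writes being written-behind.]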
                          

                          On Mon, Feb 11, 2013
                            at 10:07 PM, Anand Avati <anand.avati@xxxxxxxxx>
                            wrote:

                            
                                On Mon, Feb 11, 2013 at 9:32 PM,
                                  Pranith Kumar K <pkarampu@xxxxxxxxxx>
                                  wrote:

                                  
                                     hi,

                                      Please note that this is a case in
                                      theory and I did not run into such a
                                      situation, but I feel it is
                                      important to address this. 

                                      Configuration with "Eager-lock on"
                                      and "write-behind off" should not
                                      be allowed as it leads to lock
                                      synchronization problems which
                                      lead to data inconsistency among
                                      replicas in nfs.

                                      Let's say bricks b1, b2 are in
                                      replication.

                                      Gluster Nfs server uses 1
                                      anonymous fd to perform all
                                      write-fops. If eager-lock is
                                      enabled in afr, the fd's address
                                      is used as the lock-owner, which
                                      will be the same for all
                                      write-fops, so there will never
                                      be any inodelk contention. If
                                      write-behind is disabled, there
                                      can be writes that overlap. (Does
                                      nfs make sure that the ranges
                                      don't overlap?)

                                      
                                      Now imagine the following
                                      scenario:

                                      Let's say w1, w2 are 2 write fops
                                      on the same offset and length, w1
                                      with all '0's and w2 with all '1's.
                                      If these 2 write fops are executed
                                      in 2 different threads, the order of
                                      arrival of write fops on b1 can be
                                      w1, w2 whereas on b2 it is w2, w1,
                                      leading to data inconsistency
                                      between the two replicas. The lock
                                      contention will not happen as both
                                      the lk-owner and the transport are
                                      the same for these 2 fops.
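[Editorial sketch of the w1/w2 scenario above. It is a toy model only: each
brick is a plain byte buffer, apply_writes() is a hypothetical helper, and the
two arrival orders are fixed by hand rather than produced by real threads.]

```python
def apply_writes(size, writes):
    """Apply (offset, data) writes in the given order to a zeroed buffer."""
    brick = bytearray(size)
    for offset, data in writes:
        brick[offset:offset + len(data)] = data
    return bytes(brick)


w1 = (0, b"0" * 8)  # all '0's
w2 = (0, b"1" * 8)  # all '1's, same offset and length as w1

b1 = apply_writes(8, [w1, w2])  # b1 sees w1, then w2
b2 = apply_writes(8, [w2, w1])  # b2 sees w2, then w1

print(b1, b2, b1 == b2)
```

[The last write wins on each brick, so b1 ends up holding w2's data and b2
holds w1's data, and nothing in the lock path notices the difference.]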

                                    
                                Write-behind has two functions - a)
                                  performing operations in the
                                  background and b) serializing
                                  overlapping operations.
                                

                                While the problem does exist, the
                                  specifics are different from what you
                                  describe. Since all writes coming in
                                  from NFS will always use the same
                                  anonymous FD, two
                                  near-in-time/overlapping writes will
                                  never contend with inodelk() but
                                  instead the second write will inherit
                                  the lock and changelog from the first.
                                  In either case, it is a problem.
                                
                                   
                                     We can add a check
                                      in glusterd for volume set to
                                      disallow such configuration, BUT
                                      by default write-behind is off in
                                      nfs graph and by default
                                      eager-lock is on. So we should
                                      either turn on write-behind for
                                      nfs or turn off eager-lock by
                                      default.

                                      
                                      Could you please suggest how to
                                      proceed with this if you agree
                                      that I did not miss any important
                                      detail that makes this theory
                                      invalid.
                                  
                                  
                                Loading the write-behind
                                  xlator in the NFS graph looks like a
                                  simpler solution. eager-locking is
                                  crucial for replicated NFS write
                                  performance.
                                
                                    
                                    Avati
                                  
                            
                    Shall we disable eager-lock for files opened with
                    O_SYNC, for now?
                
                
            Bad news: the problem is slightly worse than just this.
              Even with non-O_SYNC writes, there is a window in
              write-behind: if a second overlapping write request
              arrives so close to the first that wb_enqueue() of the
              second happens after wb_enqueue() of the first write,
              but before any unwind() following the first
              wb_enqueue() (i.e., wb_inode->gen is not bumped), then
              the two write requests can be wound down together to eager
              lock.
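[Editorial sketch of the window described above. This is a toy model and not
the write-behind source: ToyWbInode and its methods are hypothetical, and it
ASSUMES that conflict detection treats only requests with a strictly lower
generation as predecessors, with the generation bumped on unwind. Only the
names wb_enqueue(), unwind() and wb_inode->gen come from the mail itself.]

```python
class ToyWbInode:
    """Toy stand-in for a write-behind inode context (assumed behaviour)."""

    def __init__(self):
        self.gen = 0
        self.pending = []

    def enqueue(self, offset, length):
        # Each request is stamped with the generation current at enqueue time.
        req = {"offset": offset, "length": length, "gen": self.gen}
        self.pending.append(req)
        return req

    def unwind(self):
        # Assumption: acknowledging a write to the caller bumps the generation.
        self.gen += 1

    def has_conflict(self, req):
        for other in self.pending:
            if other is req:
                continue
            if other["gen"] >= req["gen"]:
                continue  # not considered a predecessor of req
            if (other["offset"] < req["offset"] + req["length"]
                    and req["offset"] < other["offset"] + other["length"]):
                return True
        return False


# Racy case: both enqueues happen before any unwind, so gen is identical.
racy = ToyWbInode()
r1 = racy.enqueue(0, 4096)
r2 = racy.enqueue(0, 4096)
print(racy.has_conflict(r2))   # no conflict seen: both can be wound together

# Ordered case: the first write is unwound before the second is enqueued.
ordered = ToyWbInode()
o1 = ordered.enqueue(0, 4096)
ordered.unwind()
o2 = ordered.enqueue(0, 4096)
print(ordered.has_conflict(o2))  # conflict seen: o2 is held behind o1
```

[Under that assumption, two overlapping requests stamped with the same
generation never see each other as conflicting, which is the case where both
reach eager-lock at once.]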
            
                
        But this has a simple fix - http://review.gluster.org/4550.
          Disabling eager-locking for O_SYNC files is a bad idea. We
          absolutely want eager-locking for O_SYNC files. Thinking
          more..
        

        Avati
      
    
    Why is disabling eager-lock for O_SYNC files a bad idea? It is
    acceptable to sacrifice a bit of performance for O_SYNC, isn't it?

    
    Pranith.