Re: Eager-lock and nfs graph generation

Pranith Kumar K <pkarampu@xxxxxxxxxx> · Tue, 26 Feb 2013 12:20:27 +0530



    On 02/20/2013 11:53 AM, Anand Avati wrote:

    
    Please check http://review.gluster.org/4551.
      This should fix all the known write-behind/eager-lock interaction
      gaps. On top of this patch, you can now set a bit in the 'flags'
      of the writev fop coming out of write-behind, and look for it in AFR
      to be sure that you have the 'protection layer' of write-behind
      offering coverage against concurrent writes. With this you can
      actually eliminate all the glusterd/volgen crud of implementing
      dependencies between the two options.
      
        
      Avati

        
    The flags parameter in writev comes from the fuse/nfs xlators. Is it OK
    if we use xdata instead of flags to convey that write-behind took
    care of overlaps?
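
To make the two options concrete, here is a minimal sketch of both ways of carrying the marker from write-behind down to AFR. It is a toy model, not the GlusterFS API: the bit value `WB_OVERLAP_GUARD_FLAG`, the key `WB_OVERLAP_GUARD_KEY` and all three function names are assumptions made up for illustration.

```python
# Hypothetical names throughout; none of this is taken from the GlusterFS source.
WB_OVERLAP_GUARD_FLAG = 1 << 30                      # assumed spare bit in 'flags'
WB_OVERLAP_GUARD_KEY = "write-behind-overlap-guard"  # assumed xdata key


def mark_via_flags(flags):
    """write-behind sets a private bit on the flags value it forwards."""
    return flags | WB_OVERLAP_GUARD_FLAG


def mark_via_xdata(xdata):
    """write-behind adds a key to a copy of the per-call dictionary."""
    marked = dict(xdata or {})
    marked[WB_OVERLAP_GUARD_KEY] = True
    return marked


def afr_sees_guard(flags, xdata):
    """AFR treats either marker as proof that write-behind sits above it."""
    in_flags = bool(flags & WB_OVERLAP_GUARD_FLAG)
    in_xdata = bool((xdata or {}).get(WB_OVERLAP_GUARD_KEY))
    return in_flags or in_xdata
```

With the flags variant the marker shares a namespace with whatever the fuse/nfs layer passed in, while the xdata variant keeps it in a separate dictionary.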

    
    Pranith

    
        On Tue, Feb 19, 2013 at 7:20 PM, Anand Avati <anand.avati@xxxxxxxxx> wrote:

          
                On Tue, Feb 19, 2013 at 6:11 PM, Pranith Kumar K <pkarampu@xxxxxxxxxx> wrote:

                  
                          On 02/20/2013 07:03 AM, Anand Avati wrote:

                          
                            On Tue, Feb 19, 2013 at 5:12 PM, Anand Avati <anand.avati@xxxxxxxxx> wrote:

                               
                                    On Tue, Feb 19, 2013 at 3:59 AM, Pranith Kumar K <pkarampu@xxxxxxxxxx> wrote:

                                      
                                              On 02/19/2013 11:26
                                                AM, Anand Avati wrote:

                                              
                                                Thinking over this, looks like there is a problem!
                                                Write-behind guarantees: That a second write request
                                                  arriving after the acknowledgement of a first
                                                  overlapping request (whether written-behind or
                                                  otherwise) will be guaranteed to be fulfilled in the
                                                  backend in the same order (i.e., the second
                                                  overlapping request will be "serialized" behind the
                                                  first one in the fulfillment process).
                                                Eager-lock requirement: That write-behind will send
                                                  no two write requests on an overlapping region at
                                                  the same time.
                                                The requirement-set and guarantee-set have a big
                                                  overlap, but the requirement-set is not a subset.
                                                This is because of O_SYNC writes. write-behind
                                                  performs write-serialization at fulfillment only for
                                                  written-behind requests (which get covered under the
                                                  conflict detection code during liability
                                                  fulfillment). However, if two threads (or apps)
                                                  issue overlapping O_SYNC writes to the same region
                                                  at approximately the same time, then write-behind
                                                  will let both of them go by without any kind of
                                                  serialization, into eager lock, violating the
                                                  assumptions!
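
The gap between the guarantee and the requirement can be shown with a toy model. This is only a sketch of the behaviour as described in this message, not write-behind's real code: `Write`, `overlaps`, `wind_first_pass` and `violates_eager_lock` are illustrative names, and the model assumes O_SYNC writes skip conflict detection entirely while written-behind writes are held back behind any overlapping request already in flight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    name: str
    offset: int
    length: int
    o_sync: bool


def overlaps(a, b):
    """True if the two byte ranges share at least one byte."""
    return a.offset < b.offset + b.length and b.offset < a.offset + a.length


def wind_first_pass(requests):
    """Return the writes that are wound down together in the first pass."""
    in_flight = []
    for req in requests:
        if req.o_sync:
            # Assumed per the message: O_SYNC writes pass straight through.
            in_flight.append(req)
        elif not any(overlaps(req, other) for other in in_flight):
            in_flight.append(req)
        # Otherwise the written-behind request waits for the conflict to clear.
    return in_flight


def violates_eager_lock(in_flight):
    """Eager-lock requirement: no two in-flight writes may overlap."""
    return any(
        overlaps(a, b)
        for i, a in enumerate(in_flight)
        for b in in_flight[i + 1:]
    )


sync_pair = [Write("w1", 0, 4096, True), Write("w2", 0, 4096, True)]
async_pair = [Write("w1", 0, 4096, False), Write("w2", 2048, 4096, False)]
print(violates_eager_lock(wind_first_pass(sync_pair)))   # True
print(violates_eager_lock(wind_first_pass(async_pair)))  # False
```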
                                                I'm wondering if it is a safer idea to implement
                                                  overlap checks within eager-lock code itself rather
                                                  than depend on write-behind :|
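
One way such an overlap check could look, independent of write-behind, is a small tracker of byte ranges currently in flight. This is a hypothetical sketch: `InFlightRanges`, `try_begin` and `finish` are invented names and do not correspond to anything in AFR.

```python
class InFlightRanges:
    """Tracks half-open byte ranges [start, end) of writes in progress."""

    def __init__(self):
        self._active = []

    def try_begin(self, offset, length):
        """Admit the write only if it overlaps no write already in flight."""
        start, end = offset, offset + length
        for s, e in self._active:
            if start < e and s < end:
                return False  # caller must queue the write and retry later
        self._active.append((start, end))
        return True

    def finish(self, offset, length):
        """Remove a completed write so that waiting writes can proceed."""
        self._active.remove((offset, offset + length))
```

A write that is refused would have to be parked and retried once the conflicting write calls `finish`; that queueing is left out to keep the sketch short.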
                                                Avati
                                                

                                                On Mon, Feb 11, 2013 at 10:07 PM, Anand Avati <anand.avati@xxxxxxxxx> wrote:

                                                  
                                                      On Mon, Feb
                                                        11, 2013 at 9:32
                                                        PM, Pranith
                                                        Kumar K <pkarampu@xxxxxxxxxx>
                                                        wrote:

                                                        
                                                          hi,

                                                          Please note that this is a case in theory and I
                                                          did not run into such a situation, but I feel it
                                                          is important to address this.

                                                          A configuration with "eager-lock on" and
                                                          "write-behind off" should not be allowed, as it
                                                          leads to lock synchronization problems which
                                                          lead to data inconsistency among replicas in nfs.

                                                          Let's say bricks b1, b2 are in replication.

                                                          Gluster NFS server uses 1 anonymous fd to perform
                                                          all write-fops. If eager-lock is enabled in afr,
                                                          the fd's address is used as the lock-owner, which
                                                          will be the same for all write-fops, so there
                                                          will never be any inodelk contention. If
                                                          write-behind is disabled, there can be writes
                                                          that overlap. (Does nfs make sure that the ranges
                                                          don't overlap?)
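
Why a shared lock-owner removes contention can be illustrated with a toy range-lock table. This is a simplified model, not the inodelk implementation: the class and method names are hypothetical, and the only rule it assumes is the one stated above, that requests from the same owner never conflict with each other.

```python
class ToyInodeLockTable:
    """Grants byte-range locks; only different owners can conflict."""

    def __init__(self):
        self._held = []  # list of (owner, start, end), end exclusive

    def try_lock(self, owner, offset, length):
        start, end = offset, offset + length
        for held_owner, s, e in self._held:
            if held_owner != owner and start < e and s < end:
                return False  # overlapping lock held by a different owner
        self._held.append((owner, start, end))
        return True


table = ToyInodeLockTable()
anon_fd = "anon-fd"  # stands in for the address of the single anonymous fd
print(table.try_lock(anon_fd, 0, 4096))     # True
print(table.try_lock(anon_fd, 0, 4096))     # True: same owner, no contention
print(table.try_lock("other-fd", 0, 4096))  # False: a different owner blocks
```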

                                                          
                                                          Now imagine the following scenario:

                                                          Let's say w1, w2 are 2 write fops on the same
                                                          offset and length, w1 with all '0's and w2 with
                                                          all '1's. If these 2 write fops are executed in 2
                                                          different threads, the order of arrival of write
                                                          fops on b1 can be w1, w2 whereas on b2 it is w2,
                                                          w1, leading to data inconsistency between the two
                                                          replicas. The lock contention will not happen as
                                                          both lk-owner and transport are the same for
                                                          these 2 fops.
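
The scenario can be replayed in a few lines. This is a minimal sketch in which each brick is just a byte buffer and `apply_writes` is an illustrative helper; the 8-byte size is arbitrary.

```python
def apply_writes(size, writes):
    """Apply (offset, payload) writes in order to a zero-filled buffer."""
    data = bytearray(size)
    for offset, payload in writes:
        data[offset:offset + len(payload)] = payload
    return bytes(data)


w1 = (0, b"0" * 8)  # all '0's
w2 = (0, b"1" * 8)  # all '1's, same offset and length

b1 = apply_writes(8, [w1, w2])  # brick b1 sees w1, then w2
b2 = apply_writes(8, [w2, w1])  # brick b2 sees w2, then w1

print(b1)        # b'11111111'
print(b2)        # b'00000000'
print(b1 == b2)  # False: the replicas have diverged
```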

                                                          
                                                      Write-behind has two functions: a) performing
                                                        operations in the background and b) serializing
                                                        overlapping operations.
                                                      

                                                      While the problem does exist, the specifics are
                                                        different from what you describe. Since all
                                                        writes coming in from NFS will always use the
                                                        same anonymous FD, two near-in-time/overlapping
                                                        writes will never contend with inodelk() but
                                                        instead the second write will inherit the lock
                                                        and changelog from the first. In either case, it
                                                        is a problem.
                                                      
                                                         
                                                          We can add a check in glusterd for volume set to
                                                          disallow such a configuration, BUT by default
                                                          write-behind is off in the nfs graph and by
                                                          default eager-lock is on. So we should either
                                                          turn on write-behind for nfs or turn off
                                                          eager-lock by default.
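
A sketch of what such a check might decide, written as a plain function rather than glusterd code. The option keys `"eager-lock"` and `"nfs-write-behind"` are placeholder names chosen for this example, and the defaults are the ones stated above (eager-lock on, write-behind off in the nfs graph).

```python
def nfs_graph_is_safe(options):
    """Reject 'eager-lock on' combined with 'write-behind off' for nfs."""
    eager_lock = options.get("eager-lock", True)          # default: on
    write_behind = options.get("nfs-write-behind", False)  # default: off
    return write_behind or not eager_lock


print(nfs_graph_is_safe({}))                          # False: the defaults clash
print(nfs_graph_is_safe({"nfs-write-behind": True}))  # True
print(nfs_graph_is_safe({"eager-lock": False}))       # True
```

The first call shows the point made above: with both defaults left alone, the check would reject the out-of-the-box configuration, so one of the two defaults has to change.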

                                                          
                                                          Could you please suggest how to proceed with
                                                          this, if you agree that I did not miss any
                                                          important detail that makes this theory invalid?
                                                        
                                                        
                                                      It seems loading the write-behind xlator in the
                                                        NFS graph is the simpler solution. Eager-locking
                                                        is crucial for replicated NFS write performance.
                                                      
                                                          
                                                          Avati
                                                        
                                                  
                                          Shall we disable eager-lock
                                          for files opened with O_SYNC,
                                          for now?
                                      
                                      
                                  Bad news: the problem is slightly
                                    worse than just this. Even with
                                    non-O_SYNC writes, there is a
                                    possibility in write-behind where a
                                    second overlapping write request comes
                                    so close to the first that
                                    wb_enqueue() of the second one happens
                                    after wb_enqueue() of the first write,
                                    but before any unwind() after the
                                    first wb_enqueue() (i.e. wb_inode->gen
                                    is not bumped). In that case the two
                                    write requests can be wound down
                                    together to eager lock.
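
The timing window can be modelled with a generation counter. This sketch is built only from the description in this message and is not the write-behind source: `ToyWbInode` and its methods are invented, and the conflict rule (a request waits behind an overlapping one only if it was enqueued with a strictly higher generation) is an assumption about how the counter is used.

```python
class ToyWbInode:
    """Toy stand-in for a write-behind inode with a generation counter."""

    def __init__(self):
        self.gen = 0
        self.queue = []  # list of (name, offset, length, gen_at_enqueue)

    def enqueue(self, name, offset, length):
        # Stamp the request with the generation current at enqueue time.
        self.queue.append((name, offset, length, self.gen))

    def unwind(self):
        # Acknowledging a write to the caller bumps the generation.
        self.gen += 1

    def wound_together(self):
        """Names of the requests wound down in the first pass."""
        wound = []
        for name, off, length, gen in self.queue:
            blocked = any(
                gen > g and off < o + l and o < off + length
                for _, o, l, g in wound
            )
            if not blocked:
                wound.append((name, off, length, gen))
        return [n for n, *_ in wound]


racy = ToyWbInode()
racy.enqueue("w1", 0, 4096)
racy.enqueue("w2", 0, 4096)   # arrives before any unwind: same generation
print(racy.wound_together())  # ['w1', 'w2']

ordered = ToyWbInode()
ordered.enqueue("w1", 0, 4096)
ordered.unwind()                 # w1 acknowledged, generation bumped
ordered.enqueue("w2", 0, 4096)
print(ordered.wound_together())  # ['w1']
```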
                                  
                                      
                              But this has a simple fix - http://review.gluster.org/4550.
                                Disabling eager-locking for O_SYNC files
                                is a bad idea. We absolutely want
                                eager-locking for O_SYNC files. Thinking
                                more..
                              

                              Avati
                            
                          
                      Why is disabling eager-lock for O_SYNC files a bad
                      idea? It is acceptable to sacrifice a bit of
                      performance for O_SYNC isn't it?
                  
                  
               s/bit/quite a bit/. For O_SYNC writes, eager locking
                is the only saving grace in performance as write-behind
                stays out of the way completely. We would need overlap
                checks either in AFR or write-behind for O_SYNC writes.
              
                  
                  Avati