Re: Eager-lock and nfs graph generation

Pranith Kumar K <pkarampu@xxxxxxxxxx> · Tue, 19 Feb 2013 17:29:40 +0530



    On 02/19/2013 11:26 AM, Anand Avati
      wrote:

    
      Thinking over this, looks like there is a problem!
      Write-behind guarantees: That a second write request
        arriving after the acknowledgement of a first overlapping
        request (whether written-behind or otherwise) will be guaranteed
        to be fulfilled in the backend in the same order (i.e., the
        second overlapping request will be "serialized" behind the first
        one in the fulfillment process)
      Eager-lock requirement: That write-behind will send
        no two write requests on an overlapping region at the same time.
      The requirement-set and guarantee-set have a big
        overlap, but the requirement-set is not a subset.
      This is because of O_SYNC writes. write-behind
        performs write-serialization at fulfillment only for written
        behind requests (which get covered under the conflict detection
        code during liability fulfillment). However, if two threads (or
        apps) issue overlapping O_SYNC writes to the same region at
        approximately the same time, then write-behind will let both of them go by
        without any kind of serialization, into eager lock, violating
        the assumptions!
      I'm wondering if it is a safer idea to implement
        overlap checks within eager-lock code itself rather than depend
        on write-behind :|
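
A minimal sketch of what such an overlap check could look like, assuming byte ranges given as (offset, length). `OverlapGate` and its methods are hypothetical names for illustration and are not taken from the eager-lock or write-behind code. A write that overlaps one already in flight (or one queued ahead of it) is held back until the earlier write completes:

```python
from collections import deque

def overlaps(a_off, a_len, b_off, b_len):
    """True if byte ranges [a_off, a_off+a_len) and [b_off, b_off+b_len) intersect."""
    return a_off < b_off + b_len and b_off < a_off + a_len

class OverlapGate:
    """Hypothetical helper: holds back a write that overlaps an earlier one."""

    def __init__(self):
        self.in_flight = []      # (offset, length) of writes already sent on
        self.waiting = deque()   # writes held back, in arrival order

    def submit(self, offset, length):
        """Return True if the write may be sent now, False if it was queued."""
        pending = list(self.in_flight) + list(self.waiting)
        if any(overlaps(offset, length, o, l) for o, l in pending):
            self.waiting.append((offset, length))
            return False
        self.in_flight.append((offset, length))
        return True

    def complete(self, offset, length):
        """Mark a write finished; return the queued writes that may now be sent."""
        self.in_flight.remove((offset, length))
        released = []
        still_waiting = deque()
        for w in self.waiting:
            # A queued write stays blocked by in-flight writes and by
            # earlier queued writes, so arrival order is preserved.
            blockers = self.in_flight + list(still_waiting)
            if any(overlaps(w[0], w[1], o, l) for o, l in blockers):
                still_waiting.append(w)
            else:
                self.in_flight.append(w)
                released.append(w)
        self.waiting = still_waiting
        return released

gate = OverlapGate()
first = gate.submit(0, 4096)      # True: nothing in flight yet
second = gate.submit(0, 4096)     # False: overlaps the first write, so it is queued
released = gate.complete(0, 4096) # [(0, 4096)]: the queued write is released
```

Keeping the queue in arrival order means the second overlapping write is always fulfilled behind the first, whether or not either of them is an O_SYNC write.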
      Avati
      

      On Mon, Feb 11, 2013 at 10:07 PM, Anand
        Avati <anand.avati@xxxxxxxxx>
        wrote:

        
            On Mon, Feb 11, 2013 at 9:32 PM, Pranith
              Kumar K <pkarampu@xxxxxxxxxx>
              wrote:

              
                 hi,

                  Please note that this is a case in theory and I did
                  not run into such situation, but I feel it is
                  important to address this. 

                  A configuration with "eager-lock on" and "write-behind
                  off" should not be allowed, as it leads to lock
                  synchronization problems, which in turn lead to data
                  inconsistency among replicas in NFS.

                  Let's say bricks b1 and b2 are in replication.

                  The Gluster NFS server uses 1 anonymous fd to perform all
                  write-fops. If eager-lock is enabled in afr, the
                  fd's address is used as the lock-owner, which will be
                  the same for all write-fops, so there will never be any
                  inodelk contention. If write-behind is disabled, there
                  can be writes that overlap. (Does NFS make sure that the
                  ranges don't overlap?)

                  
                  Now imagine the following scenario:

                  Let's say w1 and w2 are 2 write fops on the same offset
                  and length, w1 with all '0's and w2 with all '1's. If
                  these 2 write fops are executed in 2 different
                  threads, the order of arrival of the write fops on b1
                  can be w1, w2 whereas on b2 it is w2, w1, leading to data
                  inconsistency between the two replicas. The lock
                  contention will not happen, as both the lk-owner and the
                  transport are the same for these 2 fops.
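
The scenario above can be sketched as a toy model, assuming each brick is just a byte buffer and each write fop is an (offset, data) pair. `apply_writes` is a hypothetical helper for illustration, not Gluster code:

```python
def apply_writes(size, writes):
    """Apply (offset, data) writes in order to a zero-filled buffer of `size` bytes."""
    buf = bytearray(size)
    for offset, data in writes:
        buf[offset:offset + len(data)] = data
    return bytes(buf)

# w1 and w2 target the same offset and length, as in the scenario.
w1 = (0, b"0" * 4)
w2 = (0, b"1" * 4)

b1 = apply_writes(4, [w1, w2])  # brick b1 sees w1, then w2
b2 = apply_writes(4, [w2, w1])  # brick b2 sees w2, then w1

print(b1, b2, b1 == b2)  # b'1111' b'0000' False
```

With nothing serializing the two fops, each brick keeps whichever write arrived last, so the replicas diverge.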

                
            Write-behind has two functions - a) performing
              operations in the background and b) serializing
              overlapping operations.
            

            While the problem does exist, the specifics are
              different from what you describe. Since all writes coming
              in from NFS will always use the same anonymous FD, two
              near-in-time/overlapping writes will never contend with
              inodelk() but instead the second write will inherit the
              lock and changelog from the first. In either case, it is a
              problem.
            
               
                 We can add a
                  check in glusterd for volume set to disallow such a
                  configuration, BUT by default write-behind is off in
                  the NFS graph and by default eager-lock is on. So we
                  should either turn on write-behind for NFS or turn off
                  eager-lock by default.
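
A rough sketch of the kind of check described here, assuming the options arrive as a dict of booleans. The function name and the dict keys are hypothetical and are not actual glusterd code or option names. The defaults mirror the ones stated above:

```python
def check_volume_options(options):
    """Return an error string for the unsafe combination, or None if acceptable.

    `options` is a hypothetical dict of booleans; the keys are illustrative,
    not actual glusterd option names.
    """
    eager_lock = options.get("eager-lock", True)        # on by default, per the thread
    write_behind = options.get("write-behind", False)   # off by default in the NFS graph, per the thread
    if eager_lock and not write_behind:
        return "eager-lock needs write-behind to serialize overlapping writes"
    return None

default_result = check_volume_options({})  # the defaults themselves are rejected
```

Because the defaults are rejected, such a check would only be usable after one of the two defaults is changed.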

                  
                  Could you please suggest how to proceed with this if
                  you agree that I did not miss any important detail
                  that makes this theory invalid.
              
              
            Loading the write-behind xlator in the NFS graph
               looks like the simpler solution. Eager-locking is crucial
              for replicated NFS write performance.
            
                
                Avati
              
        
    Shall we disable eager-lock for files opened with O_SYNC, for now?
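
A minimal sketch of that rule, assuming the open flags of the file are available when the write is issued. `use_eager_lock` is a hypothetical name and not an existing afr function:

```python
import os

def use_eager_lock(open_flags, eager_lock_enabled=True):
    """Hypothetical policy: skip eager-lock when the file was opened with O_SYNC."""
    # Compare against the full O_SYNC mask; on Linux O_SYNC includes the
    # O_DSYNC bit, so a plain bit test would also match O_DSYNC opens.
    if (open_flags & os.O_SYNC) == os.O_SYNC:
        return False
    return eager_lock_enabled

sync_case = use_eager_lock(os.O_WRONLY | os.O_SYNC)  # False
plain_case = use_eager_lock(os.O_WRONLY)             # True
```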

    
    Pranith