David:
"Yes, all the timeouts are derived from the io_timeout and are dictated by the recovery requirements and the algorithm the host_id leases are based on: "Light-Weight Leases for Storage-Centric Coordination" by Gregory Chockler and Dahlia Malkhi. Here are the actual equations copied from sanlock_internal.h.

"delta" refers to host_id leases that take a long time to acquire at startup
"free" corresponds to starting up after a clean shutdown
"held" corresponds to starting up after an unclean shutdown

You should find that with 30 sec io timeout these come out to 1 min / 4 min which you see when starting after a clean / unclean shutdown."
Since I configured an io_timeout of 30s in sanlock (SANLOCKOPTS="-R 1 -o 30"), the delay at sanlock startup is defined by the delta_acquire_held_min variable, which is calculated as:

int max = host_dead_seconds;
if (delta_large_delay > max)
        max = delta_large_delay;
int delta_acquire_held_min = max;
So max is host_dead_seconds, which is calculated as:

int host_dead_seconds = id_renewal_fail_seconds + WATCHDOG_FIRE_TIMEOUT;

And id_renewal_fail_seconds is:

int id_renewal_fail_seconds = 8 * io_timeout_seconds;

WATCHDOG_FIRE_TIMEOUT = 60

So that makes 8 * 30 + 60, or a total of 300s before the lock can be acquired. And that's exactly the time that is shown in libvirtd.log before the lockspace is registered.
When a proper reboot is done, or when sanlock/libvirtd is just restarted, delta_acquire_free_min defines the delay:

int delta_short_delay = 2 * io_timeout_seconds;
int delta_acquire_free_min = delta_short_delay;

Which is confirmed by the 60s delay in the libvirtd.log file:
10:33:54.097: 7983: debug : virLockManagerSanlockInit:267 : version=1000000 configFile=/etc/libvirt/qemu-sanlock.conf flags=0
10:34:55.111: 7983: debug : virLockManagerSanlockSetupLockspace:247 : Lockspace /var/lib/libvirt/sanlock/__LIBVIRT__DISKS__ has been registered
So the whole delay is caused by the io_timeout, which is set to 30s because the lockspace is on GFS2, and the GFS2 volume can be locked for some time while a node gets fenced and its journal is replayed. Depending on clean/unclean restarts, the delay differs: delta_acquire_held_min is based on host_dead_seconds because sanlock wants to be sure that the lock won't be reacquired before the host is actually dead.
David told me that sanlock is supposed to use a block device. I wonder if this makes sense when used on top of GFS2, which already has its own locking and safety mechanisms... It looks to me like it just adds the delay without any reason in this case. Perhaps delta_acquire_held_min should be the same as delta_acquire_free_min, because we don't need this safety delay on top of GFS2. For NFS, it may be a whole other story...
Best regards,
Frido