Re: Strange file corruption

A-1) shut down node #1 (the first that is about to be upgraded)
A-2) remove node #1 from the Proxmox cluster (pvevm delnode "metal1")
A-3) remove node #1 from the Gluster volume/cluster (gluster volume remove-brick ... && gluster peer detach "metal1")
A-4) install Debian Jessie on node #1, overwriting all data on the HDD - with the same network settings and hostname as before
A-5) install Proxmox 4.0 on node #1
A-6) install Gluster on node #1 and add it back to the Gluster volume (gluster volume add-brick ...) => shared storage will be complete again (spanning 3.4 and 4.0 nodes)
A-7) configure the Gluster volume as shared storage in Proxmox 4 (node #1)
A-8) configure the external Backup storage on node #1 (Proxmox 4)
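For reference, the Gluster side of steps A-3 and A-6 might look something like the commands below. This is only a sketch: the volume name "gv0", the brick path "/data/brick1", and the replica counts are placeholders, not taken from the actual setup.

```shell
# Step A-3: take node #1 out of the Gluster volume/cluster.
# On a replica-3 volume, shrink to replica 2 while removing the brick:
gluster volume remove-brick gv0 replica 2 metal1:/data/brick1 force
gluster peer detach metal1

# Step A-6: after reinstalling, re-probe the node and grow back to
# replica 3. Run these from a surviving node (metal2 or metal3):
gluster peer probe metal1
gluster volume add-brick gv0 replica 3 metal1:/data/brick1

# Trigger self-heal to copy the data back, and watch progress:
gluster volume heal gv0 full
gluster volume heal gv0 info
```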

Was the data on the gluster brick deleted as part of step 4? When you remove the brick, gluster will no longer track pending changes for that brick. If you add it back in with stale data but matching gfids, you would have two clean bricks with mismatching data. Did you have to use "add-brick...force"?
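One way to check for that scenario is to compare the extended attributes of the same file directly on each brick (brick path and filename below are placeholders):

```shell
# Run on each node, against the brick path itself (not the FUSE mount).
# trusted.gfid must match across bricks for the same file; non-zero
# trusted.afr.* changelog counters show which brick Gluster believes
# holds pending (unhealed) changes.
getfattr -d -m . -e hex /data/brick1/images/100/vm-100-disk-1.qcow2
```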


On 12/09/2015 06:53 AM, Udo Giacomozzi wrote:
Am 09.12.2015 um 14:39 schrieb Lindsay Mathieson:

Udo, it occurs to me that if your VMs were running on #2 & #3 and you live-migrated them to #1 prior to rebooting #2/#3, then you would indeed rapidly get progressive VM corruption.

However it wouldn't be due to the heal process, but rather the live migration with "performance.stat-prefetch" on. This always leads to qcow2 files becoming corrupted and unusable.

Nope. All VMs were running on #1, no exception.
Nodes #2 and #3 never had a VM running on them, so they were practically idle since their installation.

Basically I set up node #1, including all VMs.
Then I installed nodes #2 and #3, configured the Proxmox and Gluster clusters, and waited quite some time until Gluster had synced up nodes #2 and #3 (healing).
From then on I rebooted nodes 2 & 3, but in theory these nodes never had to do any writes to the Gluster volume at all.

If you're interested, you can read about my upgrade strategy in this Proxmox forum post: http://forum.proxmox.com/threads/24990-Upgrade-3-4-HA-cluster-to-4-0-via-reinstallation-with-minimal-downtime?p=125040#post125040

Also, it seems rather strange to me that practically all ~15 VMs (!) suffered data corruption. It's as if Gluster considered node #2 or #3 to be ahead and "healed" in the wrong direction. I don't know...

BTW, once I understood what was going on, with the problematic "healing" still in progress, I was able to overwrite the bad images (still active on #1) using the standard Proxmox backup/restore, and Gluster handled it correctly.


Anyway, I really love the simplicity of Gluster (setting up and maintaining a cluster is extremely easy), but these healing issues are causing me some headaches... ;-)

Udo



_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
