Issues removing then adding a brick to a replica volume (Gluster 3.7.6)

I've been running through my eternal testing regime and experimenting with removing/adding bricks - to me, a necessary part of volume maintenance for dealing with failed disks. The datastore hosts VMs and all of the following was done live. Sharding is active with a 512MB shard size.
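
For reference, sharding was enabled with the standard shard options, roughly like this (exact commands recreated from memory):

gluster volume set datastore1 features.shard on
gluster volume set datastore1 features.shard-block-size 512MB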

So I started off with a replica 3 volume

// recreated from memory
Volume Name: datastore1
Type: Replicate
Volume ID: bf882533-f1a9-40bf-a13e-d26d934bfa8b
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: vnb.proxmox.softlog:/vmdata/datastore1
Brick2: vng.proxmox.softlog:/vmdata/datastore1
Brick3: vna.proxmox.softlog:/vmdata/datastore1


I remove a brick with:

gluster volume remove-brick datastore1 replica 2  vng.proxmox.softlog:/vmdata/datastore1 force

so we end up with:

Volume Name: datastore1
Type: Replicate
Volume ID: bf882533-f1a9-40bf-a13e-d26d934bfa8b
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: vna.proxmox.softlog:/vmdata/datastore1
Brick2: vnb.proxmox.softlog:/vmdata/datastore1


All well and good. No heal issues, VMs running OK.

Then I clean the brick off the vng host:

rm -rf /vmdata/datastore1
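
(Side note: if I had kept the directory instead of deleting it, my understanding is the old brick signature would need clearing before re-use, something like:

setfattr -x trusted.glusterfs.volume-id /vmdata/datastore1
setfattr -x trusted.gfid /vmdata/datastore1
rm -rf /vmdata/datastore1/.glusterfs

Since the whole directory is gone here, that shouldn't matter.)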


I then add the brick back with:

gluster volume add-brick datastore1 replica 3  vng.proxmox.softlog:/vmdata/datastore1

Volume Name: datastore1
Type: Replicate
Volume ID: bf882533-f1a9-40bf-a13e-d26d934bfa8b
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: vna.proxmox.softlog:/vmdata/datastore1
Brick2: vnb.proxmox.softlog:/vmdata/datastore1
Brick3: vng.proxmox.softlog:/vmdata/datastore1


This recreates the brick directory "datastore1". Unfortunately this is where things start to go wrong :( Heal info:

gluster volume heal datastore1 info
Brick vna.proxmox.softlog:/vmdata/datastore1
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.57
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.5
Number of entries: 2

Brick vnb.proxmox.softlog:/vmdata/datastore1
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.5
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.57
Number of entries: 2

Brick vng.proxmox.softlog:/vmdata/datastore1
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.1
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.6
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.15
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.18
/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.5

It's my understanding that there shouldn't be any heal entries on vng, as that is where all the shards should be sent *to*.
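
I haven't dug into the xattrs yet, but presumably something like the following (using one of the shards from the heal output above) would show the trusted.afr.* pending counters and which copy is considered dirty:

getfattr -d -m . -e hex /vmdata/datastore1/.shard/d6aad699-d71d-4b35-b021-d35e5ff297c4.5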

Also, running qemu-img check on the hosted VM images results in an I/O error. Eventually the VMs themselves crash - I suspect this is due to individual shards being unreadable.
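
The check itself is nothing special - just qemu-img run against the image on the gluster mount, along the lines of (path is illustrative, not my exact one):

qemu-img check /mnt/pve/datastore1/images/100/vm-100-disk-1.qcow2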

Another odd behaviour: if I run a full heal on vnb, it fails with the following error:

Launching heal operation to perform full self heal on volume datastore1 has been unsuccessful

However, if I run it on vna, it succeeds.
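
The command in both cases was just the standard full heal:

gluster volume heal datastore1 full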


Lastly - if I remove the brick again, everything returns to normal immediately. Heal info shows no issues and qemu-img check returns no errors.
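
That removal is just the same command as before:

gluster volume remove-brick datastore1 replica 2 vng.proxmox.softlog:/vmdata/datastore1 force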




-- 
Lindsay Mathieson
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
