Re: Is rebalance completely broken on 3.5.3 ?


 



Hi Alessandro,
what you describe here reminds me of this issue:
http://www.spinics.net/lists/gluster-users/msg20144.html

And now that you mention it, the mess on our cluster could indeed have been triggered by an aborted rebalance.
This is a very important clue, since apparently developers were never able to reproduce the issue in the lab. I also tried to reproduce the issue on a test cluster, but never succeeded.

The example you describe below seems relatively easy to fix. A rebalance fix-layout would eventually get rid of the sticky-bit link files (---------T) on your bricks 5 and 6, and you could manually remove the files created on 10/03, as long as you also remove the corresponding link file in the .glusterfs directory on that brick.
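A rough sketch of those two steps, assuming the volume is named "home" (a guess based on your brick paths) and using the GFID from your getfattr output further down. The .glusterfs link path is derived from the GFID as `<first 2 hex chars>/<next 2>/<full dashed gfid>`:

```shell
# Sketch only; verify paths on your own cluster before removing anything.
# 1) Rebuild the layout; over time this also cleans up stray ---------T
#    link files (run on any node of the trusted pool):
#      gluster volume rebalance home fix-layout start
#
# 2) When removing a stray file by hand, also remove its hard link under
#    .glusterfs on the same brick, derived from trusted.gfid:
GFID=14a1c10e-b147-4ef2-bf72-f4c6c64a90ce
BRICK=/data/glusterfs/home/brick2
LINK="$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
echo "$LINK"
# rm "$BRICK/seviri/.forward" "$LINK"   # only after backing up a good copy
```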

I wholeheartedly agree with you that this needs the developers' urgent attention before they start working on new features. A mess like this in a distributed file system makes it unusable for production. This should never happen, never! And if it does, a rebalance should be able to detect and fix it, fast and efficiently. I also agree that the rebalance status should be more informative and give a clear idea of how long it will still take to complete. On large clusters a rebalance often takes ages and leaves the entire cluster extremely vulnerable. (Another scary operation is remove-brick, but that is another story.)

What I did in our case, and maybe this could help you too as a quick fix for the most critical directories, is to rsync to a different storage (via a mount point). rsync only copies one of each pair of duplicated files, and you can separately copy a good version of the problem files (in the case below e.g.: -rw-r--r-- 2 seviri users 68 May 26 2014 /data/glusterfs/home/brick1/seviri/.forward). But probably, as soon as you remove the files created on 10/03 (including the gluster link file in .glusterfs), the listing via your NFS mount will be restored. Try this out first with a couple of files you have backed up, to be sure.

Hope this helps!

Cheers,
Olav
On 20/03/15 12:22, Alessandro Ipe wrote:

Hi,

 

 

After launching a "rebalance" on an idle gluster system one week ago, its status told me it had scanned more than 23 million files on each of my 6 bricks. However, without knowing at least the total number of files to be scanned, this status is USELESS from an end-user perspective, because it does not let you know WHEN the rebalance could eventually complete (one day, one week, one year or never). From my point of view, the total number of files per brick could be obtained and maintained when activating quota, since the whole filesystem has to be crawled anyway...

 

After one week offline and still no clue when the rebalance would complete, I decided to stop it... Enormous mistake... It seems that rebalance cannot avoid corrupting some files. For example, on the only client mounting the gluster system, "ls -la /home/seviri" returns

ls: cannot access /home/seviri/.forward: Stale NFS file handle
ls: cannot access /home/seviri/.forward: Stale NFS file handle
-????????? ? ? ? ? ? .forward
-????????? ? ? ? ? ? .forward

while this file could perfectly be accessed before (being rebalanced) and has not been modified for at least 3 years.

 

Getting the extended attributes on the various bricks 3, 4, 5, 6 (3-4 replicate, 5-6 replicate):

Brick 3:
ls -l /data/glusterfs/home/brick?/seviri/.forward
-rw-r--r-- 2 seviri users 68 May 26 2014 /data/glusterfs/home/brick1/seviri/.forward
-rw-r--r-- 2 seviri users 68 Mar 10 10:22 /data/glusterfs/home/brick2/seviri/.forward

getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
# file: data/glusterfs/home/brick1/seviri/.forward
trusted.afr.home-client-8=0x000000000000000000000000
trusted.afr.home-client-9=0x000000000000000000000000
trusted.gfid=0xc1d268beb17443a39d914de917de123a

# file: data/glusterfs/home/brick2/seviri/.forward
trusted.afr.home-client-10=0x000000000000000000000000
trusted.afr.home-client-11=0x000000000000000000000000
trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
trusted.glusterfs.quota.4138a9fa-a453-4b8e-905a-e02cce07d717.contri=0x0000000000000200
trusted.pgfid.4138a9fa-a453-4b8e-905a-e02cce07d717=0x00000001

 

Brick 4:
ls -l /data/glusterfs/home/brick?/seviri/.forward
-rw-r--r-- 2 seviri users 68 May 26 2014 /data/glusterfs/home/brick1/seviri/.forward
-rw-r--r-- 2 seviri users 68 Mar 10 10:22 /data/glusterfs/home/brick2/seviri/.forward

getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
# file: data/glusterfs/home/brick1/seviri/.forward
trusted.afr.home-client-8=0x000000000000000000000000
trusted.afr.home-client-9=0x000000000000000000000000
trusted.gfid=0xc1d268beb17443a39d914de917de123a

# file: data/glusterfs/home/brick2/seviri/.forward
trusted.afr.home-client-10=0x000000000000000000000000
trusted.afr.home-client-11=0x000000000000000000000000
trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
trusted.glusterfs.quota.4138a9fa-a453-4b8e-905a-e02cce07d717.contri=0x0000000000000200
trusted.pgfid.4138a9fa-a453-4b8e-905a-e02cce07d717=0x00000001

 

Brick 5:
ls -l /data/glusterfs/home/brick?/seviri/.forward
---------T 2 root root 0 Mar 18 08:19 /data/glusterfs/home/brick2/seviri/.forward

getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
# file: data/glusterfs/home/brick2/seviri/.forward
trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
trusted.glusterfs.dht.linkto=0x686f6d652d7265706c69636174652d3400

 

Brick 6:
ls -l /data/glusterfs/home/brick?/seviri/.forward
---------T 2 root root 0 Mar 18 08:19 /data/glusterfs/home/brick2/seviri/.forward

getfattr -d -m . -e hex /data/glusterfs/home/brick?/seviri/.forward
# file: data/glusterfs/home/brick2/seviri/.forward
trusted.gfid=0x14a1c10eb1474ef2bf72f4c6c64a90ce
trusted.glusterfs.dht.linkto=0x686f6d652d7265706c69636174652d3400
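(As an aside, the trusted.glusterfs.dht.linkto value is just a NUL-terminated string in hex; decoding it shows which replica pair DHT points to for the real file:

```shell
# Decode the hex xattr value; tr strips the trailing NUL byte.
printf '686f6d652d7265706c69636174652d3400' | xxd -r -p | tr -d '\0'
# -> home-replicate-4
```

so the ---------T entries on bricks 5 and 6 claim the real file lives on the home-replicate-4 pair.)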

 

Looking at the results from bricks 3 & 4 shows something weird. The file exists in two sub-brick storage directories, while it should only be found once on each brick server. Or does the issue lie in the results of bricks 5 & 6? How can I fix this, please? By the way, the split-brain tutorial only covers BASIC split-brain conditions, not complex (real-life) cases like this one. It would definitely benefit from being enriched with this one.

 

More generally, I think the concept of gluster is promising, but if basic commands (rebalance, absolutely needed after adding more storage) from its own CLI can put the system into an unstable state, I am really starting to question its suitability for a production environment. And from an end-user perspective, I do not care about new features, no matter how appealing they may be, if the basic ones are not almost totally reliable. Finally, testing gluster under high load on the brick servers (real-world conditions) would certainly give the developers insight into what is failing and what therefore needs to be fixed to improve gluster's reliability.

 

Forgive my harsh words/criticisms, but having struggled with gluster issues for two weeks now is getting on my nerves, since my colleagues cannot use the data stored on it and I cannot see when it will be back online.

 

 

Regards,

 

 

Alessandro.

 



_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users

