Re: Hundreds of duplicate files

Olav Peeters <opeeters@xxxxxxxxx> · Fri, 20 Feb 2015 22:47:07 +0100



    Thanks Joe,

      for the answers!

      
      I was not clear enough about the set up apparently.

      The Gluster cluster consists of 3 nodes with 14 bricks each. The
      bricks are formatted as xfs and mounted locally as xfs. There is
      one volume, type: Distributed-Replicate (replica 2). The
      configuration is such that bricks are mirrored on two different
      nodes.

      
      The NFS mounts which were alive but not used during the reboot
      when the problem started are from clients (2 XenServer machines
      configured as a pool, a shared storage set-up). The comparisons I
      give below are between (other) clients mounting via either
      glusterfs or NFS. The problem is similar, with the exception that
      the first listing (via ls) after a fresh mount via NFS actually
      does find the files with data. A second listing only finds the
      0-byte file with the same name.

      
      So all the 0-byte files in mode 0644 can be safely removed?

      
      Why do I see three files with the same name (and modification
      timestamp etc.) via either a glusterfs or NFS mount from a client?
      Deleting one of the three will probably not solve the issue
      either. This seems to me to be an indexing issue in the gluster
      cluster.

      
      How do I get Gluster to replicate the files correctly, only 2
      versions of the same file, not three, and on two bricks on
      different machines?

      
      Cheers,

      Olav

      
      On 20/02/15 21:51, Joe Julian wrote:

    
      On 02/20/2015 12:21 PM, Olav Peeters
        wrote:

      
        Let's take one file
          (3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd) as an example...

          On the 3 nodes where all bricks are formatted as XFS and
          mounted in /export and 272b2366-dfbf-ad47-2a0f-5d5cc40863e3 is
          the mounting point of a NFS shared storage connection from
          XenServer machines:

        
      Did I just read this correctly? Your bricks are NFS mounts? i.e.,
      GlusterFS Client <-> GlusterFS Server <-> NFS <-> XFS

      
          [root@gluster01 ~]# find
          /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*'
          -exec ls -la {} \;

          -rw-r--r--. 2 root root 44332659200 Feb 17 23:55
/export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

        
      Supposedly, this is the actual file.

      
          -rw-r--r--. 2 root root 0 Feb 18 00:51
/export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

        
      This is not a linkfile. Note it's mode 0644. How it got there with
      those permissions would be a matter of history and would require
      information that's probably lost.

      
          [root@gluster02 ~]# find
          /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*'
          -exec ls -la {} \;

          -rw-r--r--. 2 root root 44332659200 Feb 17 23:55
/export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          
          [root@gluster03 ~]# find
          /export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*'
          -exec ls -la {} \;

          -rw-r--r--. 2 root root 44332659200 Feb 17 23:55
/export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          -rw-r--r--. 2 root root 0 Feb 18 00:51
/export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

        
      Same analysis as above.

      
          3 files with data, 2 x a 0-byte file with the same name

          
          Checking the 0-byte files:

          [root@gluster01 ~]# getfattr -m . -d -e hex
/export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          getfattr: Removing leading '/' from absolute path names

          # file:
export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000

          trusted.afr.dirty=0x000000000000000000000000

          trusted.afr.sr_vol01-client-34=0x000000000000000000000000

          trusted.afr.sr_vol01-client-35=0x000000000000000000000000

          trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417

          
          [root@gluster03 ~]# getfattr -m . -d -e hex
/export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          getfattr: Removing leading '/' from absolute path names

          # file:
export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000

          trusted.afr.dirty=0x000000000000000000000000

          trusted.afr.sr_vol01-client-34=0x000000000000000000000000

          trusted.afr.sr_vol01-client-35=0x000000000000000000000000

          trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417

          
          This is not a glusterfs link file since there is no
          "trusted.glusterfs.dht.linkto", am I correct? 

        
      You are correct.
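
The check discussed above can be sketched in a few lines of Python, assuming the output of `getfattr -m . -d -e hex` has been captured as text. The helper names `parse_getfattr` and `is_dht_linkfile` are made up for this illustration; only the attribute name `trusted.glusterfs.dht.linkto` comes from the thread.

```python
def parse_getfattr(text):
    """Parse `getfattr -d -e hex` output into {attribute_name: hex_string}."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, the "# file:" header and getfattr's own notices.
        if not line or line.startswith("#") or line.startswith("getfattr:"):
            continue
        name, sep, value = line.partition("=")
        if sep:  # lines without "=" (e.g. a wrapped path) are ignored
            attrs[name] = value
    return attrs


def is_dht_linkfile(text):
    """True when the dump contains the DHT link-to attribute."""
    return "trusted.glusterfs.dht.linkto" in parse_getfattr(text)


# Attribute dump of the zero-byte file on brick14gfs01, as posted above.
sample = """\
# file: export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-34=0x000000000000000000000000
trusted.afr.sr_vol01-client-35=0x000000000000000000000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
"""
print(is_dht_linkfile(sample))  # prints False
```

For the dump posted in this thread the result is False, which matches the conclusion that the zero-byte file is not a link file.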

      
          And checking the "good" files:

          
          # file:
export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000

          trusted.afr.dirty=0x000000000000000000000000

          trusted.afr.sr_vol01-client-32=0x000000000000000000000000

          trusted.afr.sr_vol01-client-33=0x000000000000000000000000

          trusted.afr.sr_vol01-client-34=0x000000000000000000000000

          trusted.afr.sr_vol01-client-35=0x000000010000000100000000

          trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
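
To read the `trusted.afr.*` values in these dumps, here is a minimal sketch. It assumes the usual AFR changelog layout of three big-endian 32-bit counters (pending data, metadata and entry operations); the function name `decode_afr_changelog` is hypothetical.

```python
import struct


def decode_afr_changelog(hex_value):
    """Split a 12-byte AFR changelog value into its three pending counters.

    Assumption: the value holds three big-endian unsigned 32-bit integers,
    in the order data, metadata, entry.
    """
    digits = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}


# trusted.afr.sr_vol01-client-35 on brick13gfs01, the only non-zero value above.
print(decode_afr_changelog("0x000000010000000100000000"))
# prints {'data': 1, 'metadata': 1, 'entry': 0}
```

Under that assumption, the copy on brick13gfs01 records one pending data operation and one pending metadata operation against client-35, while every other changelog value in the listings is all zeros.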

          
          [root@gluster02 ~]# getfattr -m . -d -e hex
/export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          getfattr: Removing leading '/' from absolute path names

          # file:
export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000

          trusted.afr.dirty=0x000000000000000000000000

          trusted.afr.sr_vol01-client-32=0x000000000000000000000000

          trusted.afr.sr_vol01-client-33=0x000000000000000000000000

          trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417

          
          [root@gluster03 ~]# getfattr -m . -d -e hex
/export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          getfattr: Removing leading '/' from absolute path names

          # file:
export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000

          trusted.afr.dirty=0x000000000000000000000000

          trusted.afr.sr_vol01-client-40=0x000000000000000000000000

          trusted.afr.sr_vol01-client-41=0x000000000000000000000000

          trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
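
All five copies above carry the same `trusted.gfid`. As a sketch of how that value maps to an entry under a brick's `.glusterfs` directory, assuming the usual layout of `.glusterfs/<first two hex digits>/<next two>/<gfid as UUID>`, the following helper (the name `gfid_backend_path` is made up here) builds the path without touching the filesystem.

```python
import uuid
from pathlib import PurePosixPath


def gfid_backend_path(brick_root, gfid_hex):
    """Build the expected .glusterfs path for a gfid given as hex (with or without 0x)."""
    digits = gfid_hex[2:] if gfid_hex.lower().startswith("0x") else gfid_hex
    gfid = str(uuid.UUID(hex=digits))  # canonical 8-4-4-4-12 form
    return PurePosixPath(brick_root) / ".glusterfs" / gfid[:2] / gfid[2:4] / gfid


print(gfid_backend_path("/export/brick13gfs01", "0xaefd184508414a8f8408f1ab8aa7a417"))
# prints /export/brick13gfs01/.glusterfs/ae/fd/aefd1845-0841-4a8f-8408-f1ab8aa7a417
```

The resulting path can then be inspected with `ls -l` on each brick to see which copies still have a `.glusterfs` entry.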

          
          Seen from a client via a glusterfs mount:

          [root@client ~]# ls -al
          /mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*

          -rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          -rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          -rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          
          Via NFS (just after unmounting and mounting the volume
          again):

          [root@client ~]# ls -al
          /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*                                    


          -rw-r--r--. 1 root root 44332659200 Feb 17 23:55
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          -rw-r--r--. 1 root root 44332659200 Feb 17 23:55
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          -rw-r--r--. 1 root root 44332659200 Feb 17 23:55
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          
          Doing the same list a couple of seconds later:

          [root@client ~]# ls -al
          /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*

          -rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          -rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          -rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          And again, and again, and again:

          [root@client ~]# ls -al
          /mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*

          -rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          -rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          -rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd

          
          This really seems odd. Why do we get to see the "real data
          file" only once?

          
          It seems more and more that this crazy file duplication (and
          writing of sticky bit files) was actually triggered by
          rebooting one of the three nodes while there was still an
          active NFS connection (even with no data exchange at all),
          since all 0-byte files (of the non-sticky-bit type) were
          created at either 00:51 or 00:41, the exact moment one of
          the three nodes in the cluster was rebooted. This would mean
          that replication with GlusterFS currently creates hardly any
          redundancy. Quite the opposite: if one of the machines goes
          down, all of your data gets seriously disorganised. I am busy
          configuring a test installation to see how this can best be
          reproduced for a bug report.

          
          Does anyone have a suggestion how to best get rid of the
          duplicates, or rather get this mess organised the way it
          should be?

          This is a cluster with millions of files. A rebalance does not
          fix the issue, and neither does a rebalance fix-layout. Since
          this is a replicated volume all files should be there 2x, not
          3x. Can I safely just remove all the 0-byte files outside of
          the .glusterfs directory, including the sticky bit files?

          
          The empty 0-byte files outside of .glusterfs on every brick
          can probably be safely removed like this:

          find /export/* -path '*/.glusterfs' -prune -o -type f -size 0
          -perm 1000 -exec rm {} \;

          right?
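
Before deleting anything, the same selection can be previewed without side effects. The following is a minimal dry-run sketch, not a tested cleanup tool: it mirrors the find expression above (regular file, size 0, mode exactly 1000, `.glusterfs` pruned) and only returns the matching paths. The function name `find_sticky_zero_byte_files` is invented for this example.

```python
import os
import stat


def find_sticky_zero_byte_files(brick_root):
    """List zero-byte regular files whose mode is exactly 1000, skipping .glusterfs."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(brick_root):
        # Same effect as `-path '*/.glusterfs' -prune`: do not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".glusterfs"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: symlinks are not followed, like `-type f`
            if (stat.S_ISREG(st.st_mode)
                    and st.st_size == 0
                    and stat.S_IMODE(st.st_mode) == 0o1000):
                matches.append(path)
    return sorted(matches)
```

Calling it once per brick, for example `find_sticky_zero_byte_files("/export/brick14gfs01")`, gives a list to review first. Because the mode must be exactly 1000, the zero-byte files in mode 0644 shown earlier would not be selected by this filter or by the find command.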

          
          Thanks!

          
          Cheers,

          Olav

          On 18/02/15 22:10, Olav Peeters wrote:

        
          Thanks Tom and Joe,

            for the fast response!

            
            Before I started my upgrade I stopped all clients using the
            volume and stopped all VMs with VHDs on the volume, but I
            guess, and this may be the missing piece needed to reproduce
            this in a lab, I did not detach an NFS shared storage mount
            from a XenServer pool to this volume, since this is an
            extremely risky business. I also did not stop the volume.
            This, I guess, was a bit stupid, but since I did upgrades in
            the past this way without any issues I skipped this step (a
            really bad habit). I'll make amends and file a proper bug
            report :-). I agree with you Joe, this should never happen,
            even when someone ignores the advice of stopping the volume.
            If it were also necessary to detach shared storage NFS
            connections to a volume, then frankly, glusterfs is unusable
            in a private cloud. No one can afford downtime of the whole
            infrastructure just for a glusterfs upgrade. Ideally a
            replicated gluster volume should even be able to remain
            online and in use during an upgrade (at least a minor
            version upgrade).

            
            I don't know whether a heal was maybe running when I started
            the upgrade. I forgot to check. I did check the CPU activity
            on the gluster nodes, which was very low (in the 0.0X range
            via top), so I doubt it. I will add this to the bug report
            as a suggestion should they not be able to reproduce with an
            open NFS connection.

            
            By the way, is it sufficient to do:

            service glusterd stop

            service glusterfsd stop

            and do a:

            ps aux | grep gluster

            to see if everything has stopped and kill any leftovers
            should this be necessary?
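
One way to check for leftover processes is sketched below, assuming a Linux `ps` that accepts `-eo pid,args`. The function names are hypothetical, and the sketch only reports processes; it does not kill anything.

```python
import subprocess


def gluster_processes(ps_output):
    """Return (pid, command line) pairs for gluster-related lines of `ps -eo pid,args`."""
    found = []
    for line in ps_output.splitlines()[1:]:  # first line is the column header
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and "gluster" in parts[1]:
            found.append((int(parts[0]), parts[1]))
    return found


def list_running_gluster_processes():
    """Run ps and filter its output; an empty list means nothing gluster-related is left."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return gluster_processes(result.stdout)
```

After stopping the services, `list_running_gluster_processes()` can be called on each node; any remaining entries are candidates for a manual look before killing them.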

            
            For the fix, do you agree that if I run e.g.:

            find /export/* -type f -size 0 -perm 1000 -exec /bin/rm {}
            \;

            on every node, if /export is the location of all my bricks,
            also in a replicated set-up, this will be safe?

            No necessary 0-byte files will be deleted, e.g. in the
            .glusterfs of every brick?

            
            Thanks for your support!

            
            Cheers,

            Olav

            
            On 18/02/15 20:51, Joe Julian wrote:

          
            On 02/18/2015 11:43 AM, tbenzvi@xxxxxxxxxxxxxxx
              wrote:

            
              Hi Olav,
              

                I have a hunch that our problem was caused by improper
                unmounting of the gluster volume, and have since found
                that the proper order should be: kill all jobs using
                volume -> unmount volume on clients -> gluster
                volume stop -> stop gluster service (if necessary)
               
              In my case, I wrote a Python script to find duplicate
                files on the mounted volume, then delete the
                corresponding link files on the bricks (making sure to
                also delete files in the .glusterfs directory)
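
This is not the script mentioned above, only a minimal sketch of the detection step. It assumes the duplicates show up in a directory listing on the mounted volume the same way they do in `ls`, that is, as repeated names. The function names are invented for the example.

```python
import os
from collections import Counter


def duplicate_names(names):
    """Return {name: count} for every name that occurs more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


def duplicates_in_directory(directory):
    """Check one directory of the mounted volume for repeated entries."""
    return duplicate_names(os.listdir(directory))
```

Running `duplicates_in_directory()` on a directory of the client mount would then give the names to look up on the individual bricks.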
               
              However, your find command was also suggested to me
                and I think it's a simpler solution. I believe removing
                all link files (even ones that are not causing
                duplicates) is fine, since on the next file access gluster
                will do a lookup on all bricks and recreate any link
                files if necessary. Hopefully a gluster expert can chime
                in on this point as I'm not completely sure.
            
            
            You are correct.

            
              Keep in mind your setup is somewhat different than
                mine as I have only 5 bricks with no replication.
               
              Regards,
              Tom
               
              --------- Original Message ---------
                Subject: Re: Hundreds of duplicate files

                  From: "Olav Peeters" <opeeters@xxxxxxxxx>

                  Date: 2/18/15 10:52 am

                  To: gluster-users@xxxxxxxxxxx,
                  tbenzvi@xxxxxxxxxxxxxxx

                  
                  Hi all,

                    I'm having this problem after upgrading from 3.5.3
                    to 3.6.2.

                    At the moment I am still waiting for a heal to
                    finish (on a 31TB volume with 42 bricks, replicated
                    over three nodes).

                    
                    Tom,

                    how did you remove the duplicates?

                    with 42 bricks I will not be able to do this
                    manually..

                    Did a:

                    find $brick_root -type f -size 0 -perm 1000 -exec
                    /bin/rm {} \;

                    work for you?

                    
                    Should this type of thing ideally not be checked and
                    mended by a heal?

                    
                    Does anyone have an idea yet how this happens in the
                    first place? Can it be connected to upgrading?

                    
                    Cheers,

                    Olav

                     
                    On 01/01/15 03:07, tbenzvi@xxxxxxxxxxxxxxx
                    wrote:
                  
                    No, the files can be read on a newly mounted
                      client! I went ahead and deleted all of the link
                      files associated with these duplicates, and then
                      remounted the volume. The problem is fixed!
                    Thanks again for the help, Joe and Vijay.
                     
                    Tom
                     
                    --------- Original Message ---------
                      Subject: Re: Hundreds of duplicate files

                        From: "Vijay Bellur" <vbellur@xxxxxxxxxx>

                        Date: 12/28/14 3:23 am

                        To: tbenzvi@xxxxxxxxxxxxxxx,
                        gluster-users@xxxxxxxxxxx

                        
                        On 12/28/2014 01:20 PM, tbenzvi@xxxxxxxxxxxxxxx
                        wrote:

                        > Hi Vijay,

                        > Yes the files are still readable from the .glusterfs path.
                        > There is no explicit error. However, trying to read a text file in
                        > python simply gives me null characters:
                        >
                        > >>> open('ott_mf_itab').readlines()

                        >
['\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']

                        >

                        > And reading binary files does the same

                        >

                        
                        Is this behavior seen with a freshly mounted
                        client too?

                        
                        -Vijay

                        
                        > --------- Original Message ---------
                        > Subject: Re: Hundreds of duplicate files
                        > From: "Vijay Bellur" <vbellur@xxxxxxxxxx>
                        > Date: 12/27/14 9:57 pm
                        > To: tbenzvi@xxxxxxxxxxxxxxx, gluster-users@xxxxxxxxxxx
                        >
                        > On 12/28/2014 10:13 AM, tbenzvi@xxxxxxxxxxxxxxx wrote:

                        > > Thanks Joe, I've read your blog post as well as your post
                        > > regarding the .glusterfs directory.
                        > > I found some unneeded duplicate files which were not being read
                        > > properly. I then deleted the link file from the brick. This always
                        > > removes the duplicate file from the listing, but the file does not
                        > > always become readable. If I also delete the associated file in the
                        > > .glusterfs directory on that brick, then some more files become
                        > > readable. However this solution still doesn't work for all files.
                        > > I know the file on the brick is not corrupt as it can be read directly
                        > > from the brick directory.
                        >
                        > For files that are not readable from the client, can you check if the
                        > file is readable from the .glusterfs/ path?
                        >
                        > What is the specific error that is seen while trying to read one such
                        > file from the client?

                        >

                        > Thanks,

                        > Vijay

                        >

                        >

                        >

                        >
                        _______________________________________________

                        > Gluster-users mailing list

                        > Gluster-users@xxxxxxxxxxx

                        > http://www.gluster.org/mailman/listinfo/gluster-users

                        >
                    
                    