On 02/20/2015 01:47 PM, Olav Peeters wrote:
Thanks, Joe, for the answers!
I was apparently not clear enough about the set-up.
The Gluster cluster consists of 3 nodes with 14 bricks each. The
bricks are formatted as XFS and mounted locally as XFS. There is
one volume, type: Distributed-Replicate (replica 2). The
configuration is such that bricks are mirrored on two different
nodes.
The NFS mounts which were alive but not used during the reboot when
the problem started are from clients (2 XenServer machines
configured as a pool - a shared storage set-up). The comparisons
I give below are between (other) clients mounting via either
glusterfs or NFS. The problem is similar, with the exception that
the first listing (via ls) after a fresh mount via NFS actually
does find the files with data. A second listing only finds the
0-bit file with the same name.
So all the 0-bit files in mode 0644 can be safely removed?
Probably? Is it likely that you have any empty files? I don't know.
Why do I see three files with the same name (and modification
timestamp etc.) via either a glusterfs or NFS mount from a
client? Deleting one of the three will probably not solve the
issue either... this seems to me to be an indexing issue in the
gluster cluster.
Very good question. I don't know. The xattrs tell a strange story
that I haven't seen before. One legit file shows sr_vol01-client-32
and 33. This would be normal, assuming the filename hash would put
it on that replica pair (we can't tell, since the rebalance has
changed the hash map). Another file shows sr_vol01-client-32, 33,
34, and 35, with pending updates scheduled for 35. I have no idea
which brick this is (see "gluster volume info" and map the digits
(35) to the bricks, offset by 1: client-35 is brick 36). That last
one is on 40,41.
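Something like this quick sketch could do that client-to-brick
mapping (untested, and it assumes the usual "BrickN: host:/path"
lines in the CLI output; the volume name is the one from this
thread):

#!/usr/bin/env python
# Sketch: map AFR client indices like "sr_vol01-client-35" to brick
# paths by parsing "gluster volume info sr_vol01" piped in on stdin.
# "BrickN:" numbering in the CLI is 1-based; client indices are
# 0-based, so client-(N-1) corresponds to BrickN.
import re
import sys

bricks = {}
for line in sys.stdin:
    m = re.match(r'Brick(\d+):\s*(\S+)', line)
    if m:
        bricks[int(m.group(1)) - 1] = m.group(2)

for idx in (32, 33, 34, 35, 40, 41):
    print('client-%d -> %s' % (idx, bricks.get(idx, '???')))

Run as: gluster volume info sr_vol01 | python map_clients.py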
I don't know how these files all got on different replica sets. My
speculations include hostname changes, long-running net-split
conditions with different dht maps (failed rebalances), moved
bricks, load balancers between client and server, mercury in
retrograde (lol)...
How do I get Gluster to replicate
the files correctly, only 2 versions of the same file, not
three, and on two bricks on different machines?
Identify which replica is correct by using the little python script
at http://joejulian.name/blog/dht-misses-are-expensive/ to get the
hash of the filename. Examine the dht map to see which replica pair
*should* have that hash and remove the others (and their hardlink in
.glusterfs). There is no 1-liner that's going to do this. I would
probably script the logic in python, have it print out what it was
going to do, check that for sanity and, if sane, execute it.
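To sketch the layout half of that (not a finished tool: the
filename hash has to come from the script linked above, and the
16-byte trusted.glusterfs.dht disk layout, two header words
followed by big-endian start and stop of the range, is my
assumption; Python 3):

#!/usr/bin/env python
# Sketch (Python 3): decode each brick directory's DHT layout range
# and flag the bricks whose range contains a given filename hash.
# The hash itself comes from the dht-misses-are-expensive script.
import os
import struct
import sys

def layout_range(brick_dir):
    # Assumed classic 16-byte disk layout: the last two big-endian
    # 32-bit words are the start and stop of the hash range.
    raw = os.getxattr(brick_dir, 'trusted.glusterfs.dht')
    return struct.unpack('>II', raw[8:16])

filehash = int(sys.argv[1], 16)   # e.g. 0x7a3c91d2, from Joe's script
for brick in sys.argv[2:]:        # e.g. /export/brick13gfs01/272b2366-...
    start, stop = layout_range(brick)
    mark = '  <-- should hold the file' if start <= filehash <= stop else ''
    print('%s: 0x%08x - 0x%08x%s' % (brick, start, stop, mark))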
But mostly figure out how Bricks 32 and/or 33 can become 34 and/or
35 and/or 40 and/or 41. That's the root of the whole problem.
Cheers,
Olav
On 20/02/15 21:51, Joe Julian wrote:
On 02/20/2015 12:21 PM, Olav Peeters wrote:
Let's take one file (3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd) as an example...
On the 3 nodes where all bricks are formatted as XFS and
mounted in /export and 272b2366-dfbf-ad47-2a0f-5d5cc40863e3
is the mounting point of a NFS shared storage connection
from XenServer machines:
Did I just read this correctly? Your bricks are NFS mounts? i.e.,
GlusterFS Client <-> GlusterFS Server <-> NFS <-> XFS
[root@gluster01 ~]# find
/export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*'
-exec ls -la {} \;
-rw-r--r--. 2 root root 44332659200 Feb 17 23:55
/export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Supposedly, this is the actual file.
-rw-r--r--. 2 root root 0 Feb 18 00:51
/export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
This is not a linkfile. Note it's mode 0644. How it got there
with those permissions would be a matter of history and would
require information that's probably lost.
[root@gluster02 ~]# find
/export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*'
-exec ls -la {} \;
-rw-r--r--. 2 root root 44332659200 Feb 17 23:55
/export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
[root@gluster03 ~]# find
/export/*/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/ -name '300*'
-exec ls -la {} \;
-rw-r--r--. 2 root root 44332659200 Feb 17 23:55
/export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 2 root root 0 Feb 18 00:51
/export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Same analysis as above.
So: 3 files with data, and 2 x a 0-bit file with the same name.
Checking the 0-bit files:
[root@gluster01 ~]# getfattr -m . -d -e hex
/export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
getfattr: Removing leading '/' from absolute path names
# file:
export/brick14gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-34=0x000000000000000000000000
trusted.afr.sr_vol01-client-35=0x000000000000000000000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
[root@gluster03 ~]# getfattr -m . -d -e hex
/export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
getfattr: Removing leading '/' from absolute path names
# file:
export/brick14gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-34=0x000000000000000000000000
trusted.afr.sr_vol01-client-35=0x000000000000000000000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
This is not a glusterfs link file since there is no
"trusted.glusterfs.dht.linkto", am I correct?
You are correct.
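If it helps with sorting millions of files, here's a rough
classifier (an untested Python 3 sketch; the
trusted.glusterfs.dht.linkto xattr is the standard linkfile marker,
the categories are mine):

#!/usr/bin/env python
# Sketch (Python 3): classify a file on a brick as a DHT linkfile,
# a stray empty file, or a data file. Real linkfiles are 0 bytes,
# mode 1000 (---------T) and carry trusted.glusterfs.dht.linkto;
# the 0644 empties above carry no linkto, so they are not linkfiles.
import os
import sys

def classify(path):
    st = os.lstat(path)
    try:
        linkto = os.getxattr(path, 'trusted.glusterfs.dht.linkto')
    except OSError:
        linkto = None
    if st.st_size == 0 and linkto is not None:
        return 'DHT linkfile -> %s' % linkto.decode().rstrip('\x00')
    if st.st_size == 0:
        return 'stray empty file (not a linkfile)'
    return 'data file (%d bytes)' % st.st_size

for p in sys.argv[1:]:
    print('%s: %s' % (p, classify(p)))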
And checking the "good" files:
# file:
export/brick13gfs01/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-32=0x000000000000000000000000
trusted.afr.sr_vol01-client-33=0x000000000000000000000000
trusted.afr.sr_vol01-client-34=0x000000000000000000000000
trusted.afr.sr_vol01-client-35=0x000000010000000100000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
[root@gluster02 ~]# getfattr -m . -d -e hex
/export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
getfattr: Removing leading '/' from absolute path names
# file:
export/brick13gfs02/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-32=0x000000000000000000000000
trusted.afr.sr_vol01-client-33=0x000000000000000000000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
[root@gluster03 ~]# getfattr -m . -d -e hex
/export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
getfattr: Removing leading '/' from absolute path names
# file:
export/brick13gfs03/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
security.selinux=0x73797374656d5f753a6f626a6563745f723a66696c655f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.sr_vol01-client-40=0x000000000000000000000000
trusted.afr.sr_vol01-client-41=0x000000000000000000000000
trusted.gfid=0xaefd184508414a8f8408f1ab8aa7a417
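(Side note on reading these: as far as I understand it, each
trusted.afr value packs three big-endian 32-bit counters for
pending data, metadata and entry operations, so the
0x000000010000000100000000 for client-35 on brick13gfs01 above
reads as one pending data and one pending metadata operation. A
tiny decoder sketch, Python 3:)

#!/usr/bin/env python
# Sketch: decode trusted.afr.* values as printed by getfattr -e hex,
# assuming the standard 12-byte layout of three big-endian 32-bit
# counters: pending data, metadata and entry operations.
import struct
import sys

for val in sys.argv[1:]:
    raw = bytes.fromhex(val[2:] if val.startswith('0x') else val)
    data, metadata, entry = struct.unpack('>III', raw)
    print('%s: data=%d metadata=%d entry=%d' % (val, data, metadata, entry))

For example, python decode_afr.py 0x000000010000000100000000
prints data=1 metadata=1 entry=0.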
Seen from a client via a glusterfs mount:
[root@client ~]# ls -al
/mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
-rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/glusterfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Via NFS (just after unmounting and mounting the volume again):
[root@client ~]# ls -al
/mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
-rw-r--r--. 1 root root 44332659200 Feb 17 23:55
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 44332659200 Feb 17 23:55
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 44332659200 Feb 17 23:55
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
Doing the same list a couple of seconds later:
[root@client ~]# ls -al
/mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
-rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
And again, and again, and again:
[root@client ~]# ls -al
/mnt/nfs/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/300*
-rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
-rw-r--r--. 1 root root 0 Feb 18 00:51
/mnt/test/272b2366-dfbf-ad47-2a0f-5d5cc40863e3/3009f448-cf6e-413f-baec-c3b9f0cf9d72.vhd
This really seems odd. Why do we get to see the real data file only once?
It seems more and more that this crazy file duplication (and
writing of sticky-bit files) was actually triggered by rebooting
one of the three nodes while there was still an active NFS
connection (even with no data exchange at all), since all 0-bit
files (of the non-sticky-bit type) were created at either 00:51
or 00:41, the exact moments at which one of the three nodes in
the cluster was rebooted. This would mean that replication with
GlusterFS currently creates hardly any redundancy. Quite the
opposite: if one of the machines goes down, all of your data gets
seriously disorganised. I am busy configuring a test installation
to see how this can best be reproduced for a bug report...
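To check that correlation across all bricks, something like this
could inventory the empty files by minute (an untested sketch; the
/export/* brick layout is from this set-up):

#!/usr/bin/env python
# Sketch: bucket all 0-byte, mode-0644 files outside .glusterfs by
# modification minute, to correlate their creation with the reboots.
import collections
import glob
import os
import stat
import time

buckets = collections.Counter()
for brick in glob.glob('/export/*'):
    for root, dirs, files in os.walk(brick):
        if '.glusterfs' in dirs:
            dirs.remove('.glusterfs')   # skip the gfid tree
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if st.st_size == 0 and stat.S_IMODE(st.st_mode) == 0o644:
                minute = time.strftime('%Y-%m-%d %H:%M',
                                       time.localtime(st.st_mtime))
                buckets[minute] += 1

for minute, count in sorted(buckets.items()):
    print('%s  %d empty files' % (minute, count))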
Does anyone have a suggestion how best to get rid of the
duplicates, or rather how to get this mess organised the way it
should be?
This is a cluster with millions of files. A rebalance does not
fix the issue; neither does a rebalance fix-layout. Since this is
a replicated volume, all files should be there 2x, not 3x. Can I
safely just remove all the 0-bit files outside of the .glusterfs
directory, including the sticky-bit files?
The empty 0-bit files outside of .glusterfs on every brick can
probably be safely removed like this:
find /export/* -path '*/.glusterfs' -prune -o -type f -size 0 -perm 1000 -exec rm {} \;
no?
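Before actually deleting anything, a dry run that just prints the
same candidates might be safer (an untested sketch mirroring the
find above):

#!/usr/bin/env python
# Dry-run sketch: print every 0-byte, mode-1000 (sticky, no perms)
# file outside .glusterfs on every brick under /export, i.e. the
# same candidates the find one-liner above would delete.
import glob
import os
import stat

for brick in glob.glob('/export/*'):
    for root, dirs, files in os.walk(brick):
        if '.glusterfs' in dirs:
            dirs.remove('.glusterfs')   # never descend into the gfid tree
        for name in files:
            path = os.path.join(root, name)
            st = os.lstat(path)
            if st.st_size == 0 and stat.S_IMODE(st.st_mode) == 0o1000:
                print(path)             # candidate DHT linkfile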
Thanks!
Cheers,
Olav
On 18/02/15 22:10, Olav Peeters wrote:
Thanks, Tom and Joe, for the fast response!
Before I started my upgrade I stopped all clients using
the volume and stopped all VMs with VHDs on the volume,
but I guess (and this may be the missing thing to
reproduce this in a lab) I did not detach the NFS shared
storage mount from the XenServer pool to this volume, since
that is an extremely risky business. I also did not stop
the volume. This, I guess, was a bit stupid, but since I did
upgrades in the past this way without any issues I skipped
this step (a really bad habit). I'll make amends and file
a proper bug report :-). I agree with you, Joe, this should
never happen, even when someone ignores the advice of
stopping the volume. If it were also necessary to detach
shared storage NFS connections to a volume, then frankly,
glusterfs would be unusable in a private cloud. No one can
afford downtime of the whole infrastructure just for a
glusterfs upgrade. Ideally a replicated gluster volume
should even be able to remain online and in use during (at
least a minor version) upgrade.
I don't know whether a heal was busy when I started the
upgrade. I forgot to check. I did check the CPU activity
on the gluster nodes, which was very low (in the 0.0X
range via top), so I doubt it. I will add this to the bug
report as a suggestion, should they not be able to
reproduce with an open NFS connection.
By the way, is it sufficient to do:
service glusterd stop
service glusterfsd stop
and do a:
ps aux | grep gluster
to see if everything has stopped, and kill any leftovers
should this be necessary?
For the fix, do you agree that if I run e.g.:
find /export/* -type f -size 0 -perm 1000 -exec /bin/rm {} \;
on every node, where /export is the location of all my bricks,
this will be safe, also in a replicated set-up?
No necessary 0-bit files will be deleted, e.g. in the
.glusterfs of every brick?
Thanks for your support!
Cheers,
Olav
On 18/02/15 20:51, Joe Julian wrote:
Hi Olav,
I have a hunch that our problem was caused by improper
unmounting of the gluster volume, and have since found
that the proper order should be: kill all jobs using the
volume -> unmount the volume on clients -> gluster
volume stop -> stop the gluster service (if necessary)
In my case, I wrote a Python script to find duplicate
files on the mounted volume, then delete the
corresponding link files on the bricks (making sure to
also delete the matching files in the .glusterfs directory).
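A sketch of one building block for that kind of script (untested;
the .glusterfs/<aa>/<bb>/<gfid> layout of the gfid tree is
standard, the rest is illustrative; Python 3):

#!/usr/bin/env python
# Sketch (Python 3): given a brick and a file on it, derive the path
# of the file's hardlink under the brick's .glusterfs gfid tree from
# its trusted.gfid xattr.
import os
import sys
import uuid

def gfid_path(brick, path):
    gfid = str(uuid.UUID(bytes=os.getxattr(path, 'trusted.gfid')))
    return os.path.join(brick, '.glusterfs', gfid[0:2], gfid[2:4], gfid)

brick, path = sys.argv[1], sys.argv[2]
print(gfid_path(brick, path))

For the gfid above (0xaefd1845...) that would resolve to
.glusterfs/ae/fd/aefd1845-0841-4a8f-8408-f1ab8aa7a417 on the brick.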
However, your find command was also suggested to me,
and I think it's a simpler solution. I believe
removing all link files (even ones that are not
causing duplicates) is fine, since on the next file
access gluster will do a lookup on all bricks and
recreate any link files if necessary. Hopefully a
gluster expert can chime in on this point as I'm not
completely sure.
You are correct.
Keep in mind your setup is somewhat different than
mine as I have only 5 bricks with no replication.
Regards,
Tom
--------- Original Message ---------
Subject: Re: Hundreds of duplicate files
From: "Olav Peeters" <opeeters@xxxxxxxxx>
Date: 2/18/15 10:52 am
To: gluster-users@xxxxxxxxxxx, tbenzvi@xxxxxxxxxxxxxxx
Hi all,
I'm having this problem after upgrading from 3.5.3
to 3.6.2.
At the moment I am still waiting for a heal to
finish (on a 31TB volume with 42 bricks,
replicated over three nodes).
Tom,
how did you remove the duplicates?
With 42 bricks I will not be able to do this
manually...
Did a:
find $brick_root -type f -size 0 -perm 1000 -exec /bin/rm {} \;
work for you?
Should this type of thing ideally not be checked
and mended by a heal?
Does anyone have an idea yet how this happens in
the first place? Can it be connected to upgrading?
Cheers,
Olav
On 01/01/15 03:07, tbenzvi@xxxxxxxxxxxxxxx wrote:
No, the files can be read on a newly mounted
client! I went ahead and deleted all of the link
files associated with these duplicates, and then
remounted the volume. The problem is fixed!
Thanks again for the help, Joe and Vijay.
Tom
--------- Original Message ---------
Subject: Re: Hundreds of duplicate files
From: "Vijay Bellur" <vbellur@xxxxxxxxxx>
Date: 12/28/14 3:23 am
To: tbenzvi@xxxxxxxxxxxxxxx, gluster-users@xxxxxxxxxxx
On 12/28/2014 01:20 PM, tbenzvi@xxxxxxxxxxxxxxx wrote:
> Hi Vijay,
> Yes the files are still readable from the .glusterfs path.
> There is no explicit error. However, trying to read a text file in
> python simply gives me null characters:
>
> >>> open('ott_mf_itab').readlines()
>
['\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']
>
> And reading binary files does the same
>
Is this behavior seen with a freshly mounted client too?
-Vijay
> --------- Original Message ---------
> Subject: Re: Hundreds of duplicate files
> From: "Vijay Bellur" <vbellur@xxxxxxxxxx>
> Date: 12/27/14 9:57 pm
> To: tbenzvi@xxxxxxxxxxxxxxx, gluster-users@xxxxxxxxxxx
>
> On 12/28/2014 10:13 AM, tbenzvi@xxxxxxxxxxxxxxx wrote:
> > Thanks Joe, I've read your blog post as well as your post
> > regarding the .glusterfs directory.
> > I found some unneeded duplicate files which were not being read
> > properly. I then deleted the link file from the brick. This always
> > removes the duplicate file from the listing, but the file does not
> > always become readable. If I also delete the associated file in the
> > .glusterfs directory on that brick, then some more files become
> > readable. However this solution still doesn't work for all files.
> > I know the file on the brick is not corrupt as it can be read
> > directly from the brick directory.
>
> For files that are not readable from the client, can you check if
> the file is readable from the .glusterfs/ path?
>
> What is the specific error that is seen while trying to read one
> such file from the client?
>
> Thanks,
> Vijay
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users