On 01/06/19 9:37 PM, Alan Orth wrote:
Dear Ravi,
The .glusterfs hardlinks/symlinks should be fine. I'm not
sure how I could verify them for six bricks and millions of
files, though... :\
Hi Alan,
The reason I asked this is because you had mentioned in one of
your earlier emails that when you moved content from the old brick
to the new one, you had skipped the .glusterfs directory. So I was
assuming that when you added back this new brick to the cluster,
it might have been missing the .glusterfs entries. If that is the
case, one way to verify could be to check using a script if all
files on the brick have a link-count of at least 2 and all dirs
have valid symlinks inside .glusterfs pointing to themselves.
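
A rough sketch of such a check, to be run as root on each brick host. It assumes the usual backend layout where the handle for a GFID lives at .glusterfs/<first two hex chars>/<next two>/<gfid>, as a hardlink for files and a symlink for directories. The function names and the `get_gfid` parameter are made up for illustration, and it has not been tried on a production brick, so treat it as a starting point only:

```python
import os
import stat
import uuid


def gfid_backend_path(brick_root, gfid_bytes):
    """Expected .glusterfs handle path for a 16-byte trusted.gfid value."""
    gfid = str(uuid.UUID(bytes=bytes(gfid_bytes)))
    return os.path.join(brick_root, ".glusterfs", gfid[:2], gfid[2:4], gfid)


def read_gfid(path):
    # Reading trusted.* xattrs needs root; Linux only.
    return os.getxattr(path, "trusted.gfid", follow_symlinks=False)


def check_brick(brick_root, get_gfid=read_gfid):
    """Return a list of (path, problem) for entries with a bad handle."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(brick_root):
        if dirpath == brick_root and ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")  # do not walk the handle tree itself
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                handle = gfid_backend_path(brick_root, get_gfid(path))
            except (OSError, ValueError):
                problems.append((path, "no usable trusted.gfid xattr"))
                continue
            st = os.lstat(path)
            if stat.S_ISDIR(st.st_mode):
                # Directories: the handle must be a symlink that resolves
                # back to this same directory.
                if not os.path.islink(handle):
                    problems.append((path, "missing .glusterfs symlink"))
                elif not (os.path.exists(handle)
                          and os.path.samefile(handle, path)):
                    problems.append((path, "symlink does not resolve here"))
            else:
                # Files (and user symlinks): the handle is a hardlink, so the
                # link count must be at least 2 and the inodes must match.
                if st.st_nlink < 2:
                    problems.append((path, "link count below 2"))
                elif (not os.path.lexists(handle)
                      or os.lstat(handle).st_ino != st.st_ino):
                    problems.append((path, "handle is not a hardlink to file"))
    return problems
```

Calling `check_brick("/data/glusterfs/sdb/apps")` would walk that brick and return the suspicious entries. With millions of files it will take a while, so trying it on one subdirectory first is probably wise.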
I had a small success in fixing some issues with duplicated
files on the FUSE mount point yesterday. I read quite a bit
about the elastic hashing algorithm that determines which
files get placed on which bricks based on the hash of their
filename and the trusted.glusterfs.dht xattr on brick
directories (thanks to Joe Julian's blog post and Python
script for showing how it works¹). With that knowledge I
looked closer at one of the files that was appearing as
duplicated on the FUSE mount and found that it was also
duplicated on more than `replica 2` bricks. For this
particular file I found two "real" files and several zero-size
files with trusted.glusterfs.dht.linkto xattrs. Neither of the
"real" files was on the correct brick as far as the DHT
layout is concerned, so I copied one of them to the correct
brick, deleted the others and their hard links, and did a
`stat` on the file from the FUSE mount point and it fixed
itself. Yay!
Could this have been caused by a replace-brick that got
interrupted and didn't finish re-labeling the xattrs?
No, replace-brick only initiates AFR self-heal, which just copies
the contents from the other brick(s) of the *same* replica pair into
the replaced brick. The link-to files are created by DHT when you
rename a file from the client. If the new name hashes to a
different brick, DHT does not move the entire file there. It
instead creates the link-to file (the one with the dht.linkto
xattrs) on the hashed subvol. The value of this xattr points to the
brick where the actual data resides (use `getfattr -e text` to see it
for yourself). Perhaps you had attempted a rebalance or
remove-brick earlier and interrupted that?
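
If you only have the hex dump from `getfattr -e hex`, the value can be decoded by hand. A small sketch (the helper name `decode_linkto` is made up), using the linkto value from the getfattr output quoted further down in this thread:

```python
def decode_linkto(hex_value):
    """Decode a `getfattr -e hex` linkto value into the subvolume name."""
    if hex_value.startswith("0x"):
        hex_value = hex_value[2:]
    raw = bytes.fromhex(hex_value)
    # The stored string carries a trailing NUL byte; strip it.
    return raw.rstrip(b"\x00").decode("ascii")


print(decode_linkto("0x617070732d7265706c69636174652d3200"))
# apps-replicate-2
```

So both link-to copies of that file point at the replica subvolume named apps-replicate-2.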
Should I be thinking of some heuristics to identify and fix
these issues with a script (incorrect brick placement), or is
this something a fix layout or repeated volume heals can fix?
I've already completed a whole heal on this particular volume
this week and it did heal about 1,000,000 files (mostly data
and metadata, but about 20,000 entry heals as well).
Maybe you should let the AFR self-heals complete first and then
attempt a full rebalance to take care of the dht link-to files.
But if the files are in millions, it could take quite some time
to complete.
Regards,
Ravi
On 31/05/19 3:20 AM, Alan Orth wrote:
Dear Ravi,
I spent a bit of time inspecting the xattrs on some
files and directories on a few bricks for this volume
and it looks a bit messy. Even if I could make sense
of it for a few and potentially heal them manually,
there are millions of files and directories in total
so that's definitely not a scalable solution. After a
few missteps with `replace-brick ... commit force` in
the last week—one of which on a brick that was
dead/offline—as well as some premature `remove-brick`
commands, I'm unsure how to proceed and I'm
getting demotivated. It's scary how quickly things get
out of hand in distributed systems...
Hi Alan,
The one good thing about gluster is that the data is
always available directly on the backend bricks even if your
volume has inconsistencies at the gluster level. So
theoretically, if your cluster is FUBAR, you could just
create a new volume and copy all data onto it via its mount
from the old volume's bricks.
I had hoped that bringing the old brick back up
would help, but by the time I added it again a few
days had passed and all the brick-id's had changed
due to the replace/remove brick commands, not to
mention that the trusted.afr.$volume-client-xx
values were now probably pointing to the wrong
bricks (?).
Anyways, a few hours ago I started a full heal on
the volume and I see that there is a sustained
100MiB/sec of network traffic going from the old
brick's host to the new one. The completed heals
reported in the logs look promising too:
Old brick host:
# grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E 'Completed (data|metadata|entry) selfheal' | sort | uniq -c
281614 Completed data selfheal
84 Completed entry selfheal
299648 Completed metadata selfheal
New brick host:
# grep '2019-05-30' /var/log/glusterfs/glustershd.log | grep -o -E 'Completed (data|metadata|entry) selfheal' | sort | uniq -c
198256 Completed data selfheal
16829 Completed entry selfheal
229664 Completed metadata selfheal
So that's good I guess, though I have no idea how
long it will take or if it will fix the "missing
files" issue on the FUSE mount. I've increased
cluster.shd-max-threads to 8 to hopefully speed up the
heal process.
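
For anyone who wants to track these counts over several days, the same tally as the grep pipeline above can be sketched in Python. The function name `count_heals` is made up, and the only thing it assumes about the log is that the date and the "Completed ... selfheal" text appear on the same line, exactly as the grep commands do:

```python
import re
from collections import Counter

# Same pattern as the grep -E expression used on glustershd.log.
HEAL_RE = re.compile(r"Completed (data|metadata|entry) selfheal")


def count_heals(lines, day):
    """Count completed self-heals per type on lines mentioning `day`."""
    counts = Counter()
    for line in lines:
        if day not in line:
            continue
        match = HEAL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Usage would be something like `count_heals(open("/var/log/glusterfs/glustershd.log"), "2019-05-30")`, which returns a Counter keyed by "data", "metadata" and "entry".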
The afr xattrs should not cause files to disappear from the
mount. If the xattr names do not match what each AFR subvol
expects for its children (e.g. in a replica 2 volume,
trusted.afr.*-client-{0,1} for the 1st subvol, client-{2,3} for
the 2nd subvol and so on), then it won't heal the data, that is
all. But in your case I see some inconsistencies, like one brick
having the actual file (licenseserver.cfg) and the other having
a linkto file (the one with the dht.linkto xattr) in the same
replica pair.
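
To make the naming rule concrete, here is a small sketch. It assumes the default numbering, where the client index is the 0-based position of the brick in the brick list; after replace-brick or remove-brick the brick-ids may no longer follow that order, so check the real ids before trusting the result. The function names are made up:

```python
def afr_xattrs_for_brick(volume, brick_index, replica):
    """Return (subvol name, expected trusted.afr xattr names) for a brick.

    brick_index is 0-based; assumes client-N follows the brick order.
    """
    subvol = brick_index // replica
    first = subvol * replica
    names = [f"trusted.afr.{volume}-client-{i}"
             for i in range(first, first + replica)]
    return f"{volume}-replicate-{subvol}", names


def subvols_of_clients(client_indices, replica):
    """Set of replica subvol numbers a group of client indices spans."""
    return {i // replica for i in client_indices}


print(afr_xattrs_for_brick("apps", 4, 2))
# ('apps-replicate-2', ['trusted.afr.apps-client-4', 'trusted.afr.apps-client-5'])
print(subvols_of_clients([3, 5], 2))
# {1, 2}
```

Under that assumption, a file carrying both trusted.afr.apps-client-3 and trusted.afr.apps-client-5 (as licenseserver.cfg does on wingu0) references two different replica subvols, which a healthy replica 2 pair would not show.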
I'd be happy for any advice or pointers,
Did you check if the .glusterfs hardlinks/symlinks exist
and are in order for all bricks?
-Ravi
Dear Ravi,
Thank you for the link to the blog post
series—it is very informative and current! If I
understand your blog post correctly then I think
the answer to your previous question about pending
AFRs is: no, there are no pending AFRs. I have
identified one file that is a good test case to
try to understand what happened after I issued the
`gluster volume replace-brick ... commit force` a
few days ago and then added the same original
brick back to the volume later. This is the
current state of the replica 2
distribute/replicate volume:
[root@wingu0 ~]# gluster volume info apps
Volume Name: apps
Type: Distributed-Replicate
Volume ID: f118d2da-79df-4ee1-919d-53884cd34eda
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: wingu3:/mnt/gluster/apps
Brick2: wingu4:/mnt/gluster/apps
Brick3: wingu05:/data/glusterfs/sdb/apps
Brick4: wingu06:/data/glusterfs/sdb/apps
Brick5: wingu0:/mnt/gluster/apps
Brick6: wingu05:/data/glusterfs/sdc/apps
Options Reconfigured:
diagnostics.client-log-level: DEBUG
storage.health-check-interval: 10
nfs.disable: on
I checked the xattrs of one file that is
missing from the volume's FUSE mount (though I can
read it if I access its full path explicitly), but
is present in several of the volume's bricks (some
with full size, others empty):
[root@wingu0 ~]# getfattr -d -m. -e hex /mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
getfattr: Removing leading '/' from absolute path names
# file: mnt/gluster/apps/clcgenomics/clclicsrv/licenseserver.cfg
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.apps-client-3=0x000000000000000000000000
trusted.afr.apps-client-5=0x000000000000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x0200000000000000585a396f00046e15
trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
[root@wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
getfattr: Removing leading '/' from absolute path names
# file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
[root@wingu05 ~]# getfattr -d -m. -e hex /data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
getfattr: Removing leading '/' from absolute path names
# file: data/glusterfs/sdc/apps/clcgenomics/clclicsrv/licenseserver.cfg
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
[root@wingu06 ~]# getfattr -d -m. -e hex /data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
getfattr: Removing leading '/' from absolute path names
# file: data/glusterfs/sdb/apps/clcgenomics/clclicsrv/licenseserver.cfg
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.gfid=0x878003a2fb5243b6a0d14d2f8b4306bd
trusted.gfid2path.82586deefbc539c3=0x34666437323861612d356462392d343836382d616232662d6564393031636566333561392f6c6963656e73657365727665722e636667
trusted.glusterfs.dht.linkto=0x617070732d7265706c69636174652d3200
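
For reference, the hex values above are easier to compare once decoded. A quick sketch with made-up helper names; it relies on trusted.gfid being a raw 16-byte UUID and on the gfid2path value being the text "<parent gfid>/<file name>":

```python
import uuid


def decode_gfid(hex_value):
    """Turn a `getfattr -e hex` trusted.gfid value into canonical UUID form."""
    if hex_value.startswith("0x"):
        hex_value = hex_value[2:]
    return str(uuid.UUID(hex=hex_value))


def decode_gfid2path(hex_value):
    """Split a trusted.gfid2path.* value into (parent gfid, file name)."""
    if hex_value.startswith("0x"):
        hex_value = hex_value[2:]
    parent, _, name = bytes.fromhex(hex_value).decode("utf-8").partition("/")
    return parent, name


print(decode_gfid("0x878003a2fb5243b6a0d14d2f8b4306bd"))
# 878003a2-fb52-43b6-a0d1-4d2f8b4306bd
print(decode_gfid2path(
    "0x34666437323861612d356462392d343836382d616232662d"
    "6564393031636566333561392f6c6963656e73657365727665722e636667"))
# ('4fd728aa-5db9-4868-ab2f-ed901cef35a9', 'licenseserver.cfg')
```

All four copies carry the same GFID, so they are the same file as far as Gluster is concerned; they differ only in which of them are link-to files and which hold data.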
According to the trusted.afr.apps-client-xx xattrs this
particular file should be on bricks with id "apps-client-3"
and "apps-client-5". It took me a few hours to realize that
the brick-id values are recorded in the volume's volfiles in
/var/lib/glusterd/vols/apps/bricks. After comparing those
brick-id values with a volfile backup from before the
replace-brick, I realized that the files are simply on the
wrong brick now as far as Gluster is concerned. This
particular file is now on the brick for "apps-client-4". As
an experiment I copied this one file to the two bricks listed
in the xattrs and I was then able to see the file from the
FUSE mount (yay!).
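
In case it helps someone else, the brick-id comparison can be sketched roughly like this. It assumes each file under the bricks directory is plain key=value text containing a brick-id line; the function names and the backup path in the usage note are hypothetical:

```python
import os


def parse_brick_file(text):
    """Parse key=value lines from one brick file into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info


def brick_ids(bricks_dir):
    """Map each brick-id to the brick file that declares it."""
    result = {}
    for name in sorted(os.listdir(bricks_dir)):
        with open(os.path.join(bricks_dir, name)) as handle:
            info = parse_brick_file(handle.read())
        if "brick-id" in info:
            result[info["brick-id"]] = name
    return result


def changed_ids(old_dir, new_dir):
    """Return {brick-id: (old file, new file)} where the mapping differs."""
    old, new = brick_ids(old_dir), brick_ids(new_dir)
    return {bid: (old.get(bid), new.get(bid))
            for bid in sorted(set(old) | set(new))
            if old.get(bid) != new.get(bid)}
```

Something like `changed_ids("/root/vols-backup/apps/bricks", "/var/lib/glusterd/vols/apps/bricks")` would then list the ids that moved between the backup and the live configuration.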
Other than replacing the brick, removing it,
and then adding the old brick on the original
server back, there has been no change in the data
this entire time. Can I change the brick IDs in
the volfiles so they reflect where the data
actually is? Or perhaps script something to reset
all the xattrs on the files/directories to point
to the correct bricks?
Thank you for any help or pointers,
On 29/05/19 9:50 AM, Ravishankar N wrote:
On 29/05/19 3:59 AM, Alan Orth wrote:
Dear Ravishankar,
I'm not sure if Brick4 had pending
AFRs because I don't know what that
means and it's been a few days so I am
not sure I would be able to find that
information.
When you find some time, have a look at a blog series I
wrote about AFR. In it I've tried to explain what
one needs to know to debug replication-related
issues.
Made a typo. The URL for the blog is https://wp.me/peiBB-6b
-Ravi
Anyways, after wasting a few days
rsyncing the old brick to a new host I
decided to just try to add the old brick
back into the volume instead of bringing
it up on the new host. I created a new
brick directory on the old host, moved
the old brick's contents into that new
directory (minus the .glusterfs
directory), added the new brick to the
volume, and then did Vlad's find/stat
trick¹ from the brick to the FUSE mount
point.
The interesting problem I have now is
that some files don't appear in the FUSE
mount's directory listings, but I can
actually list them directly and even
read them. What could cause that?
Not sure, there are too many variables in the hacks
that you did to take a guess. You can check if the
contents of the .glusterfs folder are in order
on the new brick (for example, that hardlinks for
files and symlinks for directories are present).
Regards,
Ravi
On 23/05/19 2:40 AM, Alan Orth wrote:
Dear list,
I seem to have gotten into a
tricky situation. Today I
brought up a shiny new server
with new disk arrays and
attempted to replace one brick
of a replica 2
distribute/replicate volume on
an older server using the
`replace-brick` command:
# gluster volume replace-brick homes wingu0:/mnt/gluster/homes wingu06:/data/glusterfs/sdb/homes commit force
The command was successful
and I see the new brick in the
output of `gluster volume info`.
The problem is that Gluster
doesn't seem to be migrating the
data,
`replace-brick` definitely must
heal (not migrate) the data. In your
case, data must have been healed
from Brick-4 to the replaced
Brick-3. Are there any errors in the
self-heal daemon logs of Brick-4's
node? Does Brick-4 have pending AFR
xattrs blaming Brick-3? The doc is a
bit out of date. The replace-brick
command internally does all the
setfattr steps that are mentioned in
the doc.
-Ravi
and now the original brick
that I replaced is no longer
part of the volume (and a few
terabytes of data are just
sitting on the old brick):
# gluster volume info homes | grep -E "Brick[0-9]:"
Brick1: wingu4:/mnt/gluster/homes
Brick2: wingu3:/mnt/gluster/homes
Brick3: wingu06:/data/glusterfs/sdb/homes
Brick4: wingu05:/data/glusterfs/sdb/homes
Brick5: wingu05:/data/glusterfs/sdc/homes
Brick6: wingu06:/data/glusterfs/sdc/homes
I see the Gluster docs have a
more complicated procedure for
replacing bricks that involves
getfattr/setfattr¹. How can I
tell Gluster about the old
brick? I see that I have a
backup of the old volfile thanks
to yum's rpmsave function if
that helps.
We are using Gluster 5.6 on
CentOS 7. Thank you for any
advice you can give.
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users