Re: Advice on recovering from a bad replica please

On 06/26/2014 04:10 AM, John Gardeniers wrote:
Hi Pranith,

jupiter currently has no gluster processes running.
jupiter.om.net:/gluster_backup is a geo-replica.

[root@nix ~]# gluster volume info
Volume Name: gluster-backup
Type: Distribute
Volume ID: 0905fb11-f95a-4533-ae1c-05be43a8fe1f
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: jupiter.om.net:/gluster_backup

Volume Name: gluster-rhev
Type: Replicate
Volume ID: b210cba9-56d3-4e08-a4d0-2f1fe8a46435
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: jupiter.om.net:/gluster_brick_1
Brick2: nix.om.net:/gluster_brick_1
Options Reconfigured:
geo-replication.indexing: on
nfs.disable: on

I am extremely sorry, I should have asked for this information yesterday as well. 1) What version of gluster are you using? In 3.4.x there is an issue where self-heal won't start while operations are happening on the VM, which I believe is not the case in 3.5, so the version matters. I only remembered it this morning.
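
For reference, a minimal sketch of how to check both of these (it assumes the gluster CLI is on the PATH and uses the volume name from the volume info above):

# Print the installed gluster version on each server
gluster --version | head -n 1

# List the entries the self-heal daemon currently reports for the replica volume
gluster volume heal gluster-rhev info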

2) I believe the number of files on the bricks should be quite small, considering it is a rhevm setup. Could you please also attach the output of the following?

For each brick:
find <brick-path> -print0 | xargs -0 getfattr -d -m . -e hex > file-you-need-to-send-us.txt

This should let us see the xattrs of the files and help us advise you on how to fix the split-brains where necessary.
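
For example, a minimal sketch of collecting this on both servers (brick paths are taken from the volume info above; the output filenames are just placeholders):

# On jupiter.om.net
find /gluster_brick_1 -print0 | xargs -0 getfattr -d -m . -e hex > jupiter-xattrs.txt

# On nix.om.net
find /gluster_brick_1 -print0 | xargs -0 getfattr -d -m . -e hex > nix-xattrs.txt

The trusted.afr.* and trusted.gfid values in that output are what show whether a file is in split-brain.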

Pranith

regards,
John


On 25/06/14 19:05, Pranith Kumar Karampuri wrote:
On 06/25/2014 04:29 AM, John Gardeniers wrote:
Hi All,

We're using Gluster as the storage for our virtualization. This consists
of 2 servers with a single brick each configured as a replica pair. We
also have a geo-replica on one of those two servers.

For reasons that don't really matter, last weekend we had a situation
which caused one server to reboot a number of times, which in turn
resulted in a lot of heal-failed and split-brain errors. Because VMs
were being migrated across hosts at the same time, we ended up with
many crashed VMs.

Due to the need to get the VMs up and running as quickly as possible,
we decided to shut down one Gluster replica and use the "primary" one
alone. As the geo-replica is also on the node we shut down, that leaves
us with just a single copy, which makes us rather nervous.

As we have decided to treat the files on the currently running node as
"correct", I'd appreciate advice on the best way to get the other node
back into the replication. Should we simply bring it back online and
try to correct the errors, which I expect will be many, or should we
treat it as a failed server and bring it back with an empty brick
rather than what is currently in the existing brick? The volume/bricks
are 5TB, of which we're currently using around 2TB, and the servers are
on a 10Gb network, so I imagine it shouldn't take too long to rebuild,
and this would all be done out of hours anyway.
Considering you are saying there were split-brain related errors as
well, I suggest you bring it back up with an empty brick.
Could you give the "gluster volume info" output and tell me which brick
went down? Based on that I will tell you what you need to do.
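
For context, the empty-brick approach usually looks roughly like the sketch below. This is only an outline, not the exact steps Pranith goes on to give; it assumes jupiter.om.net is the node being rebuilt (since the geo-replica brick lives there) and uses the volume-id and brick path from the volume info above:

# On the node being rebuilt, with all gluster services stopped:
mv /gluster_brick_1 /gluster_brick_1.old    # keep the old data until the heal completes
mkdir /gluster_brick_1

# Restore the volume-id xattr so glusterd accepts the empty directory as the brick
# (the UUID is the gluster-rhev Volume ID from "gluster volume info" above)
setfattr -n trusted.glusterfs.volume-id \
    -v 0x$(echo b210cba9-56d3-4e08-a4d0-2f1fe8a46435 | tr -d '-') /gluster_brick_1

# Start gluster again and trigger a full self-heal from the good copy
service glusterd start
gluster volume heal gluster-rhev full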

Pranith
regards,
John

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users



