Hi Pranith,

jupiter currently has no gluster processes running.
jupiter.om.net:/gluster_backup is a geo-replica.

[root@nix ~]# gluster volume info

Volume Name: gluster-backup
Type: Distribute
Volume ID: 0905fb11-f95a-4533-ae1c-05be43a8fe1f
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: jupiter.om.net:/gluster_backup

Volume Name: gluster-rhev
Type: Replicate
Volume ID: b210cba9-56d3-4e08-a4d0-2f1fe8a46435
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: jupiter.om.net:/gluster_brick_1
Brick2: nix.om.net:/gluster_brick_1
Options Reconfigured:
geo-replication.indexing: on
nfs.disable: on

regards,
John

On 25/06/14 19:05, Pranith Kumar Karampuri wrote:
>
> On 06/25/2014 04:29 AM, John Gardeniers wrote:
>> Hi All,
>>
>> We're using Gluster as the storage for our virtualization. This consists
>> of 2 servers with a single brick each, configured as a replica pair. We
>> also have a geo-replica on one of those two servers.
>>
>> For reasons that don't really matter, last weekend we had a situation
>> which caused one server to reboot a number of times, which in turn
>> resulted in a lot of heal-failed and split-brain errors. Because VMs
>> were being migrated across hosts at the same time, we ended up with
>> many crashed VMs.
>>
>> Due to the need to get the VMs up and running as quickly as possible,
>> we decided to shut down one Gluster replica and use the "primary" one
>> alone. As the geo-replica is also on the node we shut down, that leaves
>> us with just a single copy, which makes us rather nervous.
>>
>> As we have decided to treat the files on the currently running node as
>> "correct", I'd appreciate advice on the best way to get the other node
>> back into the replication. Should we simply bring it back online and
>> try to correct the errors, which I expect will be many, or should we
>> treat it as a failed server and bring it back with an empty brick,
>> rather than what is currently in the existing brick?
>> The volume/bricks are 5TB, of which we're currently using around 2TB,
>> and the servers are on a 10Gb network, so I imagine it shouldn't take
>> too long to rebuild, and this would all be done out of hours anyway.
> Considering you are saying there were split-brain related errors as
> well, I suggest you bring up an empty brick.
> Could you give "gluster volume info" output and tell me which brick
> went down? Based on that I will tell you what you need to do.
>
> Pranith
>>
>> regards,
>> John

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users
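For reference, the empty-brick approach Pranith suggests usually looks roughly like the following. This is a sketch only, not tested against this setup: the volume name (gluster-rhev) and brick path (/gluster_brick_1) are taken from the volume info above, and the commands assume jupiter's brick directory itself is kept in place so its trusted.glusterfs.volume-id extended attribute survives (glusterd checks that xattr before it will start the brick).

```shell
# On jupiter, with all gluster daemons stopped: empty the brick's
# contents (including the .glusterfs metadata directory) but keep the
# brick directory itself, so its volume-id xattr is preserved.
rm -rf /gluster_brick_1/.glusterfs /gluster_brick_1/*

# Bring the node back into the cluster; the brick should start and
# rejoin the replica pair.
service glusterd start

# From either node, trigger a full self-heal of the replicated volume
# and then watch its progress until the entry list is empty.
gluster volume heal gluster-rhev full
gluster volume heal gluster-rhev info
```

With ~2TB in use on a 10Gb network the full heal should indeed complete within hours, though VM I/O on files still being healed may be slow while it runs.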