Re: Advice on recovering from a bad replica please

John Gardeniers <jgardeniers@xxxxxxxxxxxxxxxxx> · Fri, 27 Jun 2014 08:43:20 +1000

Hi Pranith,

We're running 3.4.2. I've attached the report from both servers. I don't
know why there's such a massive difference in the file sizes.

Regards,
John
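
A note for anyone reading such a report: the sketch below is one way to
skim the getfattr dump requested in the quoted message. It assumes the
dump was produced with "-e hex", and that the replicate translator's
changelog xattrs are named trusted.afr.<volume>-client-<N>, where an
all-zero value means nothing is pending against that brick. The function
name pending_afr and the file name in the usage comment are made up for
illustration. It only lists files with a non-zero changelog; deciding
whether one is a real split-brain still means comparing the dumps from
both bricks.

```shell
#!/bin/sh
# pending_afr (hypothetical helper): print each file in a getfattr dump
# whose trusted.afr.* changelog value is not all zeros.
# Output format: <file><TAB><xattr>=<value>
pending_afr() {
    awk '
        # getfattr starts each record with a "# file: <path>" line.
        /^# file: / { file = substr($0, 9); next }
        # Changelog xattrs look like trusted.afr.<vol>-client-<N>=0x...
        /^trusted\.afr\./ {
            split($0, kv, "=")
            hex = kv[2]
            sub(/^0x/, "", hex)
            # Any non-zero hex digit means something is pending.
            if (hex ~ /[1-9a-fA-F]/) print file "\t" kv[1] "=" kv[2]
        }
    ' "$1"
}

# Usage (assumed file name):
#   pending_afr nix-brick-xattrs.txt
```

Running it against the dump from each server and comparing the two lists
shows which files each brick holds pending changes for.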

On 27/06/14 03:32, Pranith Kumar Karampuri wrote:
>
> On 06/26/2014 04:10 AM, John Gardeniers wrote:
>> Hi Pranith,
>>
>> jupiter currently has no gluster processes running.
>> jupiter.om.net:/gluster_backup is a geo-replica.
>>
>> [root@nix ~]# gluster volume info
>>
>> Volume Name: gluster-backup
>> Type: Distribute
>> Volume ID: 0905fb11-f95a-4533-ae1c-05be43a8fe1f
>> Status: Started
>> Number of Bricks: 1
>> Transport-type: tcp
>> Bricks:
>> Brick1: jupiter.om.net:/gluster_backup
>>
>> Volume Name: gluster-rhev
>> Type: Replicate
>> Volume ID: b210cba9-56d3-4e08-a4d0-2f1fe8a46435
>> Status: Started
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: jupiter.om.net:/gluster_brick_1
>> Brick2: nix.om.net:/gluster_brick_1
>> Options Reconfigured:
>> geo-replication.indexing: on
>> nfs.disable: on
>
> I am extremely sorry, I should have asked for this information
> yesterday as well.
> 1) What version of gluster are you using? In 3.4.x there is an issue
> where self-heal won't start while operations are happening on the VM,
> which I believe is not the case in 3.5. So it is important. I
> remembered it only this morning.
>
> 2) I believe the number of files on the bricks should be very small,
> considering it is a rhevm setup. Could you please also attach the
> output of the following, run for each brick:
>
> find <brick-path> | xargs getfattr -d -m. -e hex > file-you-need-to-send-us.txt
>
> This will let us see the xattrs of the files, so we can advise you on
> how to fix the split-brains where necessary.
>
> Pranith
>
>> regards,
>> John
>>
>>
>> On 25/06/14 19:05, Pranith Kumar Karampuri wrote:
>>> On 06/25/2014 04:29 AM, John Gardeniers wrote:
>>>> Hi All,
>>>>
>>>> We're using Gluster as the storage for our virtualization. This
>>>> consists of 2 servers with a single brick each, configured as a
>>>> replica pair. We also have a geo-replica on one of those two servers.
>>>>
>>>> For reasons that don't really matter, last weekend we had a situation
>>>> which caused one server to reboot a number of times, which in turn
>>>> resulted in a lot of heal-failed and split-brain errors. Because VMs
>>>> were being migrated across hosts at the same time, we ended up with
>>>> many crashed VMs.
>>>>
>>>> Due to the need to get the VMs up and running as quickly as possible
>>>> we decided to shut down one Gluster replica and use the "primary" one
>>>> alone. As the geo-replica is also on the node we shut down, that
>>>> leaves us with just a single copy, which makes us rather nervous.
>>>>
>>>> As we have decided to treat the files on the currently running node as
>>>> "correct", I'd appreciate advice on the best way to get the other node
>>>> back into the replication. Should we simply bring it back online and
>>>> try to correct the errors, which I expect will be many, or should we
>>>> treat it as a failed server and bring it back with an empty brick
>>>> rather than what is currently in the existing brick? The
>>>> volume/bricks are 5TB, of which we're currently using around 2TB, and
>>>> the servers are on a 10Gb network, so I imagine it shouldn't take too
>>>> long to rebuild, and this would all be done out of hours anyway.
>>> Considering you say there were split-brain related errors as well, I
>>> suggest you bring up an empty brick.
>>> Could you give the "gluster volume info" output and tell me which
>>> brick went down? Based on that I will tell you what you need to do.
>>>
>>> Pranith
>>>> regards,
>>>> John
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users@xxxxxxxxxxx
>>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>>
>>> ______________________________________________________________________
>>> This email has been scanned by the Symantec Email Security.cloud
>>> service.
>>> For more information please visit http://www.symanteccloud.com
>>> ______________________________________________________________________

Attachment: nix_jupiter.tar.gz
Description: application/gzip