Sure. These files are just a sampling -- a lot of other files are showing the same "split-brain" behaviour.

[14:42:45][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/contact.log
# file: agc/production/log/809223185/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAABQAAAAAAAAAA

[14:45:15][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/contact.log
# file: agc/production/log/809223185/contact.log
trusted.afr.shared-application-data-client-0=0sAAACOwAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA

[14:42:53][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/event.log
# file: agc/production/log/809223185/event.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAADgAAAAAAAAAA

[14:45:24][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223185/event.log
# file: agc/production/log/809223185/event.log
trusted.afr.shared-application-data-client-0=0sAAAGXQAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA

[14:43:02][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223635/contact.log
# file: agc/production/log/809223635/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAACgAAAAAAAAAA

[14:45:28][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809223635/contact.log
# file: agc/production/log/809223635/contact.log
trusted.afr.shared-application-data-client-0=0sAAAELQAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA

[14:43:39][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224061/contact.log
# file: agc/production/log/809224061/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAACQAAAAAAAAAA

[14:45:32][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224061/contact.log
# file: agc/production/log/809224061/contact.log
trusted.afr.shared-application-data-client-0=0sAAAD+AAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA

[14:43:42][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224321/contact.log
# file: agc/production/log/809224321/contact.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAACAAAAAAAAAAA

[14:45:37][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809224321/contact.log
# file: agc/production/log/809224321/contact.log
trusted.afr.shared-application-data-client-0=0sAAAERAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA

[14:43:45][root@web01:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809215319/event.log
# file: agc/production/log/809215319/event.log
trusted.afr.shared-application-data-client-0=0sAAAAAAAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAABwAAAAAAAAAA

[14:45:45][root@web02:/var/glusterfs/bricks/shared]# getfattr -d -m "trusted.afr*" agc/production/log/809215319/event.log
# file: agc/production/log/809215319/event.log
trusted.afr.shared-application-data-client-0=0sAAAC/QAAAAAAAAAA
trusted.afr.shared-application-data-client-1=0sAAAAAAAAAAAAAAAA
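In case it helps, and assuming the usual AFR changelog layout (three big-endian 32-bit counters for data, metadata and entry pending operations), the base64 values that getfattr prints with an "0s" prefix decode to plain counters, e.g.:

# Rough decode of one xattr value from above (the "0s" prefix is just
# getfattr's base64 marker and is dropped here).
echo "AAAABQAAAAAAAAAA" | base64 -d | od -An -tx1
#  00 00 00 05 00 00 00 00 00 00 00 00
#  ^ data = 5   ^ metadata = 0  ^ entry = 0   (big-endian 32-bit counters)

So each brick is reporting non-zero pending data operations against the other brick's copy (client-1 on web01, client-0 on web02); the two copies are effectively blaming each other, which I gather is exactly what AFR flags as split-brain.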
On Wed, May 18, 2011 at 01:31, Pranith Kumar. Karampuri <pranithk at gluster.com> wrote:

> hi Remi,
>     It seems split-brain has been detected on the following files:
> /agc/production/log/809223185/contact.log
> /agc/production/log/809223185/event.log
> /agc/production/log/809223635/contact.log
> /agc/production/log/809224061/contact.log
> /agc/production/log/809224321/contact.log
> /agc/production/log/809215319/event.log
>
> Could you give the output of the following command for each file above on
> both the bricks in the replica pair?
>
> getfattr -d -m "trusted.afr*" <filepath>
>
> Thanks
> Pranith
>
> ----- Original Message -----
> From: "Remi Broemeling" <remi at goclio.com>
> To: gluster-users at gluster.org
> Sent: Tuesday, May 17, 2011 9:02:44 PM
> Subject: Re: Rebuild Distributed/Replicated Setup
>
> Hi Pranith. Sure, here is a pastebin sampling of logs from one of the
> hosts: http://pastebin.com/1U1ziwjC
>
> On Mon, May 16, 2011 at 20:48, Pranith Kumar. Karampuri <pranithk at gluster.com> wrote:
>
> hi Remi,
>     Would it be possible to post the logs from the client, so that we can
> find out what issue you are running into?
>
> Pranith
>
> ----- Original Message -----
> From: "Remi Broemeling" <remi at goclio.com>
> To: gluster-users at gluster.org
> Sent: Monday, May 16, 2011 10:47:33 PM
> Subject: Rebuild Distributed/Replicated Setup
>
> Hi,
>
> I've got a distributed/replicated GlusterFS v3.1.2 (installed via RPM)
> setup across two servers (web01 and web02) with the following vol config:
>
> volume shared-application-data-client-0
>     type protocol/client
>     option remote-host web01
>     option remote-subvolume /var/glusterfs/bricks/shared
>     option transport-type tcp
>     option ping-timeout 5
> end-volume
>
> volume shared-application-data-client-1
>     type protocol/client
>     option remote-host web02
>     option remote-subvolume /var/glusterfs/bricks/shared
>     option transport-type tcp
>     option ping-timeout 5
> end-volume
>
> volume shared-application-data-replicate-0
>     type cluster/replicate
>     subvolumes shared-application-data-client-0 shared-application-data-client-1
> end-volume
>
> volume shared-application-data-write-behind
>     type performance/write-behind
>     subvolumes shared-application-data-replicate-0
> end-volume
>
> volume shared-application-data-read-ahead
>     type performance/read-ahead
>     subvolumes shared-application-data-write-behind
> end-volume
>
> volume shared-application-data-io-cache
>     type performance/io-cache
>     subvolumes shared-application-data-read-ahead
> end-volume
>
> volume shared-application-data-quick-read
>     type performance/quick-read
>     subvolumes shared-application-data-io-cache
> end-volume
>
> volume shared-application-data-stat-prefetch
>     type performance/stat-prefetch
>     subvolumes shared-application-data-quick-read
> end-volume
>
> volume shared-application-data
>     type debug/io-stats
>     subvolumes shared-application-data-stat-prefetch
> end-volume
>
> In total, four servers mount this via GlusterFS FUSE.
> For whatever reason (I'm really not sure why), the GlusterFS filesystem has
> run into a bit of a split-brain nightmare (although to my knowledge an actual
> split-brain situation has never occurred in this environment), and I have
> been consistently seeing corruption issues across the filesystem as well as
> complaints that the filesystem cannot be self-healed.
>
> What I would like to do is completely empty one of the two servers (here I
> am trying to empty server web01), making the other one (in this case web02)
> the authoritative source for the data, and then have web01 completely
> rebuild its mirror directly from web02.
>
> What's the easiest/safest way to do this? Is there a command that I can run
> that will force web01 to re-initialize its mirror directly from web02 (and
> thus completely eradicate all of the split-brain errors and data
> inconsistencies)?
>
> Thanks!
>
> --
> Remi Broemeling
> System Administrator
> Clio - Practice Management Simplified
> 1-888-858-2546 x(2^5) | remi at goclio.com
> www.goclio.com | blog <http://www.goclio.com/blog> | twitter <http://www.twitter.com/goclio> | facebook <http://www.facebook.com/goclio>
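A rough sketch of the manual approach generally described for 3.1.x-era replica pairs, assuming web02 holds the good copies; the /mnt/shared mount point and the quarantine directory below are illustrative, not taken from this thread, and the steps are an outline to verify rather than a tested procedure:

# Rough outline only -- confirm before running against live bricks.
# Assumes web02 has the good copies; /mnt/shared is a client mount of the
# volume and /root/quarantine is an arbitrary holding area.

# 1. On web01, move the stale copy of a split-brain file out of the brick
#    (or clear the whole brick directory if web01 is to be rebuilt from scratch).
mkdir -p /root/quarantine/809223185
mv /var/glusterfs/bricks/shared/agc/production/log/809223185/contact.log \
   /root/quarantine/809223185/

# 2. From any client, look the file up so replicate self-heal recreates it
#    from web02's brick.
stat /mnt/shared/agc/production/log/809223185/contact.log

# 3. To walk the whole volume and trigger self-heal everywhere (the crawl
#    described in the 3.1/3.2 admin guides):
find /mnt/shared -noleaf -print0 | xargs --null stat >/dev/null

GlusterFS 3.1.x has no built-in heal command to do this automatically, so emptying web01's brick and then running the crawl from a client mount is essentially the "rebuild from web02" being asked about here; it is worth confirming the exact steps with the Gluster developers before wiping anything.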