Re: Gluster replicate 3 arbiter 1 in split brain. gluster cli seems unaware

Henrik Juul Pedersen <hjp@xxxxxxx> · Fri, 22 Dec 2017 13:31:22 +0100

Hi Karthik,

Thanks for the info. Maybe the documentation should be updated to
explain the different AFR versions; I know I was confused.

Also, when looking at the changelogs from my three bricks before fixing:

Brick 1:
trusted.afr.virt_images-client-1=0x000002280000000000000000
trusted.afr.virt_images-client-3=0x000000000000000000000000

Brick 2:
trusted.afr.virt_images-client-2=0x000003ef0000000000000000
trusted.afr.virt_images-client-3=0x000000000000000000000000

Brick 3 (arbiter):
trusted.afr.virt_images-client-1=0x000002280000000000000000

I would think that the changelog for client 1 should win by majority
vote? Or how does the self-healing process work?
I assumed this was the correct version, and reset client 2 on brick 2:
# setfattr -n trusted.afr.virt_images-client-2 -v 0x000000000000000000000000 fedora27.qcow2
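
For reading the changelog values listed above: each trusted.afr.* value is 12 bytes, read as three 4-byte big-endian counters of pending data, metadata and entry operations. A minimal shell sketch that splits a value into those counters (decode_afr is a name made up for this example, not a Gluster tool):

```shell
# Decode a 12-byte AFR changelog value, as printed by `getfattr -e hex`,
# into its three 4-byte big-endian counters: data, metadata, entry.
# decode_afr is a hypothetical helper name used only for this illustration.
decode_afr() {
    hex=${1#0x}                          # strip the leading 0x
    if [ "${#hex}" -ne 24 ]; then
        echo "expected 24 hex digits, got ${#hex}" >&2
        return 1
    fi
    data=$(printf '%s' "$hex" | cut -c1-8)
    meta=$(printf '%s' "$hex" | cut -c9-16)
    entry=$(printf '%s' "$hex" | cut -c17-24)
    echo "data=$((0x$data)) metadata=$((0x$meta)) entry=$((0x$entry))"
}

decode_afr 0x000002280000000000000000   # value recorded against client-1
decode_afr 0x000003ef0000000000000000   # value recorded against client-2
```

For the two non-zero values above this prints data=552 and data=1007 with metadata and entry at 0, i.e. only data operations were recorded as pending.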

I then did a directory listing, which might have started a heal, but
heal statistics show (I also did a full heal):
Starting time of crawl: Fri Dec 22 11:34:47 2017

Ending time of crawl: Fri Dec 22 11:34:47 2017

Type of crawl: INDEX
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 1

Starting time of crawl: Fri Dec 22 11:39:29 2017

Ending time of crawl: Fri Dec 22 11:39:29 2017

Type of crawl: FULL
No. of entries healed: 0
No. of entries in split-brain: 0
No. of heal failed entries: 1

I was immediately able to touch the file, so Gluster was okay with
it; however, heal info still showed the file for a while:
# gluster volume heal virt_images info
Brick virt3:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1

Brick virt2:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1

Brick printserver:/data/virt_images/brick
/fedora27.qcow2
Status: Connected
Number of entries: 1

Now heal info shows 0 entries, and the two data bricks have the same
md5sum, so it's back in sync.

I have a few questions after all of this:

1) How can a split brain happen in a replica 3 arbiter 1 setup with
both server- and client quorum enabled?
2) Why was it not able to self-heal, when two bricks seemed in sync
with their changelogs?
3) Why could I not see the file in heal info split-brain?
4) Why could I not fix this through the cli split-brain resolution tool?
5) Is it possible to force a sync in a volume? Or maybe test sync
status? It might be smart to be able to "flush" changes when taking a
brick down for maintenance.
6) How am I supposed to monitor events like this? I have a gluster
volume with ~500,000 files, and I need to be able to guarantee data
integrity and availability to the users.
7) Is glusterfs "production ready"? Because I find it hard to monitor
and thus trust in these setups. Also performance with small / many
files seems horrible at best - but that's for another discussion.
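
For question 6, one possible starting point (a sketch only, not an official monitoring tool) is to total the "Number of entries:" lines that heal info prints per brick and raise an alert when the sum stays non-zero. count_heal_entries is a hypothetical helper name; it only parses text on stdin, in the format of the heal info output shown earlier in this mail.

```shell
# Sum the per-brick "Number of entries:" lines of heal info output on stdin.
# count_heal_entries is a made-up name for this illustration.
# Non-numeric values (for example "-") count as 0.
count_heal_entries() {
    awk -F': *' '/^Number of entries:/ { total += $2 } END { print total + 0 }'
}

# Possible use from cron or a monitoring check (volume name as in this thread):
#   pending=$(gluster volume heal virt_images info | count_heal_entries)
#   [ "$pending" -eq 0 ] || echo "virt_images: $pending entries pending heal"
```

A file that stays listed across several checks would then be a candidate for manual inspection of its xattrs.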

Thanks for all of your help. I'll continue to try and tweak some
performance out of this. :)

Best regards,
Henrik Juul Pedersen
LIAB ApS

On 22 December 2017 at 07:26, Karthik Subrahmanya <ksubrahm@xxxxxxxxxx> wrote:
> Hi Henrik,
>
> Thanks for providing the required outputs. See my replies inline.
>
> On Thu, Dec 21, 2017 at 10:42 PM, Henrik Juul Pedersen <hjp@xxxxxxx> wrote:
>>
>> Hi Karthik and Ben,
>>
>> I'll try and reply to you inline.
>>
>> On 21 December 2017 at 07:18, Karthik Subrahmanya <ksubrahm@xxxxxxxxxx>
>> wrote:
>> > Hey,
>> >
>> > Can you give us the volume info output for this volume?
>>
>> # gluster volume info virt_images
>>
>> Volume Name: virt_images
>> Type: Replicate
>> Volume ID: 9f3c8273-4d9d-4af2-a4e7-4cb4a51e3594
>> Status: Started
>> Snapshot Count: 2
>> Number of Bricks: 1 x (2 + 1) = 3
>> Transport-type: tcp
>> Bricks:
>> Brick1: virt3:/data/virt_images/brick
>> Brick2: virt2:/data/virt_images/brick
>> Brick3: printserver:/data/virt_images/brick (arbiter)
>> Options Reconfigured:
>> features.quota-deem-statfs: on
>> features.inode-quota: on
>> features.quota: on
>> features.barrier: disable
>> features.scrub: Active
>> features.bitrot: on
>> nfs.rpc-auth-allow: on
>> server.allow-insecure: on
>> user.cifs: off
>> features.shard: off
>> cluster.shd-wait-qlength: 10000
>> cluster.locking-scheme: granular
>> cluster.data-self-heal-algorithm: full
>> cluster.server-quorum-type: server
>> cluster.quorum-type: auto
>> cluster.eager-lock: enable
>> network.remote-dio: enable
>> performance.low-prio-threads: 32
>> performance.io-cache: off
>> performance.read-ahead: off
>> performance.quick-read: off
>> nfs.disable: on
>> transport.address-family: inet
>> server.outstanding-rpc-limit: 512
>>
>> > Why are you not able to get the xattrs from arbiter brick? It is the
>> > same
>> > way as you do it on data bricks.
>>
>> Yes I must have confused myself yesterday somehow, here it is in full
>> from all three bricks:
>>
>> Brick 1 (virt2): # getfattr -d -m . -e hex fedora27.qcow2
>> # file: fedora27.qcow2
>> trusted.afr.dirty=0x000000000000000000000000
>> trusted.afr.virt_images-client-1=0x000002280000000000000000
>> trusted.afr.virt_images-client-3=0x000000000000000000000000
>> trusted.bit-rot.version=0x1d000000000000005a3aa0db000c6563
>> trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
>>
>> trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
>>
>> trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri.1=0x00000000a49eb0000000000000000001
>> trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
>>
>> Brick 2 (virt3): # getfattr -d -m . -e hex fedora27.qcow2
>> # file: fedora27.qcow2
>> trusted.afr.dirty=0x000000000000000000000000
>> trusted.afr.virt_images-client-2=0x000003ef0000000000000000
>> trusted.afr.virt_images-client-3=0x000000000000000000000000
>> trusted.bit-rot.version=0x19000000000000005a3a9f82000c382a
>> trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
>>
>> trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
>>
>> trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri.1=0x00000000a2fbe0000000000000000001
>> trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
>>
>> Brick 3 - arbiter (printserver): # getfattr -d -m . -e hex fedora27.qcow2
>> # file: fedora27.qcow2
>> trusted.afr.dirty=0x000000000000000000000000
>> trusted.afr.virt_images-client-1=0x000002280000000000000000
>> trusted.bit-rot.version=0x31000000000000005a39237200073206
>> trusted.gfid=0x7a36937d52fc4b55a93299e2328f02ba
>>
>> trusted.gfid2path.c076c6ac27a43012=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f6665646f726132372e71636f7732
>>
>> trusted.glusterfs.quota.00000000-0000-0000-0000-000000000001.contri.1=0x00000000000000000000000000000001
>> trusted.pgfid.00000000-0000-0000-0000-000000000001=0x00000001
>>
>> I was expecting trusted.afr.virt_images-client-{1,2,3} on all bricks?
>
> From AFR-V2 onwards we do not have self-blaming attrs, so you will see a
> brick blaming other bricks only.
> For example, brick1 can blame brick2 and brick3, but not itself.
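
To make the blaming rule concrete, here is a small shell sketch that lists which client each brick blames, given lines of the form "<brick> <xattr>=<value>". It treats any non-zero trusted.afr.*-client-N value as "this brick blames client-N". list_blames and the input format are assumptions made for this example, and this is a simplification, not the actual AFR source-selection logic.

```shell
# Read "<brick-label> trusted.afr.<vol>-client-<N>=0x<hex>" lines on stdin and
# print "<brick-label> blames client-<N>" for every non-zero changelog value.
# list_blames is a hypothetical helper; the input format is invented here.
list_blames() {
    awk 'NF >= 2 && $2 ~ /^trusted\.afr\..*-client-[0-9]+=/ {
        split($2, kv, "=")
        name = kv[1]; sub(/^.*-client-/, "client-", name)
        value = kv[2]; sub(/^0x/, "", value)
        if (value ~ /[1-9a-fA-F]/) print $1 " blames " name
    }'
}

list_blames <<'EOF'
virt2 trusted.afr.virt_images-client-1=0x000002280000000000000000
virt2 trusted.afr.virt_images-client-3=0x000000000000000000000000
virt3 trusted.afr.virt_images-client-2=0x000003ef0000000000000000
virt3 trusted.afr.virt_images-client-3=0x000000000000000000000000
printserver trusted.afr.virt_images-client-1=0x000002280000000000000000
EOF
```

With the xattrs quoted above this prints that virt2 and printserver blame client-1 and that virt3 blames client-2. If, as the missing self-blaming attrs suggest, virt2 is client-2 and virt3 is client-1, the two data bricks blame each other.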
>>
>>
>> > The changelog xattrs are named trusted.afr.virt_images-client-{1,2,3} in
>> > the
>> > getxattr outputs you have provided.
>> > Did you do a remove-brick and add-brick any time? Otherwise it will be
>> > trusted.afr.virt_images-client-{0,1,2} usually.
>>
>> Yes, the bricks were moved around initially; brick 0 was re-created as
>> brick 2, and the arbiter was added later on as well.
>>
>> >
>> > To overcome this scenario you can do what Ben Turner had suggested.
>> > Select
>> > the source copy and change the xattrs manually.
>>
>> I won't mind doing that, but again, the guides assume that I have
>> trusted.afr.virt_images-client-{1,2,3} on all bricks, so I'm not sure
>> what to change to what, where.
>>
>>
>> > I suspect that it has hit the bug where the arbiter becomes the source
>> > for data heal. But to confirm that, we need the xattrs on the arbiter
>> > brick also.
>> >
>> > Regards,
>> > Karthik
>> >
>> >
>> > On Thu, Dec 21, 2017 at 9:55 AM, Ben Turner <bturner@xxxxxxxxxx> wrote:
>> >>
>> >> Here is the process for resolving split brain on replica 2:
>> >>
>> >>
>> >>
>> >> https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.1/html/Administration_Guide/Recovering_from_File_Split-brain.html
>> >>
>> >> It should be pretty much the same for replica 3, you change the xattrs
>> >> with something like:
>> >>
>> >> # setfattr -n trusted.afr.vol-client-0 -v 0x000000000000000100000000
>> >> /gfs/brick-b/a
>> >>
>> >> When I try to decide which copy to use I normally run things like:
>> >>
>> >> # stat /<path to brick>/path/to/file
>> >>
>> >> Check out the access and change times of the file on the back end
>> >> bricks.
>> >> I normally pick the copy with the latest access / change times.  I'll
>> >> also
>> >> check:
>> >>
>> >> # md5sum /<path to brick>/path/to/file
>> >>
>> >> Compare the hashes of the file on both bricks to see if the data
>> >> actually
>> >> differs.  If the data is the same it makes choosing the proper replica
>> >> easier.
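
The two checks described above (timestamps and checksums) can be sketched as shell helpers. This assumes GNU coreutils (stat -c, md5sum) and that both copies are readable from one machine, for example after copying them aside. same_data and newest_copy are names invented for this example, and the sketch compares modification time (%Y); use %Z instead to compare change time.

```shell
# Succeed if two files have identical MD5 checksums.
same_data() {
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}

# Print whichever of the given files has the most recent modification time.
# Uses GNU stat; %Y is mtime in seconds since the epoch.
newest_copy() {
    newest=''
    newest_mtime=-1
    for f in "$@"; do
        mtime=$(stat -c %Y "$f") || return 1
        if [ "$mtime" -gt "$newest_mtime" ]; then
            newest=$f
            newest_mtime=$mtime
        fi
    done
    printf '%s\n' "$newest"
}

# Possible use (paths are placeholders):
#   same_data /copies/brick1/file /copies/brick2/file && echo "data identical"
#   newest_copy /copies/brick1/file /copies/brick2/file
```

If same_data succeeds, either copy can serve as the source; otherwise newest_copy only gives a hint, and the final choice still needs judgement.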
>>
>> The files on the bricks differ, so there was something changed, and
>> not replicated.
>>
>> Thanks for the input, I've looked at that, but couldn't get it to fit,
>> as I don't have trusted.afr.virt_images-client-{1,2,3} on all bricks.
>
> You can choose either copy as the good one, based on the latest ctime/mtime.
> Before doing anything, keep a backup of both copies, so that your data is
> safe if something goes wrong.
> Now choose one copy as good (based on timestamps, size, or by choosing a
> brick as the source), and reset the xattrs set for it on the other brick.
> Then do a lookup on that file from the mount.
> That should resolve the issue.
> Once you are done, please let us know the result.
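
The steps above can be written down as a dry run that only prints the commands, so they can be reviewed before anything is touched. This is a sketch under assumptions: print_resolution_plan is a made-up name, every path and xattr name passed in is a placeholder, and the commands are meant to be run by hand on the server that holds the brick in question.

```shell
# Dry run: print the manual resolution steps for one file, do not execute them.
# All arguments are placeholders supplied by the operator.
print_resolution_plan() {
    good_copy=$1     # file on the brick chosen as the source
    bad_copy=$2      # same file on the brick that the heal will overwrite
    blame_xattr=$3   # changelog xattr on the bad brick that blames the good one
    mount_file=$4    # same file as seen through a client mount
    backup_dir=$5    # directory on each server for the safety copies

    printf '%s\n' \
        "cp -a '$good_copy' '$backup_dir/'   # on the good brick's server" \
        "cp -a '$bad_copy' '$backup_dir/'   # on the bad brick's server" \
        "setfattr -n $blame_xattr -v 0x000000000000000000000000 '$bad_copy'" \
        "stat '$mount_file'   # lookup from a client mount"
}

# Example with hypothetical paths:
#   print_resolution_plan /data/vol/brick/f /data/vol/brick/f \
#       trusted.afr.vol-client-1 /mnt/vol/f /root/split-brain-backup
```

Printing rather than executing keeps the choice of source copy, and the backup, as explicit manual steps.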
>
> Regards,
> Karthik
>>
>>
>> >>
>> >> Any idea how you got in this situation?  Did you have a loss of NW
>> >> connectivity?  I see you are using server side quorum, maybe check the
>> >> logs
>> >> for any loss of quorum?  I wonder if there was a loss of quorum and
>> >> there
>> >> was some sort of race condition hit:
>> >>
>> >>
>> >>
>> >> http://docs.gluster.org/en/latest/Administrator%20Guide/arbiter-volumes-and-quorum/#server-quorum-and-some-pitfalls
>> >>
>> >> "Unlike in client-quorum where the volume becomes read-only when quorum
>> >> is
>> >> lost, loss of server-quorum in a particular node makes glusterd kill
>> >> the
>> >> brick processes on that node (for the participating volumes) making
>> >> even
>> >> reads impossible."
>>
>> I might have had a loss of server quorum, but I can't seem to see
>> exactly why or when from the logs:
>>
>> Times are synchronized between servers. Virt 3 was rebooted for
>> service at 17:29:39. The shutdown logs show an issue with unmounting
>> the bricks, probably because glusterd was still running:
>> Dec 20 17:29:39 virt3 systemd[1]: Failed unmounting /data/virt_images.
>> Dec 20 17:29:39 virt3 systemd[1]: data-filserver.mount: Mount process
>> exited, code=exited status=32
>> Dec 20 17:29:39 virt3 systemd[1]: Failed unmounting /data/filserver.
>> Dec 20 17:29:39 virt3 systemd[1]: Unmounted /virt_images.
>> Dec 20 17:29:39 virt3 systemd[1]: Stopped target Network is Online.
>> Dec 20 17:29:39 virt3 systemd[1]: Stopping GlusterFS, a clustered
>> file-system server...
>> Dec 20 17:29:39 virt3 systemd[1]: Stopping Network Name Resolution...
>> Dec 20 17:29:39 virt3 systemd[1]: Stopped GlusterFS, a clustered
>> file-system server.
>>
>> I believe it was around this time that the virtual machine (running on
>> virt2) was stopped by qemu.
>>
>>
>> Brick 1 (virt2) only experienced loss of quorum when starting gluster
>> (glusterd.log confirms this):
>> Dec 20 17:22:03 virt2 systemd[1]: Starting GlusterFS, a clustered
>> file-system server...
>> Dec 20 17:22:05 virt2 glusterd[739]: [2017-12-20 16:22:05.997472] C
>> [MSGID: 106002]
>> [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum lost for volume filserver. Stopping local
>> bricks.
>> Dec 20 17:22:05 virt2 glusterd[739]: [2017-12-20 16:22:05.997666] C
>> [MSGID: 106002]
>> [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum lost for volume virt_images. Stopping
>> local bricks.
>> Dec 20 17:22:06 virt2 systemd[1]: Started GlusterFS, a clustered
>> file-system server.
>> Dec 20 17:22:11 virt2 glusterd[739]: [2017-12-20 16:22:11.387238] C
>> [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume filserver. Starting
>> local bricks.
>> Dec 20 17:22:11 virt2 glusterd[739]: [2017-12-20 16:22:11.390417] C
>> [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume virt_images. Starting
>> local bricks.
>> -- Reboot --
>> Dec 20 18:41:35 virt2 systemd[1]: Starting GlusterFS, a clustered
>> file-system server...
>> Dec 20 18:41:41 virt2 systemd[1]: Started GlusterFS, a clustered
>> file-system server.
>> Dec 20 18:41:43 virt2 glusterd[748]: [2017-12-20 17:41:43.387633] C
>> [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume filserver. Starting
>> local bricks.
>> Dec 20 18:41:43 virt2 glusterd[748]: [2017-12-20 17:41:43.391080] C
>> [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume virt_images. Starting
>> local bricks.
>>
>>
>> Brick 2 (virt3) shows a network outage on the 19th, but everything
>> worked fine afterwards:
>> Dec 19 13:11:34 virt3 glusterd[10058]: [2017-12-19 12:11:34.382207] C
>> [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume filserver. Starting
>> local bricks.
>> Dec 19 13:11:34 virt3 glusterd[10058]: [2017-12-19 12:11:34.387324] C
>> [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume virt_images. Starting
>> local bricks.
>> Dec 20 17:29:39 virt3 systemd[1]: Stopping GlusterFS, a clustered
>> file-system server...
>> Dec 20 17:29:39 virt3 systemd[1]: Stopped GlusterFS, a clustered
>> file-system server.
>> -- Reboot --
>> Dec 20 17:30:21 virt3 systemd[1]: Starting GlusterFS, a clustered
>> file-system server...
>> Dec 20 17:30:22 virt3 glusterd[394]: [2017-12-20 16:30:22.826828] C
>> [MSGID: 106002]
>> [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum lost for volume filserver. Stopping local
>> bricks.
>> Dec 20 17:30:22 virt3 glusterd[394]: [2017-12-20 16:30:22.827188] C
>> [MSGID: 106002]
>> [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum lost for volume virt_images. Stopping
>> local bricks.
>> Dec 20 17:30:23 virt3 systemd[1]: Started GlusterFS, a clustered
>> file-system server.
>> Dec 20 17:30:29 virt3 glusterd[394]: [2017-12-20 16:30:29.488000] C
>> [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume filserver. Starting
>> local bricks.
>> Dec 20 17:30:29 virt3 glusterd[394]: [2017-12-20 16:30:29.491446] C
>> [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume virt_images. Starting
>> local bricks.
>> Dec 20 18:31:06 virt3 systemd[1]: Stopping GlusterFS, a clustered
>> file-system server...
>> Dec 20 18:31:06 virt3 systemd[1]: Stopped GlusterFS, a clustered
>> file-system server.
>> -- Reboot --
>> Dec 20 18:31:46 virt3 systemd[1]: Starting GlusterFS, a clustered
>> file-system server...
>> Dec 20 18:31:46 virt3 glusterd[386]: [2017-12-20 17:31:46.958818] C
>> [MSGID: 106002]
>> [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum lost for volume filserver. Stopping local
>> bricks.
>> Dec 20 18:31:46 virt3 glusterd[386]: [2017-12-20 17:31:46.959168] C
>> [MSGID: 106002]
>> [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum lost for volume virt_images. Stopping
>> local bricks.
>> Dec 20 18:31:47 virt3 systemd[1]: Started GlusterFS, a clustered
>> file-system server.
>> Dec 20 18:33:10 virt3 glusterd[386]: [2017-12-20 17:33:10.156180] C
>> [MSGID: 106001]
>> [glusterd-volume-ops.c:1534:glusterd_op_stage_start_volume]
>> 0-management: Server quorum not met. Rejecting operation.
>> Dec 20 18:35:58 virt3 glusterd[386]: [2017-12-20 17:35:58.440395] C
>> [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume filserver. Starting
>> local bricks.
>> Dec 20 18:35:58 virt3 glusterd[386]: [2017-12-20 17:35:58.446203] C
>> [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume virt_images. Starting
>> local bricks.
>>
>> Brick 3 - arbiter (printserver) shows no loss of quorum at that time
>> (again, glusterd.log confirms):
>> Dec 19 15:33:24 printserver systemd[1]: Starting GlusterFS, a
>> clustered file-system server...
>> Dec 19 15:33:26 printserver glusterd[306]: [2017-12-19
>> 14:33:26.432369] C [MSGID: 106002]
>> [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum lost for volume filserver. Stopping local
>> bricks.
>> Dec 19 15:33:26 printserver glusterd[306]: [2017-12-19
>> 14:33:26.432606] C [MSGID: 106002]
>> [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum lost for volume virt_images. Stopping
>> local bricks.
>> Dec 19 15:33:26 printserver systemd[1]: Started GlusterFS, a clustered
>> file-system server.
>> Dec 19 15:34:18 printserver glusterd[306]: [2017-12-19
>> 14:34:18.158756] C [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume filserver. Starting
>> local bricks.
>> Dec 19 15:34:18 printserver glusterd[306]: [2017-12-19
>> 14:34:18.162242] C [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume virt_images. Starting
>> local bricks.
>> Dec 20 18:28:52 printserver systemd[1]: Stopping GlusterFS, a
>> clustered file-system server...
>> Dec 20 18:28:52 printserver systemd[1]: Stopped GlusterFS, a clustered
>> file-system server.
>> -- Reboot --
>> Dec 20 18:30:40 printserver systemd[1]: Starting GlusterFS, a
>> clustered file-system server...
>> Dec 20 18:30:42 printserver glusterd[278]: [2017-12-20
>> 17:30:42.441675] C [MSGID: 106002]
>> [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum lost for volume filserver. Stopping local
>> bricks.
>> Dec 20 18:30:42 printserver glusterd[278]: [2017-12-20
>> 17:30:42.441929] C [MSGID: 106002]
>> [glusterd-server-quorum.c:355:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum lost for volume virt_images. Stopping
>> local bricks.
>> Dec 20 18:30:42 printserver systemd[1]: Started GlusterFS, a clustered
>> file-system server.
>> Dec 20 18:33:49 printserver glusterd[278]: [2017-12-20
>> 17:33:49.005534] C [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume filserver. Starting
>> local bricks.
>> Dec 20 18:33:49 printserver glusterd[278]: [2017-12-20
>> 17:33:49.008010] C [MSGID: 106003]
>> [glusterd-server-quorum.c:349:glusterd_do_volume_quorum_action]
>> 0-management: Server quorum regained for volume virt_images. Starting
>> local bricks.
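
To line up the quorum events from the three servers, the relevant log lines can be reduced to "timestamp volume event". The sketch below assumes each log entry is on a single line (they are only wrapped here by the mail client) and uses the bracketed glusterd timestamp, which in the excerpts above is one hour behind the journal timestamp. quorum_events is a name invented for this example.

```shell
# Reduce glusterd "Server quorum lost/regained" log lines on stdin to
# "<date> <time> <volume> <lost|regained>". Other lines are dropped.
# quorum_events is a hypothetical helper name.
quorum_events() {
    grep -E 'Server quorum (lost|regained) for volume' |
        sed -E 's/^.*\[([0-9-]+ [0-9:.]+)\].*Server quorum (lost|regained) for volume ([^.]+)\..*$/\1 \3 \2/'
}

# Possible use on each server, then merge and sort the results:
#   journalctl -u glusterd | quorum_events | sort
```

Sorting the merged output of all three servers gives a single timeline, which makes it easier to see whether two bricks were ever without quorum at the same moment.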
>>
>> >>
>> >> I wonder if the killing of brick processes could have led to some sort
>> >> of
>> >> race condition where writes were serviced on one brick / the arbiter
>> >> and not
>> >> the other?
>> >>
>> >> If you can find a reproducer for this please open a BZ with it, I have
>> >> been seeing something similar (I think) but I haven't been able to run
>> >> the
>> >> issue down yet.
>> >>
>> >> -b
>>
>> I'm not sure if I can replicate this; a lot has been going on in my
>> setup the past few days (trying to tune some horrible small-file and
>> file creation/deletion performance).
>>
>> Thanks for looking into this with me.
>>
>> Best regards,
>> Henrik Juul Pedersen
>> LIAB ApS
>
>
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users