After Node 1 is DOWN, LIO on Node 2 (the iSCSI target) is no longer
writing to the local Gluster mount, but to the root partition, even
though "df -h" still shows the brick and the Gluster mount:

/dev/mapper/centos-root  3,1G  3,1G   20K 100% /
...
/dev/xvdb                 61G   61G  956M  99% /bricks/brick1
localhost:/gv0            61G   61G  956M  99% /mnt

If I unmount it, I still see the "block.img" in /mnt, which is filling
up the root space. So it's as if FUSE is messing with the local Gluster
mount, which could lead to data corruption at the client level. It
doesn't make sense to me... What am I missing?
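
(A quick way to double-check whether /mnt is really the FUSE mount or
just the bare directory underneath it - nothing Gluster-specific, only
standard util-linux/coreutils tools:

# findmnt /mnt
(prints nothing, and exits non-zero, if /mnt is currently a plain
directory rather than a mount point)

# df -T /mnt/block.img
(the Type column should say "fuse.glusterfs"; if it shows the root
filesystem instead, LIO is writing into the bare directory)

If the mount went away at some point and LIO re-opened the backing file
afterwards, the writes would land on the root partition underneath the
mount point, which would match what I'm seeing.)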

On Fri, Nov 18, 2016 at 5:00 PM, Olivier Lambert
<lambert.olivier@xxxxxxxxx> wrote:
> Yes, I only proceeded once heal info showed the previous result
> ("Number of entries: 0"). But same result: as soon as the second node
> goes offline (after they were both working/back online), everything
> is corrupted.
>
> To recap:
>
> * Node 1 UP, Node 2 UP -> OK
> * Node 1 UP, Node 2 DOWN -> OK (just a small lag while multipath sees
>   the path go down and switches over if necessary)
> * Node 1 UP, Node 2 UP -> OK (after waiting until the heal command
>   displays no entries - see the sketch below)
> * Node 1 DOWN, Node 2 UP -> NOT OK (data corruption)
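>
> (The "waiting" step is just polling heal info until every brick
> reports zero pending entries - a minimal sketch, assuming the volume
> is named gv0:
>
> until gluster volume heal gv0 info |
>       awk '/Number of entries:/ { if ($4 != 0) exit 1 }'
> do
>     sleep 5
> done
>
> It only looks at the pending-heal counters, not the logs, but it
> avoids taking the next node down while a heal is still in flight.)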
>
> On Fri, Nov 18, 2016 at 3:39 PM, David Gossage
> <dgossage@xxxxxxxxxxxxxxxxxx> wrote:
>> On Fri, Nov 18, 2016 at 3:49 AM, Olivier Lambert
>> <lambert.olivier@xxxxxxxxx> wrote:
>>>
>>> Hi David,
>>>
>>> What are the exact commands to be sure it's fine?
>>>
>>> Right now I get:
>>>
>>> # gluster volume heal gv0 info
>>> Brick 10.0.0.1:/bricks/brick1/gv0
>>> Status: Connected
>>> Number of entries: 0
>>>
>>> Brick 10.0.0.2:/bricks/brick1/gv0
>>> Status: Connected
>>> Number of entries: 0
>>>
>>> Brick 10.0.0.3:/bricks/brick1/gv0
>>> Status: Connected
>>> Number of entries: 0
>>>
>>
>> Did you run this before taking down the 2nd node, to see if any heals
>> were ongoing?
>>
>> Also, I see you have sharding enabled. Are your files already being
>> served sharded as well?
>>
>>>
>>> Everything is online and working, but this command gives strange
>>> output:
>>>
>>> # gluster volume heal gv0 info heal-failed
>>> Gathering list of heal failed entries on volume gv0 has been
>>> unsuccessful on bricks that are down. Please check if all brick
>>> processes are running.
>>>
>>> Is it normal?
>>
>> I don't think that is a valid command anymore; when I run it I get
>> the same message, and this is in the logs:
>>
>> [2016-11-18 14:35:02.260503] I [MSGID: 106533]
>> [glusterd-volume-ops.c:878:__glusterd_handle_cli_heal_volume]
>> 0-management: Received heal vol req for volume GLUSTER1
>> [2016-11-18 14:35:02.263341] W [MSGID: 106530]
>> [glusterd-volume-ops.c:1882:glusterd_handle_heal_cmd] 0-management:
>> Command not supported. Please use "gluster volume heal GLUSTER1 info"
>> and logs to find the heal information.
>> [2016-11-18 14:35:02.263365] E [MSGID: 106301]
>> [glusterd-syncop.c:1297:gd_stage_op_phase] 0-management: Staging of
>> operation 'Volume Heal' failed on localhost : Command not supported.
>> Please use "gluster volume heal GLUSTER1 info" and logs to find the
>> heal information.
>>
>>>
>>> On Fri, Nov 18, 2016 at 2:51 AM, David Gossage
>>> <dgossage@xxxxxxxxxxxxxxxxxx> wrote:
>>> >
>>> > On Thu, Nov 17, 2016 at 6:42 PM, Olivier Lambert
>>> > <lambert.olivier@xxxxxxxxx> wrote:
>>> >>
>>> >> Okay, I used the exact same config you provided, adding an
>>> >> arbiter node (node3).
>>> >>
>>> >> After halting node2, the VM continued to work after a small
>>> >> "lag"/freeze. I restarted node2 and it was back online: OK
>>> >>
>>> >> Then, after waiting a few minutes, I halted node1. And **just**
>>> >> at this moment, the VM was corrupted (segmentation fault,
>>> >> /var/log folder empty, etc.)
>>> >>
>>> > Other than waiting a few minutes, did you make sure heals had
>>> > completed?
>>> >
>>> >>
>>> >> dmesg of the VM:
>>> >>
>>> >> [ 1645.852905] EXT4-fs error (device xvda1):
>>> >> htree_dirblock_to_tree:988: inode #19: block 8286: comm bash: bad
>>> >> entry in directory: rec_len is smaller than minimal - offset=0(0),
>>> >> inode=0, rec_len=0, name_len=0
>>> >> [ 1645.854509] Aborting journal on device xvda1-8.
>>> >> [ 1645.855524] EXT4-fs (xvda1): Remounting filesystem read-only
>>> >>
>>> >> And I got a lot of "comm bash: bad entry in directory" messages
>>> >> after that...
>>> >>
>>> >> Here is the current config with all nodes back online:
>>> >>
>>> >> # gluster volume info
>>> >>
>>> >> Volume Name: gv0
>>> >> Type: Replicate
>>> >> Volume ID: 5f15c919-57e3-4648-b20a-395d9fe3d7d6
>>> >> Status: Started
>>> >> Snapshot Count: 0
>>> >> Number of Bricks: 1 x (2 + 1) = 3
>>> >> Transport-type: tcp
>>> >> Bricks:
>>> >> Brick1: 10.0.0.1:/bricks/brick1/gv0
>>> >> Brick2: 10.0.0.2:/bricks/brick1/gv0
>>> >> Brick3: 10.0.0.3:/bricks/brick1/gv0 (arbiter)
>>> >> Options Reconfigured:
>>> >> nfs.disable: on
>>> >> performance.readdir-ahead: on
>>> >> transport.address-family: inet
>>> >> features.shard: on
>>> >> features.shard-block-size: 16MB
>>> >> network.remote-dio: enable
>>> >> cluster.eager-lock: enable
>>> >> performance.io-cache: off
>>> >> performance.read-ahead: off
>>> >> performance.quick-read: off
>>> >> performance.stat-prefetch: on
>>> >> performance.strict-write-ordering: off
>>> >> cluster.server-quorum-type: server
>>> >> cluster.quorum-type: auto
>>> >> cluster.data-self-heal: on
>>> >>
>>> >> # gluster volume status
>>> >> Status of volume: gv0
>>> >> Gluster process                     TCP Port  RDMA Port  Online  Pid
>>> >> ------------------------------------------------------------------------------
>>> >> Brick 10.0.0.1:/bricks/brick1/gv0   49152     0          Y       1331
>>> >> Brick 10.0.0.2:/bricks/brick1/gv0   49152     0          Y       2274
>>> >> Brick 10.0.0.3:/bricks/brick1/gv0   49152     0          Y       2355
>>> >> Self-heal Daemon on localhost       N/A       N/A        Y       2300
>>> >> Self-heal Daemon on 10.0.0.3        N/A       N/A        Y       10530
>>> >> Self-heal Daemon on 10.0.0.2        N/A       N/A        Y       2425
>>> >>
>>> >> Task Status of Volume gv0
>>> >> ------------------------------------------------------------------------------
>>> >> There are no active volume tasks
>>> >>
>>> >>
>>> >> On Thu, Nov 17, 2016 at 11:35 PM, Olivier Lambert
>>> >> <lambert.olivier@xxxxxxxxx> wrote:
>>> >> > It's planned to have an arbiter soon :) These were just
>>> >> > preliminary tests.
>>> >> >
>>> >> > Thanks for the settings, I'll test this soon and come back to
>>> >> > you!
>>> >> >
>>> >> > On Thu, Nov 17, 2016 at 11:29 PM, Lindsay Mathieson
>>> >> > <lindsay.mathieson@xxxxxxxxx> wrote:
>>> >> >> On 18/11/2016 8:17 AM, Olivier Lambert wrote:
>>> >> >>>
>>> >> >>> gluster volume info gv0
>>> >> >>>
>>> >> >>> Volume Name: gv0
>>> >> >>> Type: Replicate
>>> >> >>> Volume ID: 2f8658ed-0d9d-4a6f-a00b-96e9d3470b53
>>> >> >>> Status: Started
>>> >> >>> Snapshot Count: 0
>>> >> >>> Number of Bricks: 1 x 2 = 2
>>> >> >>> Transport-type: tcp
>>> >> >>> Bricks:
>>> >> >>> Brick1: 10.0.0.1:/bricks/brick1/gv0
>>> >> >>> Brick2: 10.0.0.2:/bricks/brick1/gv0
>>> >> >>> Options Reconfigured:
>>> >> >>> nfs.disable: on
>>> >> >>> performance.readdir-ahead: on
>>> >> >>> transport.address-family: inet
>>> >> >>> features.shard: on
>>> >> >>> features.shard-block-size: 16MB
>>> >> >>
>>> >> >> When hosting VMs it's essential to set these options:
>>> >> >>
>>> >> >> network.remote-dio: enable
>>> >> >> cluster.eager-lock: enable
>>> >> >> performance.io-cache: off
>>> >> >> performance.read-ahead: off
>>> >> >> performance.quick-read: off
>>> >> >> performance.stat-prefetch: on
>>> >> >> performance.strict-write-ordering: off
>>> >> >> cluster.server-quorum-type: server
>>> >> >> cluster.quorum-type: auto
>>> >> >> cluster.data-self-heal: on
>>> >> >>
>>> >> >> Also, with replica two and quorum on (required), your volume
>>> >> >> will become read-only when one node goes down, to prevent the
>>> >> >> possibility of split-brain - you *really* want to avoid
>>> >> >> that :)
>>> >> >>
>>> >> >> I'd recommend a replica 3 volume; that way one node can go
>>> >> >> down, but the other two still form a quorum and the volume
>>> >> >> will remain r/w.
>>> >> >>
>>> >> >> If the extra disks are not possible, then an arbiter volume
>>> >> >> can be set up - basically dummy files on the third node.
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Lindsay Mathieson

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users