Re: corruption using gluster and iSCSI with LIO

If it's writing to the root partition then the mount went away. Any clues in the gluster client log?
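A quick check is a minimal sketch like this (the log path is an assumption based on GlusterFS defaults, where the FUSE client for a mount at /mnt typically logs to /var/log/glusterfs/mnt.log; adjust for your setup):

# Look for disconnects or a dropped FUSE mount in the client log (assumed default path).
grep -iE 'disconnect|unmount|transport endpoint' /var/log/glusterfs/mnt.log | tail -n 20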

On 11/18/2016 08:21 AM, Olivier Lambert wrote:
After Node 1 is DOWN, LIO on Node 2 (the iSCSI target) is no longer
writing to the local Gluster mount, but to the root partition.

Despite "df -h" shows the Gluster brick mounted:

/dev/mapper/centos-root   3,1G    3,1G   20K 100% /
...
/dev/xvdb                  61G     61G  956M  99% /bricks/brick1
localhost:/gv0             61G     61G  956M  99% /mnt

If I unmount it, I still see the "block.img" in /mnt, which is filling
up the root space. So it's as if FUSE is messing up the local Gluster
mount, which could lead to the data corruption at the client level.

It doesn't make sense to me... What am I missing?
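
One rough way to double-check where the writes are going on Node 2 (the commands below are a sketch, not something already run in this thread):

# Is /mnt still a mount point, or are writes landing on the root filesystem underneath it?
mountpoint -q /mnt || echo "/mnt is NOT mounted -- LIO is writing into the root partition"

# If it is mounted, confirm it is still the Gluster FUSE client:
findmnt -T /mnt -o TARGET,SOURCE,FSTYPE    # expect SOURCE=localhost:/gv0, FSTYPE=fuse.glusterfs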

On Fri, Nov 18, 2016 at 5:00 PM, Olivier Lambert
<lambert.olivier@xxxxxxxxx> wrote:
Yes, I only did it once heal info gave the previous result ("Number
of entries: 0"). But same result: as soon as the second node is
offline (after they were both back online and working), everything is
corrupted.

To recap:

* Node 1 UP, Node 2 UP -> OK
* Node 1 UP, Node 2 DOWN -> OK (just a small lag for multipath to see
the path go down and switch if necessary)
* Node 1 UP, Node 2 UP -> OK (after waiting for the heal command to
show no entries; see the sketch below)
* Node 1 DOWN, Node 2 UP -> NOT OK (data corruption)
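
The sketch below is roughly what that waiting step looks like (just an illustration of the check, not an exact script from this setup):

# Wait until "gluster volume heal gv0 info" reports zero pending entries on every brick.
while gluster volume heal gv0 info | grep -q '^Number of entries: [1-9]'; do
    echo "heals still pending, waiting..."
    sleep 10
done
echo "no pending heals -- safe to take the next node down"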

On Fri, Nov 18, 2016 at 3:39 PM, David Gossage
<dgossage@xxxxxxxxxxxxxxxxxx> wrote:
On Fri, Nov 18, 2016 at 3:49 AM, Olivier Lambert <lambert.olivier@xxxxxxxxx>
wrote:
Hi David,

What are the exact commands to be sure it's fine?

Right now I got:

# gluster volume heal gv0 info
Brick 10.0.0.1:/bricks/brick1/gv0
Status: Connected
Number of entries: 0

Brick 10.0.0.2:/bricks/brick1/gv0
Status: Connected
Number of entries: 0

Brick 10.0.0.3:/bricks/brick1/gv0
Status: Connected
Number of entries: 0


Did you run this before taking down the 2nd node, to see if any heals
were ongoing?

Also, I see you have sharding enabled.  Are your files already being
served sharded as well?
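
A rough way to check (brick path from your earlier mails; the .shard layout described here is the usual Gluster sharding behaviour, so treat it as a sketch):

# The base file sits at the brick root; everything past the first shard-block-size
# (16MB in your config) should appear as <gfid>.N files under the hidden .shard directory.
ls -lh /bricks/brick1/gv0/
ls /bricks/brick1/gv0/.shard | head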

Everything is online and working, but this command gives a strange output:

# gluster volume heal gv0 info heal-failed
Gathering list of heal failed entries on volume gv0 has been
unsuccessful on bricks that are down. Please check if all brick
processes are running.

Is that normal?

I don't think that is a valid command anymore, as when I run it I get the
same message, and this is in the logs:
[2016-11-18 14:35:02.260503] I [MSGID: 106533] [glusterd-volume-ops.c:878:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume GLUSTER1
[2016-11-18 14:35:02.263341] W [MSGID: 106530] [glusterd-volume-ops.c:1882:glusterd_handle_heal_cmd] 0-management: Command not supported. Please use "gluster volume heal GLUSTER1 info" and logs to find the heal information.
[2016-11-18 14:35:02.263365] E [MSGID: 106301] [glusterd-syncop.c:1297:gd_stage_op_phase] 0-management: Staging of operation 'Volume Heal' failed on localhost : Command not supported. Please use "gluster volume heal GLUSTER1 info" and logs to find the heal information.

On Fri, Nov 18, 2016 at 2:51 AM, David Gossage
<dgossage@xxxxxxxxxxxxxxxxxx> wrote:
On Thu, Nov 17, 2016 at 6:42 PM, Olivier Lambert
<lambert.olivier@xxxxxxxxx>
wrote:
Okay, I used the exact same config you provided, and added an arbiter
node (node3).

After halting node2, the VM continued to work after a small "lag"/freeze.
I restarted node2 and it came back online: OK

Then, after waiting a few minutes, I halted node1. And **just** at that
moment, the VM got corrupted (segmentation faults, /var/log folder empty,
etc.)

Other than waiting a few minutes, did you make sure heals had completed?

dmesg of the VM:

[ 1645.852905] EXT4-fs error (device xvda1): htree_dirblock_to_tree:988: inode #19: block 8286: comm bash: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
[ 1645.854509] Aborting journal on device xvda1-8.
[ 1645.855524] EXT4-fs (xvda1): Remounting filesystem read-only

And I got a lot of "comm bash: bad entry in directory" messages after that...

Here is the current config with all Node back online:

# gluster volume info

Volume Name: gv0
Type: Replicate
Volume ID: 5f15c919-57e3-4648-b20a-395d9fe3d7d6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.0.0.1:/bricks/brick1/gv0
Brick2: 10.0.0.2:/bricks/brick1/gv0
Brick3: 10.0.0.3:/bricks/brick1/gv0 (arbiter)
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.shard: on
features.shard-block-size: 16MB
network.remote-dio: enable
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.stat-prefetch: on
performance.strict-write-ordering: off
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.data-self-heal: on


# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.0.0.1:/bricks/brick1/gv0           49152     0          Y       1331
Brick 10.0.0.2:/bricks/brick1/gv0           49152     0          Y       2274
Brick 10.0.0.3:/bricks/brick1/gv0           49152     0          Y       2355
Self-heal Daemon on localhost               N/A       N/A        Y       2300
Self-heal Daemon on 10.0.0.3                N/A       N/A        Y       10530
Self-heal Daemon on 10.0.0.2                N/A       N/A        Y       2425

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks



On Thu, Nov 17, 2016 at 11:35 PM, Olivier Lambert
<lambert.olivier@xxxxxxxxx> wrote:
It's planned to have an arbiter soon :) These were just preliminary
tests.

Thanks for the settings, I'll test them soon and come back to you!

On Thu, Nov 17, 2016 at 11:29 PM, Lindsay Mathieson
<lindsay.mathieson@xxxxxxxxx> wrote:
On 18/11/2016 8:17 AM, Olivier Lambert wrote:
gluster volume info gv0

Volume Name: gv0
Type: Replicate
Volume ID: 2f8658ed-0d9d-4a6f-a00b-96e9d3470b53
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.0.0.1:/bricks/brick1/gv0
Brick2: 10.0.0.2:/bricks/brick1/gv0
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.shard: on
features.shard-block-size: 16MB


When hosting VMs it's essential to set these options:

network.remote-dio: enable
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.stat-prefetch: on
performance.strict-write-ordering: off
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.data-self-heal: on
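
A sketch of applying them one at a time with "gluster volume set", using your gv0 volume name:

# gluster volume set takes one option per invocation.
for opt in \
    "network.remote-dio enable" \
    "cluster.eager-lock enable" \
    "performance.io-cache off" \
    "performance.read-ahead off" \
    "performance.quick-read off" \
    "performance.stat-prefetch on" \
    "performance.strict-write-ordering off" \
    "cluster.server-quorum-type server" \
    "cluster.quorum-type auto" \
    "cluster.data-self-heal on"
do
    gluster volume set gv0 $opt   # word-splits into "<option> <value>"
done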

Also, with replica two and quorum on (required), your volume will become
read-only when one node goes down, to prevent the possibility of
split-brain - you *really* want to avoid that :)

I'd recommend a replica 3 volume; that way one node can go down but the
other two still form a quorum and will remain r/w.

If the extra disks are not possible, then an Arbiter volume can be set up -
basically dummy files on the third node.
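
For a fresh volume, creation would look roughly like this (a sketch only; brick paths taken from your volume info):

# replica 3 with the third brick acting as arbiter (metadata only, no file data)
gluster volume create gv0 replica 3 arbiter 1 \
    10.0.0.1:/bricks/brick1/gv0 \
    10.0.0.2:/bricks/brick1/gv0 \
    10.0.0.3:/bricks/brick1/gv0
gluster volume start gv0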



--
Lindsay Mathieson


_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users


