Hi,
It can happen that the LIO service starts before /mnt gets mounted. In the absence of the backend file, LIO created a new one on the root filesystem (in the /mnt directory). The gluster volume was then mounted over it, but because the backend file was kept open by LIO, it was still used instead of the right one on the gluster volume. Then, when you turn off the first node, the active path for the iSCSI disk switches to the second node (with the empty file, placed on the root filesystem).
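One way to avoid that startup race (a sketch, assuming the backstore file lives under the /mnt gluster mount and your LIO config is restored by target.service; the drop-in path is hypothetical) is a systemd drop-in that orders the target service after the mount:

```ini
# /etc/systemd/system/target.service.d/require-mnt.conf (hypothetical drop-in)
[Unit]
# Pull in and order after the mount unit for /mnt, so LIO never
# opens a backstore path before the gluster volume is mounted there.
RequiresMountsFor=/mnt
```

Run `systemctl daemon-reload` afterwards. RequiresMountsFor adds both a Requires= and After= dependency on the mount unit covering that path.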
After Node 1 is DOWN, LIO on Node 2 (iSCSI target) is no longer writing to the local Gluster mount, but to the root partition, despite "df -h" showing the Gluster brick mounted:

/dev/mapper/centos-root  3,1G  3,1G   20K 100% /
...
/dev/xvdb                 61G   61G  956M  99% /bricks/brick1
localhost:/gv0            61G   61G  956M  99% /mnt

If I unmount it, I still see the "block.img" in /mnt which is filling the root space. So it's like FUSE is messing with the local Gluster mount, which could lead to data corruption at the client level. It doesn't make sense to me... What am I missing?

On Fri, Nov 18, 2016 at 5:00 PM, Olivier Lambert <lambert.olivier@xxxxxxxxx> wrote:

Yes, I did it only if I had the previous result of heal info ("Number of entries: 0"). But same result: as soon as the second Node is offline (after they were both working/back online), everything is corrupted.
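Seeing "block.img" reappear in /mnt after unmounting is standard POSIX behaviour rather than FUSE misbehaving: a process that already holds an open file descriptor keeps writing to the original inode even after a different filesystem (or file) appears at the same path. A minimal Python sketch of the mechanism, using a rename to stand in for the over-mount since a real mount needs root:

```python
import os
import tempfile

d = tempfile.mkdtemp()
path = os.path.join(d, "block.img")

# LIO-style long-lived handle to the original backing file.
with open(path, "wb") as f:
    f.write(b"old")
fd = os.open(path, os.O_RDWR)

# Stand-in for the gluster volume being mounted over /mnt:
# the original file is shadowed and a different file shows up at the path.
os.rename(path, path + ".shadowed")
with open(path, "wb") as g:
    g.write(b"new")

# The long-lived descriptor still points at the ORIGINAL inode...
assert os.fstat(fd).st_ino != os.stat(path).st_ino
# ...so writes through it never reach the file now visible at the path.
os.write(fd, b"still-the-old-file")
os.close(fd)
```

The same reasoning explains why node 2's LIO kept filling the root filesystem: its descriptor predated the mount.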
To recap:
* Node 1 UP, Node 2 UP -> OK
* Node 1 UP, Node 2 DOWN -> OK (just a small lag for multipath to see the path down and change if necessary)
* Node 1 UP, Node 2 UP -> OK (and waiting to have no entries displayed in heal command)
* Node 1 DOWN, Node 2 UP -> NOT OK (data corruption)
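The "waiting to have no entries" step between the transitions above is worth scripting rather than eyeballing. A small sketch, assuming the volume is named gv0; the helper only parses heal-info output from stdin, so it can be checked offline against a captured sample:

```shell
# pending_heals: count bricks reporting a nonzero "Number of entries"
# in `gluster volume heal gv0 info` output read from stdin.
pending_heals() {
  awk -F': ' '/^Number of entries:/ && $2 + 0 > 0 { n++ } END { print n + 0 }'
}

# In production (hypothetical volume name gv0), loop until clean:
#   until [ "$(gluster volume heal gv0 info | pending_heals)" -eq 0 ]; do sleep 5; done

# Offline check against a captured sample:
sample='Brick 10.0.0.1:/bricks/brick1/gv0
Status: Connected
Number of entries: 0

Brick 10.0.0.2:/bricks/brick1/gv0
Status: Connected
Number of entries: 3'
printf '%s\n' "$sample" | pending_heals   # prints 1
```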
On Fri, Nov 18, 2016 at 3:39 PM, David Gossage <dgossage@xxxxxxxxxxxxxxxxxx> wrote:
On Fri, Nov 18, 2016 at 3:49 AM, Olivier Lambert <lambert.olivier@xxxxxxxxx> wrote:
Hi David,
What are the exact commands to be sure it's fine?
Right now I got:
# gluster volume heal gv0 info
Brick 10.0.0.1:/bricks/brick1/gv0
Status: Connected
Number of entries: 0

Brick 10.0.0.2:/bricks/brick1/gv0
Status: Connected
Number of entries: 0

Brick 10.0.0.3:/bricks/brick1/gv0
Status: Connected
Number of entries: 0
Did you run this before taking down 2nd node to see if any heals were ongoing?
Also I see you have sharding enabled. Are your files being served sharded already as well?
Everything is online and working, but this command gives a strange output:
# gluster volume heal gv0 info heal-failed
Gathering list of heal failed entries on volume gv0 has been unsuccessful on bricks that are down. Please check if all brick processes are running.
Is it normal?
I don't think that is a valid command anymore, as when I run it I get the same message, and this is in the logs:

[2016-11-18 14:35:02.260503] I [MSGID: 106533] [glusterd-volume-ops.c:878:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume GLUSTER1
[2016-11-18 14:35:02.263341] W [MSGID: 106530] [glusterd-volume-ops.c:1882:glusterd_handle_heal_cmd] 0-management: Command not supported. Please use "gluster volume heal GLUSTER1 info" and logs to find the heal information.
[2016-11-18 14:35:02.263365] E [MSGID: 106301] [glusterd-syncop.c:1297:gd_stage_op_phase] 0-management: Staging of operation 'Volume Heal' failed on localhost : Command not supported. Please use "gluster volume heal GLUSTER1 info" and logs to find the heal information.
On Fri, Nov 18, 2016 at 2:51 AM, David Gossage <dgossage@xxxxxxxxxxxxxxxxxx> wrote:
On Thu, Nov 17, 2016 at 6:42 PM, Olivier Lambert <lambert.olivier@xxxxxxxxx> wrote:
Okay, I used the exact same config you provided and added an arbiter node (node3).
After halting node2, VM continues to work after a small "lag"/freeze. I restarted node2 and it was back online: OK
Then, after waiting a few minutes, I halted node1. And **just** at this moment, the VM was corrupted (segmentation fault, empty /var/log folder, etc.)
Other than waiting a few minutes did you make sure heals had completed?
dmesg of the VM:
[ 1645.852905] EXT4-fs error (device xvda1): htree_dirblock_to_tree:988: inode #19: block 8286: comm bash: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=0, rec_len=0, name_len=0
[ 1645.854509] Aborting journal on device xvda1-8.
[ 1645.855524] EXT4-fs (xvda1): Remounting filesystem read-only
And then I got a lot of "comm bash: bad entry in directory" messages...
Here is the current config with all Node back online:
# gluster volume info
Volume Name: gv0
Type: Replicate
Volume ID: 5f15c919-57e3-4648-b20a-395d9fe3d7d6
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 10.0.0.1:/bricks/brick1/gv0
Brick2: 10.0.0.2:/bricks/brick1/gv0
Brick3: 10.0.0.3:/bricks/brick1/gv0 (arbiter)
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.shard: on
features.shard-block-size: 16MB
network.remote-dio: enable
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.stat-prefetch: on
performance.strict-write-ordering: off
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.data-self-heal: on
# gluster volume status
Status of volume: gv0
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.0.0.1:/bricks/brick1/gv0           49152     0          Y       1331
Brick 10.0.0.2:/bricks/brick1/gv0           49152     0          Y       2274
Brick 10.0.0.3:/bricks/brick1/gv0           49152     0          Y       2355
Self-heal Daemon on localhost               N/A       N/A        Y       2300
Self-heal Daemon on 10.0.0.3                N/A       N/A        Y       10530
Self-heal Daemon on 10.0.0.2                N/A       N/A        Y       2425

Task Status of Volume gv0
------------------------------------------------------------------------------
There are no active volume tasks
On Thu, Nov 17, 2016 at 11:35 PM, Olivier Lambert <lambert.olivier@xxxxxxxxx> wrote:
It's planned to have an arbiter soon :) It was just preliminary tests.
Thanks for the settings, I'll test this soon and I'll come back to you!
On Thu, Nov 17, 2016 at 11:29 PM, Lindsay Mathieson <lindsay.mathieson@xxxxxxxxx> wrote:
On 18/11/2016 8:17 AM, Olivier Lambert wrote:
gluster volume info gv0
Volume Name: gv0
Type: Replicate
Volume ID: 2f8658ed-0d9d-4a6f-a00b-96e9d3470b53
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.0.0.1:/bricks/brick1/gv0
Brick2: 10.0.0.2:/bricks/brick1/gv0
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.shard: on
features.shard-block-size: 16MB
When hosting VMs, it's essential to set these options:
network.remote-dio: enable
cluster.eager-lock: enable
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
performance.stat-prefetch: on
performance.strict-write-ordering: off
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.data-self-heal: on
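The list above can be applied in one go. A sketch, shown as a dry run that just prints the commands (drop the `echo` to actually apply them; `gv0` is only the example volume name from this thread):

```shell
vol=gv0
# Build the `gluster volume set` commands for every option/value pair.
cmds=$(while read -r opt val; do
  echo "gluster volume set $vol $opt $val"
done <<'EOF'
network.remote-dio enable
cluster.eager-lock enable
performance.io-cache off
performance.read-ahead off
performance.quick-read off
performance.stat-prefetch on
performance.strict-write-ordering off
cluster.server-quorum-type server
cluster.quorum-type auto
cluster.data-self-heal on
EOF
)
# Dry run: show what would be executed.
printf '%s\n' "$cmds"
```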
Also, with replica two and quorum on (required), your volume will become read-only when one node goes down, to prevent the possibility of split-brain - you *really* want to avoid that :)
I'd recommend a replica 3 volume, that way 1 node can go down, but the other two still form a quorum and will remain r/w.
If the extra disks are not possible, then an arbiter volume can be set up - basically dummy files on the third node.
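For reference, an arbiter volume is created by giving an arbiter count on the create command. A sketch, assuming the same brick paths used elsewhere in this thread; the last-listed brick becomes the arbiter and stores only file names and metadata, no data:

```shell
gluster volume create gv0 replica 3 arbiter 1 \
  10.0.0.1:/bricks/brick1/gv0 \
  10.0.0.2:/bricks/brick1/gv0 \
  10.0.0.3:/bricks/brick1/gv0
```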
-- Lindsay Mathieson
_______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-users
-- Dmitry Glushenok, Jet Infosystems, +7-910-453-2568