Hi,

I reduced my setup to two nodes and did a graceful restart of one node (rod). The problem is still the same: split-brain on the vSphere-HA lock file. I found some additional log entries that might give some clues, especially the "CREATE (null)" error on the live node while the other node was offline.

[root@todd ~]# tail -F /var/log/glusterfs/bricks/data-gv0.log
[2013-12-04 07:37:00.339843] I [server.c:762:server_rpc_notify] 0-gv0-server: disconnecting connectionfrom rod.roxen.com-23979-2013/12/04-07:20:52:76630-gv0-client-0-0
[2013-12-04 07:37:00.339866] I [server-helpers.c:729:server_connection_put] 0-gv0-server: Shutting down connection rod.roxen.com-23979-2013/12/04-07:20:52:76630-gv0-client-0-0
[2013-12-04 07:37:00.339889] I [server-helpers.c:617:server_connection_destroy] 0-gv0-server: destroyed connection of rod.roxen.com-23979-2013/12/04-07:20:52:76630-gv0-client-0-0
[2013-12-04 07:37:00.810926] I [server.c:762:server_rpc_notify] 0-gv0-server: disconnecting connectionfrom rod.roxen.com-2472-2013/12/03-16:06:42:234499-gv0-client-0-0
[2013-12-04 07:37:00.810950] I [server-helpers.c:729:server_connection_put] 0-gv0-server: Shutting down connection rod.roxen.com-2472-2013/12/03-16:06:42:234499-gv0-client-0-0
[2013-12-04 07:37:00.811005] I [server-helpers.c:617:server_connection_destroy] 0-gv0-server: destroyed connection of rod.roxen.com-2472-2013/12/03-16:06:42:234499-gv0-client-0-0
[2013-12-04 07:38:01.696398] I [server-rpc-fops.c:1618:server_create_cbk] 0-gv0-server: 445781: CREATE (null) (f0648215-68ff-441e-88aa-99a553c6d4e6/.lck-21133152dee76ab0) ==> (File exists)
[2013-12-04 07:41:11.841299] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from rod.roxen.com-2447-2013/12/04-07:41:11:718343-gv0-client-0-0 (version: 3.4.1)
[2013-12-04 07:41:17.345764] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from rod.roxen.com-2875-2013/12/04-07:41:17:416820-gv0-client-0-0 (version: 3.4.1)
[2013-12-04 07:41:17.395322] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from rod.roxen.com-2873-2013/12/04-07:41:17:400240-gv0-client-0-0 (version: 3.4.1)

[root@rod ~]# tail -F /var/log/glusterfs/bricks/data-gv0.log
[2013-12-04 07:41:17.928235] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from todd.roxen.com-14615-2013/12/03-13:58:46:20150-gv0-client-1-0 (version: 3.4.1)
[2013-12-04 07:41:18.273444] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from todd.roxen.com-15161-2013/12/03-14:02:51:809483-gv0-client-1-0 (version: 3.4.1)
[2013-12-04 07:41:18.372988] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (79b19fbf-4fc9-45e4-bb2c-c0f7cabf3de5) is not found. anonymous fd creation failed
[2013-12-04 07:41:18.373367] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (79b19fbf-4fc9-45e4-bb2c-c0f7cabf3de5) is not found. anonymous fd creation failed
[2013-12-04 07:41:18.860315] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from rod.roxen.com-2447-2013/12/04-07:41:11:718343-gv0-client-1-0 (version: 3.4.1)
[2013-12-04 07:41:20.030341] I [server-handshake.c:567:server_setvolume] 0-gv0-server: accepted client from todd.roxen.com-14617-2013/12/03-13:58:46:34506-gv0-client-1-0 (version: 3.4.1)
[2013-12-04 07:41:20.597249] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.597409] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (3b6c53f2-cefc-46ad-81f3-0fdaa4c97414) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.597422] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.597468] I [server-rpc-fops.c:293:server_finodelk_cbk] 0-gv0-server: 438245: FINODELK -2 (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) ==> (No such file or directory)
[2013-12-04 07:41:20.597546] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (62050a26-ab8c-4dd7-b1ac-c4be46a42cbb) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.597664] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (86c7c599-3715-4df1-9a35-3ba8703cbf60) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.597823] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (49fcdf96-5ea1-4005-b394-6c581ab93a64) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.598239] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.638964] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.638984] I [server-rpc-fops.c:293:server_finodelk_cbk] 0-gv0-server: 438250: FINODELK -2 (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) ==> (No such file or directory)
[2013-12-04 07:41:20.639913] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.640078] W [server-resolve.c:419:resolve_anonfd_simple] 0-server: inode for the gfid (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) is not found. anonymous fd creation failed
[2013-12-04 07:41:20.640095] I [server-rpc-fops.c:1401:server_fsync_cbk] 0-gv0-server: 438252: FSYNC -2 (8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc) ==> (No such file or directory)
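
For reference, the gfid in the FINODELK/FSYNC errors on rod, 8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc, is exactly the gfid the lock file carries on todd (getfattr output below). In case it helps, one way to map a gfid back to a path directly on a brick is via its hardlink under .glusterfs (this assumes the standard two-level .glusterfs layout and that the file still exists on that brick), e.g. on todd:

    # hypothetical check, not from my terminal: resolve the gfid to its path on todd's brick
    find /data/gv0 -samefile /data/gv0/.glusterfs/8e/7f/8e7f2d59-51a4-4cd8-a8c1-2d55baccf1dc -not -path '*/.glusterfs/*'

On rod that gfid apparently does not exist at all, which would explain the "inode for the gfid ... is not found" warnings.
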
[root@todd ~]# getfattr -m . -d -e hex /data/gv0/production/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5165-87b180a-vmware/.lck-21133152dee76ab0
getfattr: Removing leading '/' from absolute path names
# file: data/gv0/production/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5165-87b180a-vmware/.lck-21133152dee76ab0
trusted.afr.gv0-client-0=0x000000000000000000000000
trusted.afr.gv0-client-1=0x000000410000000100000000
trusted.gfid=0x8e7f2d5951a44cd8a8c12d55baccf1dc

[root@rod ~]# getfattr -m . -d -e hex /data/gv0/production/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5165-87b180a-vmware/.lck-21133152dee76ab0
getfattr: Removing leading '/' from absolute path names
# file: data/gv0/production/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5165-87b180a-vmware/.lck-21133152dee76ab0
trusted.afr.gv0-client-0=0x000000000000000000000000
trusted.afr.gv0-client-1=0x000000000000000000000000
trusted.gfid=0x90fef179bc4e4b2d9cccd13a9a7d859f
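
If I decode the trusted.afr changelogs correctly (each value is three 32-bit counters: pending data, metadata and entry operations against that replica), todd is holding 0x41 = 65 pending data operations and 1 pending metadata operation against rod's copy (gv0-client-1), while rod's copy is clean but carries a completely different gfid. As far as I understand, 3.4 cannot self-heal a gfid mismatch like this on its own. Assuming rod's copy is the stale one, my understanding of the usual manual fix is roughly the following sketch (the .glusterfs path is derived from rod's gfid above; please correct me if this is not the recommended procedure):

    # on rod (assumed stale): remove the file and its gfid hardlink directly from the brick
    rm /data/gv0/production/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5165-87b180a-vmware/.lck-21133152dee76ab0
    rm /data/gv0/.glusterfs/90/fe/90fef179-bc4e-4b2d-9ccc-d13a9a7d859f

    # then, on either node, trigger a heal and check the result
    gluster volume heal gv0 full
    gluster volume heal gv0 info
    gluster volume heal gv0 info split-brain

That would only repair this one file after the fact, though; it does not explain why the lock file ends up in this state after every restart.
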
Regards,
Marcus

On Tue, 2013-12-03 at 09:01 +0100, Marcus Wellhardh wrote:
> Hi,
>
> I did a trivial test to verify my delete/recreate theory:
>
> 1) File exists on all nodes.
> 2) One node is powered down.
> 3) The file is deleted and recreated with the same filename.
> 4) The failed node is restarted.
> 5) Self-heal worked on the modified file.
>
> GlusterFS handled the above scenario perfectly. So the question is why
> self-heal fails on the vSphere-HA lock file. Does anyone have a
> troubleshooting idea?
>
> I am using:
>
> glusterfs-3.4.1-3.el6.x86_64
> CentOS release 6.4
>
> Regards,
> Marcus
>
> On Fri, 2013-11-29 at 14:05 +0100, Marcus Wellhardh wrote:
> > Hi,
> >
> > I have a glusterfs volume replicated on three nodes. I am planning to use
> > the volume as storage for VMware ESXi machines using NFS. The reason for
> > using three nodes is to be able to configure quorum and avoid
> > split-brains. However, during my initial testing, when intentionally and
> > gracefully restarting the node "ned", a split-brain/self-heal error
> > occurred.
> >
> > The log on "todd" and "rod" gives:
> >
> > [2013-11-29 12:34:14.614456] E [afr-self-heal-data.c:1270:afr_sh_data_open_cbk] 0-gv0-replicate-0: open of <gfid:09b6d1d7-e583-4cee-93a4-4e972346ade3> failed on child gv0-client-2 (No such file or directory)
> >
> > The reason is probably that the file was deleted and recreated with the
> > same file name while the node was offline, i.e. it got a new inode and
> > thus a new gfid.
> >
> > Is this expected? Is it possible to configure the volume to
> > handle this automatically?
> >
> > The same problem happens every time I test a restart. It looks like
> > VMware is constantly creating new lock files in the vSphere-HA
> > directory.
> >
> > Below you will find various information about the glusterfs volume. I
> > have also attached the full logs for all three nodes.
> >
> > [root@todd ~]# gluster volume info
> >
> > Volume Name: gv0
> > Type: Replicate
> > Volume ID: a847a533-9509-48c5-9c18-a40b48426fbc
> > Status: Started
> > Number of Bricks: 1 x 3 = 3
> > Transport-type: tcp
> > Bricks:
> > Brick1: todd-storage:/data/gv0
> > Brick2: rod-storage:/data/gv0
> > Brick3: ned-storage:/data/gv0
> > Options Reconfigured:
> > cluster.server-quorum-type: server
> > cluster.server-quorum-ratio: 51%
> >
> > [root@todd ~]# gluster volume heal gv0 info
> > Gathering Heal info on volume gv0 has been successful
> >
> > Brick todd-storage:/data/gv0
> > Number of entries: 2
> > /production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware
> > /production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> >
> > Brick rod-storage:/data/gv0
> > Number of entries: 2
> > /production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware
> > /production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> >
> > Brick ned-storage:/data/gv0
> > Number of entries: 0
> >
> > [root@todd ~]# getfattr -m . -d -e hex /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > getfattr: Removing leading '/' from absolute path names
> > # file: data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > trusted.afr.gv0-client-0=0x000000000000000000000000
> > trusted.afr.gv0-client-1=0x000000000000000000000000
> > trusted.afr.gv0-client-2=0x000002810000000100000000
> > trusted.gfid=0x09b6d1d7e5834cee93a44e972346ade3
> >
> > [root@todd ~]# stat /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> >   File: `/data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb'
> >   Size: 84        Blocks: 8          IO Block: 4096   regular file
> > Device: fd03h/64771d    Inode: 1191    Links: 2
> > Access: (0775/-rwxrwxr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2013-11-29 11:38:36.285091183 +0100
> > Modify: 2013-11-29 13:26:24.668822831 +0100
> > Change: 2013-11-29 13:26:24.668822831 +0100
> >
> > [root@rod ~]# getfattr -m . -d -e hex /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > getfattr: Removing leading '/' from absolute path names
> > # file: data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > trusted.afr.gv0-client-0=0x000000000000000000000000
> > trusted.afr.gv0-client-1=0x000000000000000000000000
> > trusted.afr.gv0-client-2=0x000002810000000100000000
> > trusted.gfid=0x09b6d1d7e5834cee93a44e972346ade3
> >
> > [root@rod ~]# stat /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> >   File: `/data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb'
> >   Size: 84        Blocks: 8          IO Block: 4096   regular file
> > Device: fd03h/64771d    Inode: 1558    Links: 2
> > Access: (0775/-rwxrwxr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2013-11-29 11:38:36.284671510 +0100
> > Modify: 2013-11-29 13:26:24.668985155 +0100
> > Change: 2013-11-29 13:26:24.669985185 +0100
> >
> > [root@ned ~]# getfattr -m . -d -e hex /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > getfattr: Removing leading '/' from absolute path names
> > # file: data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> > trusted.afr.gv0-client-0=0x000000000000000000000000
> > trusted.afr.gv0-client-1=0x000000000000000000000000
> > trusted.afr.gv0-client-2=0x000000000000000000000000
> > trusted.gfid=0x76caf49a25d74ebdb711a562412bee43
> >
> > [root@ned ~]# stat /data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb
> >   File: `/data/gv0/production-cluster/.vSphere-HA/FDM-DA596AD1-4A6C-4571-A3C8-2114B4FF61EA-5034-b6e1d26-vmware/.lck-5e711126a297a6bb'
> >   Size: 84        Blocks: 8          IO Block: 4096   regular file
> > Device: fd03h/64771d    Inode: 4545    Links: 2
> > Access: (0775/-rwxrwxr-x)  Uid: (    0/    root)   Gid: (    0/    root)
> > Access: 2013-11-29 11:34:45.199330329 +0100
> > Modify: 2013-11-29 11:37:03.773330311 +0100
> > Change: 2013-11-29 11:37:03.773330311 +0100
> >
> > Regards,
> > Marcus Wellhardh

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users