Re: Random and frequent split brain

On 07/17/2014 08:41 AM, Nilesh Govindrajan wrote:
log1 and log2 are brick logs. The others are client logs.
I see a lot of log entries like the ones below in the 'log1' you attached. It seems the inode/device numbers of the file where it is actually stored and of its gfid hard link inside <brick-dir>/.glusterfs/ are different. What devices/filesystems are present inside the brick represented by 'log1'?

[2014-07-16 00:00:24.358628] W [posix-handle.c:586:posix_handle_hard] 0-home-posix: mismatching ino/dev between file /data/gluster/home/techiebuzz/techie-buzz.com/wp-content/cache/page_enhanced/techie-buzz.com/social-networking/facebook-will-permanently-remove-your-deleted-photos.html/_index.html.old (1077282838/2431) and handle /data/gluster/home/.glusterfs/ae/f0/aef0404b-e084-4501-9d0f-0e6f5bb2d5e0 (1077282836/2431)
[2014-07-16 00:00:24.358646] E [posix.c:823:posix_mknod] 0-home-posix: setting gfid on /data/gluster/home/techiebuzz/techie-buzz.com/wp-content/cache/page_enhanced/techie-buzz.com/social-networking/facebook-will-permanently-remove-your-deleted-photos.html/_index.html.old failed
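For a regular file, the path under the brick and its gfid handle under .glusterfs should be hard links reporting identical inode/device numbers; in the entry above the inodes differ (1077282838 vs 1077282836) on the same device (2431). You can compare them directly on the brick with stat (paths taken from the log above):

stat -c '%i %d %h %n' /data/gluster/home/techiebuzz/techie-buzz.com/wp-content/cache/page_enhanced/techie-buzz.com/social-networking/facebook-will-permanently-remove-your-deleted-photos.html/_index.html.old
stat -c '%i %d %h %n' /data/gluster/home/.glusterfs/ae/f0/aef0404b-e084-4501-9d0f-0e6f5bb2d5e0

The format prints inode, device number, hard-link count and name.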

Pranith


On Thu, Jul 17, 2014 at 8:08 AM, Pranith Kumar Karampuri
<pkarampu@xxxxxxxxxx> wrote:
On 07/17/2014 07:28 AM, Nilesh Govindrajan wrote:
On Thu, Jul 17, 2014 at 7:26 AM, Nilesh Govindrajan <me@xxxxxxxxxxxx>
wrote:
Hello,

I'm having a weird issue. I have this config:

node2 ~ # gluster peer status
Number of Peers: 1

Hostname: sto1
Uuid: f7570524-811a-44ed-b2eb-d7acffadfaa5
State: Peer in Cluster (Connected)

node1 ~ # gluster peer status
Number of Peers: 1

Hostname: sto2
Port: 24007
Uuid: 3a69faa9-f622-4c35-ac5e-b14a6826f5d9
State: Peer in Cluster (Connected)

Volume Name: home
Type: Replicate
Volume ID: 54fef941-2e33-4acf-9e98-1f86ea4f35b7
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: sto1:/data/gluster/home
Brick2: sto2:/data/gluster/home
Options Reconfigured:
performance.write-behind-window-size: 2GB
performance.flush-behind: on
performance.cache-size: 2GB
cluster.choose-local: on
storage.linux-aio: on
transport.keepalive: on
performance.quick-read: on
performance.io-cache: on
performance.stat-prefetch: on
performance.read-ahead: on
cluster.data-self-heal-algorithm: diff
nfs.disable: on

sto1/2 are aliases for node1/2, respectively.

As you can see, NFS is disabled, so I'm using the native FUSE mount on both
nodes. The volume contains files and PHP scripts that are served on various
websites. When both nodes are active, I get split-brain on many files, and
the mount on node2 returns 'input/output error' for many of them, which
causes HTTP 500 errors.
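For reference, the files gluster currently flags as split-brained can be listed from either node with:

gluster volume heal home info split-brain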

I delete the affected files from the brick using find -samefile. That fixes
things for a few minutes, and then the problem is back.
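Something like this, with <affected-file> standing in for a path reported in the client log (a sketch, run on one brick only):

find /data/gluster/home -samefile /data/gluster/home/<affected-file> -print -delete

which removes both the file and its gfid hard link under /data/gluster/home/.glusterfs, so the next heal recreates them from the other brick.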

What could be the issue? This happens even if I use the NFS mounting
method.

Gluster 3.4.4 on Gentoo.
And yes, network connectivity is not an issue: both nodes are located in the
same DC and connected via a 1 Gbit line (shared between internal and
external traffic), and external traffic doesn't exceed 200-500 Mbit/s,
leaving quite a good window for gluster. I also tried enabling quorum, but
that doesn't help either.
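(For reference, client-side quorum would be enabled with something like
gluster volume set home cluster.quorum-type auto; with a two-brick replica
this mainly trades split-brain risk for availability when a brick goes down.)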
Hi Nilesh,
       Could you attach the mount and brick logs so that we can inspect what is
going on in the setup?

Pranith

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users



