They both have dual-port Mellanox 20Gbps InfiniBand cards, connected
directly to each other with a single cable (back-to-back, no switch),
with opensm running as the subnet manager so the RDMA transport
between them works.
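For reference, this is roughly how I've been sanity-checking the fabric on each node; these are the standard infiniband-diags / libibverbs tools, nothing gluster-specific, and the expected values are just what I'd look for on DDR cards:

ibstat          # port should show State: Active and Rate: 20 (DDR)
sminfo          # should report the opensm instance as the master subnet manager
ibv_devinfo     # verbs-level view of the HCA; these older Mellanox cards use the mthca driver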
Here are some data dumps to set the stage (and yes, the
output of these commands looks the same on both nodes):
[root@duchess ~]# gluster volume info
Volume Name: gluster_disk
Type: Replicate
Volume ID: b1279e22-8589-407b-8671-3760f42e93e4
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: rdma
Bricks:
Brick1: duke-ib:/bricks/brick1
Brick2: duchess-ib:/bricks/brick1
[root@duchess ~]# gluster volume status
Status of volume: gluster_disk
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick duke-ib:/bricks/brick1                    49153   Y       9594
Brick duchess-ib:/bricks/brick1                 49153   Y       9583
NFS Server on localhost                         2049    Y       9590
Self-heal Daemon on localhost                   N/A     Y       9597
NFS Server on 10.10.10.1                        2049    Y       9607
Self-heal Daemon on 10.10.10.1                  N/A     Y       9614

Task Status of Volume gluster_disk
------------------------------------------------------------------------------
There are no active volume tasks
[root@duchess ~]# gluster peer status
Number of Peers: 1
Hostname: 10.10.10.1
Uuid: aca56ec5-94bb-4bb0-8a9e-b3d134bbfe7b
State: Peer in Cluster (Connected)
Before putting any real data on these guys (the data will eventually
be a handful of large image files backing an iSCSI target via tgtd
for ESXi datastores), I wanted to simulate the failure of one of the
nodes. I stopped glusterfsd and glusterd on duchess, waited about 5
minutes, then started them back up again while tailing
/var/log/glusterfs/* and /var/log/messages. I'm not sure exactly what
I should be looking for, but the logs quieted down within a minute or
so of the daemons coming back, and I didn't see much indicating that
any self-healing was going on.
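For what it's worth, this is what I had in mind for kicking a heal off by hand and watching it drain; both are stock gluster CLI subcommands, I'm just not sure they're the right way to confirm a heal actually ran:

gluster volume heal gluster_disk full     # ask the self-heal daemon to do a full crawl of the bricks
gluster volume heal gluster_disk info     # list entries still pending heal; should drop to zero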
Every now and then (and seemingly more often than not), when I run
"gluster volume heal gluster_disk info", I get no output at all from
the command, and the following shows up in /var/log/messages:
Mar 15 13:59:16 duchess kernel: glfsheal[10365]: segfault at 7ff56068d020 ip 00007ff54f366d80 sp 00007ff54e22adf8 error 6 in libmthca-rdmav2.so[7ff54f365000+7000]
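My next step is probably to try to get a backtrace out of that crash by running the heal helper under gdb directly, something along these lines (the glfsheal path below is a guess based on how the package lays things out, so adjust to wherever it actually lives):

# glfsheal appears to be the helper that gets launched for "heal ... info"
# (it's what segfaults in the log above); running it under gdb should catch
# the fault so "bt" can show where in libmthca-rdmav2.so it's dying
gdb --args /usr/libexec/glusterfs/glfsheal gluster_disk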