I had opened another thread on this
mailing list (Subject: "After upgrade from 3.4.2
to 3.8.5 - High CPU usage resulting in disconnects
and split-brain").
The title may be a bit misleading now,
as I am no longer observing high CPU usage after
upgrading to 3.8.6, but the disconnects are still
happening and the number of files in split-brain
is growing.
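(To be clear about what I am counting: by
"files in split-brain" I mean the entries
reported per brick by the usual heal info
command, e.g.

# gluster volume heal gv0 info split-brain

gv0 being the scratch volume, see the volume
info below.)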
Setup: 6 compute nodes, each serving as a
glusterfs server and client, Ubuntu 14.04, two
bricks per node, distributed-replicate.
I have two gluster volumes set up (one
for scratch data, one for the slurm scheduler).
Only the scratch data volume shows the critical
error "[...] has not responded in the last 42
seconds, disconnecting.", so I think I can rule
out general network problems; the gigabit link
between the nodes is not saturated at all, and
the disks are almost idle (<10% utilization).
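Side note: if I understand it correctly, the 42
seconds in that message is just the default
network.ping-timeout. It is not listed under
"Options Reconfigured" below, so it should still
be at its default; it can be checked with e.g.

# gluster volume get gv0 network.ping-timeout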
I have glusterfs 3.4.2 on Ubuntu 12.04
on another compute cluster, running fine since
it was deployed.
I had glusterfs 3.4.2 on Ubuntu 14.04
on this cluster, running fine for almost a year.
After upgrading to 3.8.5, the problems
described above started. I would like to use some
of the new features of the newer versions (like
bitrot), but the users can't run their compute
jobs right now because the result files are
garbled.
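(By bitrot I mean the per-volume bitrot
detection that, as far as I understand, would be
enabled with something like

# gluster volume bitrot gv0 enable
# gluster volume bitrot gv0 scrub-throttle lazy

I have not enabled it yet, so it should not be a
factor in the problems described here.)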
There also seems to be a bug report with a
similar problem (but no progress there so far):
For me, ALL servers are affected; the problem
is not isolated to one or two servers.
For completeness (gv0 is the scratch
volume, gv2 the slurm volume):
[root@giant2: ~]# gluster v info
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdc/gv0
Brick2: giant2:/gluster/sdc/gv0
Brick3: giant3:/gluster/sdc/gv0
Brick4: giant4:/gluster/sdc/gv0
Brick5: giant5:/gluster/sdc/gv0
Brick6: giant6:/gluster/sdc/gv0
Brick7: giant1:/gluster/sdd/gv0
Brick8: giant2:/gluster/sdd/gv0
Brick9: giant3:/gluster/sdd/gv0
Brick10: giant4:/gluster/sdd/gv0
Brick11: giant5:/gluster/sdd/gv0
Brick12: giant6:/gluster/sdd/gv0
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
nfs.disable: on
Volume Name: gv2
Type: Replicate
Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdd/gv2
Brick2: giant2:/gluster/sdd/gv2
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
cluster.granular-entry-heal: on
cluster.locking-scheme: granular
nfs.disable: on