I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain"). The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
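(For reference, the per-volume list of files in split-brain can be shown with the standard heal command, e.g. for the scratch volume gv0 described further down:

[root@giant2: ~]# gluster volume heal gv0 info split-brain
)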
Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors like "[...] has not responded in the last 42 seconds, disconnecting.". So I can rule out network problems: the gigabit link between the nodes is not saturated at all, and the disks are almost idle (<10%).
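(As far as I can tell, those 42 seconds correspond to the network.ping-timeout option, whose default is 42 seconds; it is not among the reconfigured options listed below. The effective value can be checked with something like:

[root@giant2: ~]# gluster volume get gv0 network.ping-timeout
)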
I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
After upgrading to 3.8.5, the problems described above started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
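(By bitrot I mean the per-volume bitrot detection feature; as far as I understand it is enabled with a command along the lines of the one below, but I have not turned it on yet while the cluster is in this state.

[root@giant2: ~]# gluster volume bitrot gv0 enable
)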
There also seems to be a bug report with a similar problem (but no progress):
For me, ALL servers are affected (not isolated to one or two servers).
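(For what it's worth, one quick way to see which nodes log the disconnect message is to grep the glusterfs client logs on each server, roughly like below; the exact log file names depend on the mount points.

[root@giant2: ~]# grep -l "has not responded in the last 42 seconds" /var/log/glusterfs/*.log
)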
For completeness (gv0 is the scratch volume, gv2 the slurm volume):
[root@giant2: ~]# gluster v info

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdc/gv0
Brick2: giant2:/gluster/sdc/gv0
Brick3: giant3:/gluster/sdc/gv0
Brick4: giant4:/gluster/sdc/gv0
Brick5: giant5:/gluster/sdc/gv0
Brick6: giant6:/gluster/sdc/gv0
Brick7: giant1:/gluster/sdd/gv0
Brick8: giant2:/gluster/sdd/gv0
Brick9: giant3:/gluster/sdd/gv0
Brick10: giant4:/gluster/sdd/gv0
Brick11: giant5:/gluster/sdd/gv0
Brick12: giant6:/gluster/sdd/gv0
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
nfs.disable: on

Volume Name: gv2
Type: Replicate
Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdd/gv2
Brick2: giant2:/gluster/sdd/gv2
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
cluster.granular-entry-heal: on
cluster.locking-scheme: granular
nfs.disable: on