I had opened another thread on this mailing list (Subject: "After upgrade from 3.4.2 to 3.8.5 - High CPU usage resulting in disconnects and split-brain"). The title may be a bit misleading now, as I am no longer observing high CPU usage after upgrading to 3.8.6, but the disconnects are still happening and the number of files in split-brain is growing.
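In case it is relevant: the split-brain count I mention is per volume, as listed by the standard heal info command (gv0 is the affected scratch volume, full volume info is further down):

    gluster volume heal gv0 info split-brain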
Setup: 6 compute nodes, each serving as a glusterfs server and client, Ubuntu 14.04, two bricks per node, distribute-replicate.
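For reference, a volume with this layout (replica 2, two bricks per node, 6 x 2 = 12 bricks) would have been created roughly like the sketch below; this is only illustrative, the actual brick list is in the gluster v info output at the end of this mail:

    gluster volume create gv0 replica 2 \
        giant1:/gluster/sdc/gv0 giant2:/gluster/sdc/gv0 \
        giant3:/gluster/sdc/gv0 giant4:/gluster/sdc/gv0 \
        giant5:/gluster/sdc/gv0 giant6:/gluster/sdc/gv0 \
        giant1:/gluster/sdd/gv0 giant2:/gluster/sdd/gv0 \
        giant3:/gluster/sdd/gv0 giant4:/gluster/sdd/gv0 \
        giant5:/gluster/sdd/gv0 giant6:/gluster/sdd/gv0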
I have two gluster volumes set up (one for scratch data, one for the slurm scheduler). Only the scratch data volume shows critical errors like "[...] has not responded in the last 42 seconds, disconnecting.". So I can rule out network problems: the gigabit link between the nodes is not saturated at all, and the disks are almost idle (<10%).
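The 42 seconds in that message match the default network.ping-timeout, which I have not reconfigured on either volume (see the options below); as far as I know it can be checked per volume with:

    gluster volume get gv0 network.ping-timeout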
I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, running fine since it was deployed.
I had glusterfs 3.4.2 on Ubuntu 14.04 on this cluster, running fine for almost a year.
After upgrading to 3.8.5, the problems (as described) started. I would like to use some of the new features of the newer versions (like bitrot), but the users can't run their compute jobs right now because the result files are garbled.
There also seems to be a bug report with a similar problem (but no progress):
For me, ALL servers are affected (not isolated to one or two servers).
For completeness (gv0 is the scratch volume, gv2 the slurm volume):
[root@giant2: ~]# gluster v info

Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdc/gv0
Brick2: giant2:/gluster/sdc/gv0
Brick3: giant3:/gluster/sdc/gv0
Brick4: giant4:/gluster/sdc/gv0
Brick5: giant5:/gluster/sdc/gv0
Brick6: giant6:/gluster/sdc/gv0
Brick7: giant1:/gluster/sdd/gv0
Brick8: giant2:/gluster/sdd/gv0
Brick9: giant3:/gluster/sdd/gv0
Brick10: giant4:/gluster/sdd/gv0
Brick11: giant5:/gluster/sdd/gv0
Brick12: giant6:/gluster/sdd/gv0
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
nfs.disable: on

Volume Name: gv2
Type: Replicate
Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdd/gv2
Brick2: giant2:/gluster/sdd/gv2
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
cluster.granular-entry-heal: on
cluster.locking-scheme: granular
nfs.disable: on