I had opened another thread on this mailing
list (Subject: "After upgrade from 3.4.2 to 3.8.5 -
High CPU usage resulting in disconnects and
split-brain").
The title may be a bit misleading now, as I
am no longer observing high CPU usage after upgrading
to 3.8.6, but the disconnects are still happening and
the number of files in split-brain is growing.
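For reference, the split-brain files I am referring to are the ones reported by the standard heal command on gv0:

[root@giant2: ~]# gluster volume heal gv0 info split-brain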
Setup: 6 compute nodes, each serving as a
glusterfs server and client, Ubuntu 14.04, two bricks
per node, distribute-replicate
I have two gluster volumes set up (one for scratch data, one for the Slurm scheduler). Only the scratch data volume shows critical errors like "[...] has not responded in the last 42 seconds, disconnecting." Since only one of the two volumes is affected, I think I can rule out network problems; the gigabit link between the nodes is not saturated at all, and the disks are almost idle (<10% utilization).
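The 42 seconds in that error correspond to the default network.ping-timeout; it does not show up under "Options Reconfigured" below, so it is still at the default here. In case anyone wants to compare, the effective value can be checked with:

[root@giant2: ~]# gluster volume get gv0 network.ping-timeout

(It could be raised with "gluster volume set gv0 network.ping-timeout <seconds>", but I have not changed it.)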
I have glusterfs 3.4.2 on Ubuntu 12.04 on another compute cluster, and it has been running fine since it was deployed.
I had glusterfs 3.4.2 on Ubuntu 14.04 on
this cluster, running fine for almost a year.
After upgrading to 3.8.5, the problems described above started. I would like to use some of the features of the newer versions (like bitrot detection), but right now the users can't run their compute jobs because the result files are garbled.
There also seems to be a bug report with a similar problem (but no progress there):
For me, ALL servers are affected (the problem is not isolated to one or two servers).
For completeness, here is the volume info (gv0 is the scratch volume, gv2 the Slurm volume):
[root@giant2: ~]# gluster v info
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 993ec7c9-e4bc-44d0-b7c4-2d977e622e86
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdc/gv0
Brick2: giant2:/gluster/sdc/gv0
Brick3: giant3:/gluster/sdc/gv0
Brick4: giant4:/gluster/sdc/gv0
Brick5: giant5:/gluster/sdc/gv0
Brick6: giant6:/gluster/sdc/gv0
Brick7: giant1:/gluster/sdd/gv0
Brick8: giant2:/gluster/sdd/gv0
Brick9: giant3:/gluster/sdd/gv0
Brick10: giant4:/gluster/sdd/gv0
Brick11: giant5:/gluster/sdd/gv0
Brick12: giant6:/gluster/sdd/gv0
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
nfs.disable: on
Volume Name: gv2
Type: Replicate
Volume ID: 30c78928-5f2c-4671-becc-8deaee1a7a8d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: giant1:/gluster/sdd/gv2
Brick2: giant2:/gluster/sdd/gv2
Options Reconfigured:
auth.allow: X.X.X.*,127.0.0.1
cluster.granular-entry-heal: on
cluster.locking-scheme: granular
nfs.disable: on