A lot more than 128 clients. Well over 1,000. And I believe
we may have found the problem; it looks like you were headed
in the right direction, as it appears to be a problem with
one of the clients' FUSE mounts.
When we couldn't resolve the issue, I started moving all of
my users off of the gluster storage system as it was no longer
responsive. After moving all of them off, I tried to kill all
of the clients that had homegfs mounted by doing a 'killall
glusterfs' on all of the machines connected to gluster. There
was one machine where, even after killing all of the glusterfs
processes and confirming that none were still running, 'mount'
still showed the FUSE mount. It only went away after a
'umount -lf /homegfs'.
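For reference, the per-client cleanup described above can be sketched as a small script. The mount point /homegfs is ours, and the destructive commands are commented out so the sketch is safe to read and run:

```shell
#!/bin/sh
# Sketch of the per-client cleanup described above.
# MNT is our site's mount point; substitute your own.
MNT=/homegfs

# Step 1: kill the FUSE client processes on this machine.
# killall glusterfs            # destructive; uncomment on a real client

# Step 2: if /proc/mounts still lists a glusterfs FUSE mount but no
# glusterfs process is left, lazy/force-unmount the stale mount.
if grep -qs "fuse.glusterfs" /proc/mounts && ! pgrep -x glusterfs >/dev/null
then
    echo "stale FUSE mount detected; would run: umount -lf $MNT"
    # umount -lf "$MNT"        # destructive; uncomment on a real client
fi
```

The lazy flag (-l) detaches the mount immediately even with open file handles, which is what finally cleared it for us.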
After I killed the client mounts and restarted all of them,
we haven't had any more issues with out-of-control loads on the
storage systems. We had seen this before with a runaway FUSE
mount, but that time we found the problem by looking at the load
on all of the clients: the one problem node had an extremely
high load, well out of the norm, and resetting its FUSE mount
cleared the problem. In this case, there was no indication of
which client was causing the issue, and the only way to figure
it out was to take the storage system out of production use.
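The load check we used that earlier time amounts to something like this. The host list is hypothetical, and the ssh line is commented out so the sketch runs locally:

```shell
#!/bin/sh
# Sketch: scan client load averages to spot the runaway node.
# CLIENTS is a hypothetical host list; substitute your own.
CLIENTS="node01 node02 node03"

for h in $CLIENTS; do
    # Real version: load=$(ssh "$h" cat /proc/loadavg | awk '{print $1}')
    load=$(awk '{print $1}' /proc/loadavg)   # local stand-in for the sketch
    echo "$h 1-min load: $load"
done
```

It worked because the bad node stood out; as noted above, this time nothing stood out, which is why we had to pull the storage out of production.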
My understanding is that the FUSE client writes to both
bricks of the replica pair at the same time. Does it make
sense that it stopped writing to one of the bricks, so that
everything written through that FUSE mount had to be healed?
In a normal scenario, there shouldn't be any (or very few)
heals, right?
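If that theory is right, the pending-heal backlog should show the pile-up; normally it should be near zero. 'homegfs' below is our volume name, and the check is guarded so the sketch runs even where the gluster CLI is absent:

```shell
#!/bin/sh
# List entries pending heal on our volume; a large backlog would point
# to exactly the scenario above (one FUSE client dropped a brick).
if command -v gluster >/dev/null 2>&1; then
    heal_out=$(gluster volume heal homegfs info)
else
    heal_out="gluster CLI not present on this host"
fi
echo "$heal_out"
```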
Is there a better way to trace out this issue in the
future? Is there a way to figure out which mount is not
connected properly, or which mount is causing all of the heals?
Or, alternatively, is there a way to force all of the clients to
remount without going to each client and killing the glusterfs
process? That obviously becomes difficult when you have
thousands of clients connected.
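For what it's worth, the closest thing I have found so far is the per-brick client listing, which shows who is connected but not who is generating the heals (again guarded so the sketch runs where the CLI is absent):

```shell
#!/bin/sh
# Show the clients connected to each brick of our volume 'homegfs'.
if command -v gluster >/dev/null 2>&1; then
    gluster volume status homegfs clients
    status_rc=$?
else
    echo "gluster CLI not present on this host"
    status_rc=0
fi
```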