A lot more than 128 clients. Well over 1000. And I believe we may have found the problem; it looks like you were headed in the right direction, as it appears to be a problem with one of the clients' FUSE mounts.
When we couldn't resolve the issue, I started moving all of my users off of the gluster storage system, as it was no longer responsive. Once they were off, I tried to kill all of the clients that had homegfs mounted by doing a 'killall glusterfs' on every machine connected to gluster. On one machine, even after killing all of the glusterfs processes and verifying that none were still running, 'mount' still showed the FUSE mount. It only went away after a 'umount -lf /homegfs'.
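For reference, the per-client cleanup we ended up doing by hand could be scripted roughly like this. This is only a sketch: the mount point /homegfs is from our setup, and the host list and passwordless ssh (BatchMode) are assumptions.

```shell
#!/bin/sh
# reset_fuse_clients.sh -- hypothetical helper, not part of gluster.
# Kills any glusterfs client processes on each host and lazy-unmounts
# the FUSE mount if 'mount' still lists it afterwards.
MNT=/homegfs    # mount point from our setup

if [ $# -eq 0 ]; then
    # nothing to do without a host list
    echo "no hosts given; nothing to do"
    exit 0
fi

for host in "$@"; do
    ssh -o BatchMode=yes "$host" "
        killall glusterfs 2>/dev/null
        sleep 2
        # if the mount table still shows the FUSE mount, force a lazy unmount
        if mount | grep -q ' $MNT '; then
            umount -lf $MNT
        fi
    "
done
```

The lazy/force flags ('-lf') detach the mount even when the filesystem is wedged, which is what finally cleared our stuck client.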
After I killed the client mounts and restarted all of them, we haven't had any more issues with out-of-control loads on the storage systems. We had seen this once before with a runaway FUSE mount; that time we found the culprit by looking at the load on all of the clients. The one problem node had an extremely high load, well out of the norm, and resetting its FUSE mount cleared the problem. This time there was no indication of which client was causing the issue, and the only way to recover was to take the storage system out of production use.
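The load check that found the culprit the first time is easy to automate. Again just a sketch; the clients.txt file (one hostname per line) and passwordless ssh are assumptions:

```shell
#!/bin/sh
# rank_client_loads.sh -- hypothetical helper: print clients sorted by
# 1-minute load average, highest first, so a runaway node stands out.
LIST=${1:-clients.txt}   # assumed file: one client hostname per line

if [ ! -r "$LIST" ]; then
    echo "no client list found; nothing to do"
    exit 0
fi

while read -r host; do
    # first field of /proc/loadavg is the 1-minute load average
    load=$(ssh -o BatchMode=yes "$host" "cut -d' ' -f1 /proc/loadavg" 2>/dev/null)
    printf '%s\t%s\n' "${load:-unreachable}" "$host"
done < "$LIST" | sort -rn | head -n 10
```

On a healthy fleet the numbers cluster together; the bad node we had was an order of magnitude above the rest, so the top of this list would have pointed straight at it.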
My understanding is that the FUSE client writes to both bricks of a replica pair at the same time. Does it make sense that it stopped writing to one of the bricks, so that everything written through that FUSE mount had to be healed? In a normal scenario there shouldn't be any (or very few) heals, right?
Is there a better way to trace out this issue in the future? Is there a way to figure out which mount is not connected properly, or which mount is causing all of the heals? Alternatively, is there a way to force all of the clients to remount without going to each client and killing its glusterfs process? That obviously becomes difficult when you have thousands of clients connected.
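In case it helps others on the list, these are the server-side commands I know of for this kind of digging, run on one of the storage nodes ('homegfs' is our volume name; if I understand the CLI right, 'status ... clients' should show which client is missing from one brick but present on its replica):

```shell
#!/bin/sh
# Diagnostic sketch: run on a gluster server, not a client.
VOL=homegfs    # our volume name

if ! command -v gluster >/dev/null 2>&1; then
    echo "gluster CLI not found; run this on a storage server"
    exit 0
fi

# List the clients connected to each brick. A client that shows up on
# one brick of a replica pair but not the other is a likely culprit.
gluster volume status "$VOL" clients

# Show the files pending heal, and a per-brick count of them; a flood
# of entries on one brick points at writes that only reached its peer.
gluster volume heal "$VOL" info
gluster volume heal "$VOL" statistics heal-count
```

That still doesn't map a heal back to the client that caused it, which is really what I'm after, so I'd welcome better ideas.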