Failed rebalance resulting in major problems

jdarcy at redhat.com (Jeff Darcy) · Mon, 11 Nov 2013 14:33:20 -0500

On 11/11/2013 02:15 PM, Shawn Heisey wrote:
> Is this possibly a result of my split-network architecture?  I have
> a total of six gluster peers. The four servers with bricks have two
> networks, both gigabit - a back-end network where they can talk to
> each other, and a network (with a default gateway) where they can
> talk to the other two peers.  Name resolution for gluster on those
> machines is done via hosts files that override DNS.  The hosts files
> use the back-end network, DNS uses the other network.
>
> The other two peers have no bricks, but act as NFS/CIFS entry points
> from the rest of the network - network access servers. Their name
> resolution is all DNS.  Those NAS servers also have a number of
> other network cards in them so that various networks can reach the
> storage without traversing our central firewall and overloading it.

There's nothing about a split-network configuration like yours that
would cause something like this *by itself*, but anything that creates
greater complexity also creates new possibilities for something to go
wrong.  Just to be safe, if I were you, I'd double- and triple-check the
DNS and /etc/hosts configurations on all machines to make sure some tiny
error didn't creep in.  If your bricks are at the same paths on each
machine, it would be possible for a machine to think it's connecting to
one brick and actually end up connecting to another.  I haven't even
been able to think through all of the ramifications, but just thinking
about how that might affect rebalance makes me a bit queasy.
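
A minimal sketch of the kind of check suggested above, to be run on each
machine in turn. It is an illustration, not a Gluster tool: the function
name, the hostnames and the addresses are all hypothetical, and it assumes
that socket.gethostbyname() goes through the system resolver (and so
honours /etc/hosts where the host is configured that way). It flags peer
names that resolve to an unexpected address, and cases where two names
land on the same address, which is the situation where one brick could be
mistaken for another.

```python
import socket
from collections import defaultdict


def check_resolution(expected, resolve=socket.gethostbyname):
    """Compare what each peer name resolves to on this machine against
    the address it is supposed to have.

    expected: dict mapping hostname -> expected IPv4 address (hypothetical
              values supplied by the administrator).
    resolve:  lookup function; defaults to the system resolver and can be
              replaced for testing.
    Returns a list of human-readable problem descriptions.
    """
    problems = []
    seen = defaultdict(list)  # resolved address -> names that map to it

    for host, want in sorted(expected.items()):
        try:
            got = resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems.append(f"{host}: lookup failed ({exc})")
            continue
        seen[got].append(host)
        if got != want:
            problems.append(f"{host}: resolves to {got}, expected {want}")

    # Two peer names on one address means a client could reach the
    # wrong brick when brick paths are identical on every server.
    for addr, hosts in sorted(seen.items()):
        if len(hosts) > 1:
            problems.append(f"{addr}: shared by {', '.join(hosts)}")

    return problems
```

On a brick server the expected table would hold the back-end addresses
from the hosts file; on the NAS peers it would hold the addresses that DNS
is supposed to return. An empty result on every machine means the names
at least resolve the way each machine's table says they should.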