On 02/23/2012 08:58 AM, Dan Bretherton wrote:
> It is reassuring to know that these errors are self repairing. That does
> appear to be happening, but only when I run "find -print0 | xargs --null stat
> >/dev/null" in affected directories.

Hm. Then maybe the xattrs weren't *set* on that brick.

> I will run that self-heal on the whole
> volume as well, but I have had to start with specific directories that people
> want to work in today. Does repeating the fix-layout operation have any
> effect, or are the xattr repairs all done by the self-heal mechanism?

AFAICT the DHT self-heal mechanism (not to be confused with the better known
AFR self-heal mechanism) will take care of this. Running fix-layout would be
redundant for those directories, but not harmful.

> I have found the cause of the transient brick failure; it happened again this
> morning on a replicated pair of bricks. Suddenly the
> etc-glusterfs-glusterd.vol.log file was flooded with these messages every few
> seconds.
>
> E [socket.c:2080:socket_connect] 0-management: connection attempt failed
> (Connection refused)
>
> One of the clients then reported errors like the following.
>
> [2012-02-23 11:19:22.922785] E [afr-common.c:3164:afr_notify]
> 2-atmos-replicate-3: All subvolumes are down. Going offline until atleast one
> of them comes back up.
> [2012-02-23 11:19:22.923682] I [dht-layout.c:581:dht_layout_normalize]
> 0-atmos-dht: found anomalies in /. holes=1 overlaps=0

Bingo. This is exactly how DHT subvolume #3 could "miss out" on a directory
being created or updated, as seems to have happened.

> [2012-02-23 11:19:22.923714] I [dht-selfheal.c:569:dht_selfheal_directory]
> 0-atmos-dht: 1 subvolumes down -- not fixing
>
> [2012-02-23 11:19:22.941468] W [socket.c:1494:__socket_proto_state_machine]
> 1-atmos-client-7: reading from socket failed. Error (Transport endpoint is not
> connected), peer (192.171.166.89:24019)
> [2012-02-23 11:19:22.972307] I [client.c:1883:client_rpc_notify]
> 1-atmos-client-7: disconnected
> [2012-02-23 11:19:22.972352] E [afr-common.c:3164:afr_notify]
> 1-atmos-replicate-3: All subvolumes are down. Going offline until atleast one
> of them comes back up.
>
> The servers causing trouble were still showing as Connected in "gluster peer
> status" and nothing appeared to be wrong except for glusterd misbehaving.
> Restarting glusterd solved the problem, but given that this has happened twice
> this week already I am worried that it could happen again at any time. Do you
> know what might be causing glusterd to stop responding like this?

The glusterd failures and the brick failures are likely to share a common
cause, as opposed to one causing the other. The main question is therefore
why we're losing connectivity to these servers. Secondarily, there might be
a bug to do with the failure being seen in the I/O path but not in the peer
path, but that's not likely to be the *essential* problem.
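Next time the connection-refused flood starts, it would be worth checking
whether glusterd and the brick processes on the servers involved are actually
still listening before you restart anything. Something along these lines
(plain ps/netstat, nothing Gluster-specific) should tell you whether the far
end really went away or whether something in between is rejecting connections:

  # run as root on each server involved
  ps ax | grep -v grep | grep gluster
  # glusterd normally listens on 24007, the brick processes on 24009 and up
  netstat -tlnp | grep gluster

If the processes are all there and listening, that points back at the network
(or a firewall) between the machines; if glusterd or a brick process is gone
or no longer listening, its log from around that time is the place to look.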
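Going back to the layout question: if you want to confirm that the
stat-triggered self-heal really did write the missing xattrs, you can look at
the trusted.glusterfs.dht attribute directly on the bricks. The path below is
just an example -- substitute one of your actual brick directories:

  # run as root on each server, against the brick's copy of an affected directory
  getfattr -m . -d -e hex /data/atmos-brick/path/to/directory

After a successful heal, every brick should show a trusted.glusterfs.dht value
for that directory; a brick where it's missing is roughly what shows up as
"holes=1" in that dht_layout_normalize message.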