On 02/23/2012 08:58 AM, Dan Bretherton wrote:
> It is reassuring to know that these errors are self repairing. That does
> appear to be happening, but only when I run "find -print0 | xargs --null stat
> >/dev/null" in affected directories.

Hm. Then maybe the xattrs weren't *set* on that brick.

> I will run that self-heal on the whole
> volume as well, but I have had to start with specific directories that people
> want to work in today. Does repeating the fix-layout operation have any
> effect, or are the xattr repairs all done by the self-heal mechanism?

AFAICT the DHT self-heal mechanism (not to be confused with the better known
AFR self-heal mechanism) will take care of this. Running fix-layout would be
redundant for those directories, but not harmful.

> I have found the cause of the transient brick failure; it happened again this
> morning on a replicated pair of bricks. Suddenly the
> etc-glusterfs-glusterd.vol.log file was flooded with these messages every few
> seconds.
>
> E [socket.c:2080:socket_connect] 0-management: connection attempt failed
> (Connection refused)
>
> One of the clients then reported errors like the following.
>
> [2012-02-23 11:19:22.922785] E [afr-common.c:3164:afr_notify]
> 2-atmos-replicate-3: All subvolumes are down. Going offline until atleast one
> of them comes back up.
> [2012-02-23 11:19:22.923682] I [dht-layout.c:581:dht_layout_normalize]
> 0-atmos-dht: found anomalies in /. holes=1 overlaps=0

Bingo. This is exactly how DHT subvolume #3 could "miss out" on a directory
being created or updated, as seems to have happened.

> [2012-02-23 11:19:22.923714] I [dht-selfheal.c:569:dht_selfheal_directory]
> 0-atmos-dht: 1 subvolumes down -- not fixing
>
> [2012-02-23 11:19:22.941468] W [socket.c:1494:__socket_proto_state_machine]
> 1-atmos-client-7: reading from socket failed. Error (Transport endpoint is not
> connected), peer (192.171.166.89:24019)
> [2012-02-23 11:19:22.972307] I [client.c:1883:client_rpc_notify]
> 1-atmos-client-7: disconnected
> [2012-02-23 11:19:22.972352] E [afr-common.c:3164:afr_notify]
> 1-atmos-replicate-3: All subvolumes are down. Going offline until atleast one
> of them comes back up.
>
> The servers causing trouble were still showing as Connected in "gluster peer
> status" and nothing appeared to be wrong except for glusterd misbehaving.
> Restarting glusterd solved the problem, but given that this has happened twice
> this week already I am worried that it could happen again at any time. Do you
> know what might be causing glusterd to stop responding like this?

The glusterd failures and the brick failures are likely to share a common
cause, as opposed to one causing the other. The main question is therefore
why we're losing connectivity to these servers. Secondarily, there might be
a bug to do with the failure being seen in the I/O path but not in the peer
path, but that's not likely to be the *essential* problem.
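Next time the connection-refused flood starts, it would be worth checking
whether glusterd and the brick processes on the servers involved are actually
still listening before you restart anything. Something along these lines
(plain ps/netstat, nothing Gluster-specific) should tell you whether the far
end really went away or whether something in between is rejecting connections:

  # run as root on each server involved
  ps ax | grep -v grep | grep gluster
  # glusterd normally listens on 24007, the brick processes on 24009 and up
  netstat -tlnp | grep gluster

If the processes are all there and listening, that points back at the network
(or a firewall) between the machines; if glusterd or a brick process is gone
or no longer listening, its log from around that time is the place to look.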
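Going back to the layout question: if you want to confirm that the
stat-triggered self-heal really did write the missing xattrs, you can look at
the trusted.glusterfs.dht attribute directly on the bricks. The path below is
just an example -- substitute one of your actual brick directories:

  # run as root on each server, against the brick's copy of an affected directory
  getfattr -m . -d -e hex /data/atmos-brick/path/to/directory

After a successful heal, every brick should show a trusted.glusterfs.dht value
for that directory; a brick where it's missing is roughly what shows up as
"holes=1" in that dht_layout_normalize message.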