Jeff,

> The main question is therefore why we're losing connectivity to these
> servers.

Could there be a hardware issue? I have replaced the network cables for
the two servers but I don't really know what else to check. The network
switch hasn't recorded any errors for those two ports. There isn't
anything sinister in /var/log/messages. It seems a bit of a coincidence
that both servers lost connection at exactly the same time. The only
thing the users have started doing differently recently is processing a
large number of small text files. There is one particular application
they are running that processes this data, but the load on the Glusterfs
servers doesn't go up when it is running.
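Beyond swapping the cables and looking at the switch counters, the only
server-side checks I can think of are roughly these (eth0 is just a
stand-in for whichever interface the bricks actually use):

    # link state, speed and duplex as the driver sees them
    ethtool eth0

    # NIC error and drop counters - anything non-zero would point at hardware
    ethtool -S eth0 | grep -iE 'err|drop|crc'
    ip -s link show eth0

    # kernel messages about link flaps or NIC resets
    grep -i eth0 /var/log/messages

None of those have shown anything unusual so far either.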
-Dan

On 02/23/2012 02:41 PM, Jeff Darcy wrote:
> On 02/23/2012 08:58 AM, Dan Bretherton wrote:
>> It is reassuring to know that these errors are self repairing. That does
>> appear to be happening, but only when I run "find -print0 | xargs --null
>> stat > /dev/null" in affected directories.
> Hm. Then maybe the xattrs weren't *set* on that brick.
>
>> I will run that self-heal on the whole volume as well, but I have had to
>> start with specific directories that people want to work in today. Does
>> repeating the fix-layout operation have any effect, or are the xattr
>> repairs all done by the self-heal mechanism?
> AFAICT the DHT self-heal mechanism (not to be confused with the better
> known AFR self-heal mechanism) will take care of this. Running fix-layout
> would be redundant for those directories, but not harmful.
>
>> I have found the cause of the transient brick failure; it happened again
>> this morning on a replicated pair of bricks. Suddenly the
>> etc-glusterfs-glusterd.vol.log file was flooded with these messages every
>> few seconds.
>>
>> E [socket.c:2080:socket_connect] 0-management: connection attempt failed
>> (Connection refused)
>>
>> One of the clients then reported errors like the following.
>>
>> [2012-02-23 11:19:22.922785] E [afr-common.c:3164:afr_notify]
>> 2-atmos-replicate-3: All subvolumes are down. Going offline until atleast
>> one of them comes back up.
>> [2012-02-23 11:19:22.923682] I [dht-layout.c:581:dht_layout_normalize]
>> 0-atmos-dht: found anomalies in /. holes=1 overlaps=0
> Bingo. This is exactly how DHT subvolume #3 could "miss out" on a directory
> being created or updated, as seems to have happened.
>
>> [2012-02-23 11:19:22.923714] I [dht-selfheal.c:569:dht_selfheal_directory]
>> 0-atmos-dht: 1 subvolumes down -- not fixing
>>
>> [2012-02-23 11:19:22.941468] W [socket.c:1494:__socket_proto_state_machine]
>> 1-atmos-client-7: reading from socket failed. Error (Transport endpoint is
>> not connected), peer (192.171.166.89:24019)
>> [2012-02-23 11:19:22.972307] I [client.c:1883:client_rpc_notify]
>> 1-atmos-client-7: disconnected
>> [2012-02-23 11:19:22.972352] E [afr-common.c:3164:afr_notify]
>> 1-atmos-replicate-3: All subvolumes are down. Going offline until atleast
>> one of them comes back up.
>>
>> The servers causing trouble were still showing as Connected in "gluster
>> peer status" and nothing appeared to be wrong except for glusterd
>> misbehaving. Restarting glusterd solved the problem, but given that this
>> has happened twice this week already I am worried that it could happen
>> again at any time. Do you know what might be causing glusterd to stop
>> responding like this?
> The glusterd failures and the brick failures are likely to share a common
> cause, as opposed to one causing the other. The main question is therefore
> why we're losing connectivity to these servers. Secondarily, there might be
> a bug to do with the failure being seen in the I/O path but not in the peer
> path, but that's not likely to be the *essential* problem.
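P.S. Next time it happens I will try to confirm whether glusterd has
actually stopped listening before I restart it, roughly along these lines
("server1" standing in for one of the affected servers; 24007 is the
glusterd management port, and glusterfsd are the separate brick processes):

    # can the management port still be reached at all?
    telnet server1 24007

    # are glusterd and the brick processes still running on the server?
    ps ax | grep -E 'glusterd|glusterfsd'

    # restart the management daemon if it is wedged
    /etc/init.d/glusterd restart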