On 02/23/2012 11:45 AM, Dan Bretherton wrote:
>> The main question is therefore why
>> we're losing connectivity to these servers.
> Could there be a hardware issue? I have replaced the network cables for
> the two servers but I don't really know what else to check. The network
> switch hasn't recorded any errors for those two ports. There isn't
> anything sinister in /var/log/messages.
>
> It seems a bit of a coincidence that both servers lost connection at
> exactly the same time. The only thing the users have started doing
> differently recently is processing a large number of small text files.
> There is one particular application they are running that processes this
> data, but the load on the Glusterfs servers doesn't go up when it is
> running.

It does seem like a weird coincidence. About the only thing I can think
of is that there's some combination of events that occurs on those two
servers but not the others. For example, what if there's some file that
happens to live on that replica pair, and which is accessed in some
particularly pathological way? I used to see something like that with
some astrophysics code that would try to open and truncate the same file
from each of a thousand nodes simultaneously each time it started.
Needless to say, this caused a few problems. ;) Maybe there's something
about this new job type that similarly "converges" on one file for
configuration, logging, something like that?
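
To make that access pattern concrete, here is a minimal Python sketch of many workers opening and truncating one shared file at startup. It is only an illustration, not the astrophysics code or the application in question: the function names, the worker count, and the use of threads on a single machine in place of separate client nodes are all assumptions, and it does not try to reproduce any GlusterFS behaviour.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def open_and_truncate(path):
    """Open `path` for writing, truncating it, as fopen(path, "w") would."""
    # O_TRUNC resets the file to zero length on every open.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"started\n")
    finally:
        os.close(fd)
    return True


def converge_on_one_file(path, workers=16):
    """Have `workers` threads open and truncate the same file concurrently.

    Threads here are a stand-in (assumption) for separate client nodes.
    Returns the number of workers that completed.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(open_and_truncate, [path] * workers))
    return sum(results)
```

Pointing `converge_on_one_file` at a path on a test mount while watching the servers would be one way to see whether this kind of convergence matters in a given setup; whether it has anything to do with the disconnects here is still a guess.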