Hi,

On Tue, 2013-11-26 at 13:13 +0200, Vladimir Melnik wrote:
> On Tue, Nov 26, 2013 at 09:59:34AM +0000, Steven Whitehouse wrote:
> > Looking at the logs, it looks like recovery has got stuck for one of
> > the nodes, since the log is complaining that it has taken a long time
> > for kslowd to run. So that suggests that the other node is currently
> > fenced, and only one node is working anyway. If that is not the case,
> > then something has got rather confused somehow. What kind of fencing
> > is in use here?
>
> Thank you very much, Steven!
>
> I have to say that it doesn't look like it's fenced:
>
> Node  Sts   Inc    Joined               Name
>    1   M    364    2013-11-11 07:39:22  ***
>    2   M    388    2013-11-26 03:43:01  ***
>
> Or shall I check somewhere else? Sorry if this question is a bit dumb.
>
Well, the logs appear to suggest that at least one of the nodes has been
fenced at some stage.

> > I also noticed that gfs2_quotad was complaining too - that tends to be
> > the first thing to complain when it cannot make progress. It is used
> > for both statfs and quota, so it runs periodically even when quotas
> > are not in use. So that is just an indicator that things are slow, and
> > the cause is most likely to be elsewhere.
> > The other question is what caused the node to try to fence the other
> > one in the first place? That is not immediately clear from the logs.
>
> It seems that it has happened due to some traffic congestion.
>
Do you have separate networks for cluster traffic and other traffic? That
will help to prevent this kind of thing. You can also use tc to ensure
that the cluster traffic will always get through (see the sketch below).

> > However, you may well have to reboot one or more nodes in order to
> > clear this condition, depending on exactly what the problem is.
>
> That's what I'd love to avoid. :) I can't reboot the nodes, so I need
> to find out how to restart GFS2 without a reboot. Is that even possible?
>
Well, that depends on what the problem is. I'm not sure that we've really
got to the bottom of what's going wrong at the moment.

> I have several processes in the "D" state on both nodes, so I
> understand that I can't simply unmount the stalled filesystem.
>
> > I did spot a note in the logs about the connection to the storage
> > being lost, and that would certainly be enough to cause a problem on
> > whichever node lost access. Are you running qdisk on that iSCSI
> > storage? It would help if you could post your configuration.
>
> No, the storage itself is not part of the cluster; it is just an iSCSI
> target for the two nodes. Is that a bad idea?
>
> Thank you.
>
That is perfectly OK, provided it is working and accessible at all times,

Steve.
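
As an illustration of the tc suggestion above, here is a minimal sketch.
It assumes the cluster heartbeat (cman/corosync totem) uses the default
UDP port 5405 and that eth0 is the interface shared by cluster and other
traffic; adjust the port and device to match your network. Note that this
only prioritises packets queued on the local interface, so it helps when
the host's own uplink is saturated, not when an intermediate switch is
overloaded.

  # attach a 3-band priority qdisc to the shared interface
  tc qdisc add dev eth0 root handle 1: prio bands 3

  # put cluster traffic (UDP, destination port 5405) into the
  # highest-priority band so it is dequeued before anything else
  tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
      match ip protocol 17 0xff \
      match ip dport 5405 0xffff \
      flowid 1:1

  # everything else falls through to the lower-priority bands
  # according to the default priomap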