On Tue, 2009-10-20 at 10:07 +0100, Steven Whitehouse wrote:
> Hi,
>
> On Mon, 2009-10-19 at 16:30 -0600, Kai Meyer wrote:
> > Ok, so our lab test results have turned up some fun events.
> >
> > Firstly, we were able to duplicate the invalid metadata block exactly
> > under the following circumstances:
> >
> > We wanted to monkey with the VLAN that fenced/openais ran on. We
> > failed miserably, causing all three of my test nodes to believe they
> > had become lone islands in the cluster, unable to get enough votes
> > themselves to fence anybody. So we chose to simply power cycle the
> > nodes without trying to gracefully leave the cluster or reboot (they
> > are diskless servers with NFS root filesystems, so the GFS2
> > filesystem was the only thing we were risking corrupting). After the
> > nodes came back online, we began to see the same random reboots and
> > filesystem withdraws within 24 hours. The filesystem that went into
> > production and eventually hit these errors was likely not
> > reformatted just before being put into production, and I believe it
> > is highly likely that the last format done on that production
> > filesystem was done while we were still testing. I hope that as we
> > continue in our lab we can reproduce the same circumstances and give
> > you a step-by-step that will cause this issue. It'll make me feel
> > much better about our current GFS2 filesystem, which was created and
> > unmounted cleanly by a single node, then put straight into
> > production, and has been mounted only once by our current production
> > servers since it was formatted.
> >
> That's very interesting information. We are not there yet, but there
> are a number of useful hints in it. Any further information you are
> able to gather would be very interesting.
>
> > Secondly, given the way our VMs are doing I/O, we have found that
> > the cluster.conf configuration settings:
> > <dlm plock_ownership="1" plock_rate_limit="0"/>
> > <gfs_controld plock_rate_limit="0"/>
> > have lowered our %wa times from ~60% to ~30% utilization. I am
> > curious why the locking daemon defaults to such a low rate limit
> > (100). Adding these two parameters to cluster.conf raised our locks
> > per second with the ping_pong binary from 93 to 3000+ in our 5-node
> > cluster. Our throughput doesn't seem to improve by either raising
> > the locking limit or setting up jumbo frames, but processes spend
> > much less time in an I/O wait state than before (if my munin graphs
> > are believable). How likely is it that the low locking rate had a
> > hand in causing the filesystem withdraws and 'invalid metadata
> > block' errors?
> >
> I think there would be an argument for setting the default rate limit
> to 0 (i.e. off), since we seem to spend so much time telling people
> to turn off this particular feature. The reason it was added is that
> under certain circumstances it is possible to flood the network with
> plock requests, blocking openais traffic (so the cluster thinks it's
> been partitioned).
>
> I've not seen or heard of any recent reports of this, though; that is
> the original reason the feature was added. Most applications tend to
> be I/O bound rather than (fcntl) lock bound anyway, so the chances of
> it being a problem are fairly slim.
>
The reason the rate limiting was added is that the IPC system in the
original openais in FC6/RHEL 5.0 would disconnect heavy users of IPC
connections, triggering a fencing operation on the node. That problem
has been resolved since 5.3.z (also F11+).
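For anyone else wondering where those knobs live: both elements sit
directly under <cluster> in /etc/cluster/cluster.conf. A minimal sketch
(the cluster name, config_version and node details are placeholders;
remember to bump config_version and propagate the new file, e.g. with
ccs_tool update on RHEL 5):

<?xml version="1.0"?>
<cluster name="example" config_version="42">
        <!-- dlm_controld services plocks on current clusters,
             gfs_controld on older ones, hence both elements -->
        <dlm plock_ownership="1" plock_rate_limit="0"/>
        <gfs_controld plock_rate_limit="0"/>
        <clusternodes>
                <!-- existing node definitions, unchanged -->
        </clusternodes>
</cluster>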
> Setting jumbo frames won't help, as the issue is one of latency
> rather than throughput (performance-wise). Using a low-latency
> interconnect in the cluster should help fcntl lock performance,
> though.
>
Jumbo frames reduce latency AND increase throughput from origination to
delivery for heavy message traffic. For very light message traffic,
latency is increased but throughput is still improved.

> The locking rate should have no bearing on the filesystem itself. The
> locking (this refers to fcntl locks only, btw) is performed in
> userspace by dlm_controld (gfs_controld on older clusters) and merely
> passed through the filesystem. The fcntl code is identical between
> gfs1 and gfs2.
>
> > I'm still not completely confident I won't see this happen again on
> > my production servers. I'm hoping you can help me with that.
> >
> Yes, so am I :-) It sounds like we are making progress if we can
> reduce the search space for the problem, and it sounds very much from
> your message as if you believe it is a recovery issue, which sounds
> plausible to me.
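If anyone wants to reproduce Kai's lock-rate measurement without
tracking down the ping_pong binary, the loop below is a rough sketch of
the same idea (hypothetical code, not the actual ping_pong source).
Each F_SETLKW on a GFS2 file becomes a plock request serviced in
userspace by dlm_controld (gfs_controld on older clusters), so the
printed rate is roughly the plock throughput that the rate limit caps.

/*
 * plock_rate.c -- illustrative only: acquire and release an fcntl
 * byte-range lock in a tight loop and report the rate.
 * Build: cc -o plock_rate plock_rate.c
 * Run:   ./plock_rate /path/to/file-on-gfs2   (on each node)
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file-on-gfs2>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    struct flock fl = { .l_whence = SEEK_SET, .l_start = 0, .l_len = 1 };
    long count = 0;
    time_t start = time(NULL);

    for (;;) {
        fl.l_type = F_WRLCK;                 /* take the lock ...   */
        if (fcntl(fd, F_SETLKW, &fl) < 0) {
            perror("fcntl(F_SETLKW)");
            return 1;
        }
        fl.l_type = F_UNLCK;                 /* ... and release it  */
        fcntl(fd, F_SETLKW, &fl);

        if (++count % 1000 == 0) {
            time_t elapsed = time(NULL) - start;
            if (elapsed > 0)
                printf("%ld locks/sec\n", count / elapsed);
        }
    }
}

Run against the same file from each node; with the default
plock_rate_limit of 100 the numbers should be in the same ballpark as
the ~93 locks/sec Kai reported, and with the limit set to 0 they should
jump into the thousands.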