On Tue, Nov 18, 2008 at 05:14:38PM +1030, Tom Lanyon wrote:
> On 15/11/2008, at 8:35 AM, David Teigland wrote:
>
> > On Fri, Nov 14, 2008 at 09:53:13PM +0000, Nuno Fernandes wrote:
> >>> On Fri, Nov 14, 2008 at 10:00:13AM +0000, Nuno Fernandes wrote:
> >>> dlm recovery appears to be stuck; this is usually due to a problem
> >>> at the network level. The recovery seems to be caused by a node
> >>> starting clvmd.
> >> Hi,
> >>
> >> I don't know if it helps, but groupd is using all available CPU, but
> >> only in 2 of the nodes.
> >
> > That sounds like https://bugzilla.redhat.com/show_bug.cgi?id=444529
> > which is fixed in 5.3. I suspect that's the cause of your problems.
> >
> > Dave
>
> We seem to be having the same problem on a 5 node virtual cluster
> where 3 of the nodes share a GFS mount.
>
> A backup script runs on one node which does some heavy reads + writes
> to this mount, at which point all three nodes jump to 100% CPU (90%
> iowait on the machine doing the backup, 100% system on the other two)
> and all LVM VGs, LVs and GFS mounts lock up.

Which process was using 100% CPU? If it was groupd, fenced, dlm_controld
or gfs_controld, then yes, it may be the same problem.

> Is there anything that could be tuned here to avoid this issue until a
> bug fix is released?

I don't think there's any way to avoid the bug in the bz I referenced.

Dave

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster