On Tue, Jan 12, 2010 at 11:21:14AM -0500, Evan Broder wrote:
> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
> <ccaulfie@xxxxxxxxxx> wrote:
> > On 11/01/10 09:38, Christine Caulfield wrote:
> >>
> >> On 11/01/10 09:32, Evan Broder wrote:
> >>>
> >>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
> >>> <ccaulfie@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 08/01/10 22:58, Evan Broder wrote:
> >>>>>
> >>>>> [please preserve the CC when replying, thanks]
> >>>>>
> >>>>> Hi -
> >>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
> >>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
> >>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
> >>>>> so we're specifically looking for a solution that doesn't involve
> >>>>> in-kernel locking.
> >>>>>
> >>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
> >>>>> use it for management of some other resources going forward.
> >>>>>
> >>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
> >>>>> both of our nodes. Operations using LVM succeed, so long as only one
> >>>>> operation runs at a time. However, if we attempt to run two operations
> >>>>> (say, one lvcreate on each host) at a time, they both hang, and both
> >>>>> clvmd processes appear to deadlock.
> >>>>>
> >>>>> When they deadlock, it doesn't appear to affect the other clustering
> >>>>> processes - both corosync and pacemaker still report a fully formed
> >>>>> cluster, so it seems the issue is localized to clvmd.
> >>>>>
> >>>>> I've looked at logs from corosync and pacemaker, and I've straced
> >>>>> various processes, but I don't want to blast a bunch of useless
> >>>>> information at the list. What information can I provide to make it
> >>>>> easier to debug and fix this deadlock?
> >>>>>
> >>>>
> >>>> To start with, the best logging to produce is the clvmd logs which can
> >>>> be got with clvmd -d (see the man page for details). Ideally these
> >>>> should be from all nodes in the cluster so they can be correlated. If
> >>>> you're still using DLM then a dlm lock dump from all nodes is often
> >>>> helpful in conjunction with the clvmd logs.
> >>>
> >>> Sure, no problem. I've posted the logs from clvmd on both processes in
> >>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
> >>> few points with what I was doing - the annotations all start with
> >>> ">>>>> ", so they should be easy to spot.
> >
> >
> > Ironically it looks like a bug in the clvmd-openais code. I can reproduce
> > it on my systems here. I don't see the problem when using the dlm!
> >
> > Can you try -Icorosync and see if that helps? In the meantime I'll have a
> > look at the openais bits to try and find out what is wrong.
> >
> > Chrissie
> >
>
> I'll see what we can pull together, but the nodes running the clvm
> cluster are also Xen dom0's. They're currently running on (Ubuntu
> Hardy's) 2.6.24, so upgrading them to something new enough to support
> DLM 3 would be...challenging.
>
> It would be much, much better for us if we could get clvmd-openais working.
>
> Is there any chance this would work better if we dropped back to
> openais whitetank instead of corosync + openais wilson?

No. The LCK service in the wilson branch was only partially implemented
and contained a number of bugs. You'll need at least openais 1.0 to have
a functional LCK service.

Ryan

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
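
For anyone trying to reproduce the hang Evan describes and gather the
diagnostics Christine asks for, here is a rough sketch. The volume group
name (vg_shared), the log paths, and the init script name are placeholders,
and the exact clvmd and dlm_tool options vary between versions, so check
the man pages on your own systems before copying these commands.

    # On every node: stop the packaged clvmd (init script name varies by
    # distro), then run it in the foreground with debug output captured
    # to a file. clvmd -d writes its debug log to stderr; see clvmd(8)
    # for the debug options on your build.
    /etc/init.d/clvmd stop
    clvmd -d 2> /tmp/clvmd-$(hostname).log &

    # Reproduce the hang: start one lvcreate on each node at roughly the
    # same time (vg_shared is a placeholder clustered volume group).
    lvcreate -n test_$(hostname -s) -L 100M vg_shared

    # Only relevant for DLM-backed setups (cman or -Icorosync), not for
    # the openais LCK configuration above: dump the clvmd lockspace on
    # every node to go alongside the clvmd logs, e.g. with dlm_tool
    # from dlm3.
    dlm_tool lockdump clvmd > /tmp/dlm-$(hostname).dump

Correlating the per-node clvmd logs around the moment both lvcreate
commands stall is usually what shows which lock request never completes.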
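Christine's -Icorosync suggestion selects a different clvmd cluster
interface; as Evan's follow-up implies, that interface relies on the
kernel DLM for locking rather than the userspace openais LCK service,
which is why it would need a newer kernel on his Hardy dom0s. A minimal
sketch of trying it, assuming the clvmd binary was built with corosync
support and DLM is available:

    # On every node: stop the running clvmd, then restart it with the
    # corosync cluster interface instead of the autodetected one.
    /etc/init.d/clvmd stop
    clvmd -I corosync

    # Then retry the concurrent lvcreate test from both nodes.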