On Tue, Jan 12, 2010 at 11:21:14AM -0500, Evan Broder wrote:
> On Tue, Jan 12, 2010 at 3:54 AM, Christine Caulfield
> <ccaulfie@xxxxxxxxxx> wrote:
> > On 11/01/10 09:38, Christine Caulfield wrote:
> >>
> >> On 11/01/10 09:32, Evan Broder wrote:
> >>>
> >>> On Mon, Jan 11, 2010 at 4:03 AM, Christine Caulfield
> >>> <ccaulfie@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 08/01/10 22:58, Evan Broder wrote:
> >>>>>
> >>>>> [please preserve the CC when replying, thanks]
> >>>>>
> >>>>> Hi -
> >>>>> We're attempting to setup a clvm (2.02.56) cluster using OpenAIS
> >>>>> (1.1.1) and Corosync (1.1.2). We've gotten bitten hard in the past by
> >>>>> crashes leaving DLM state around and forcing us to reboot our nodes,
> >>>>> so we're specifically looking for a solution that doesn't involve
> >>>>> in-kernel locking.
> >>>>>
> >>>>> We're also running the Pacemaker OpenAIS service, as we're hoping to
> >>>>> use it for management of some other resources going forward.
> >>>>>
> >>>>> We've managed to form the OpenAIS cluster, and get clvmd running on
> >>>>> both of our nodes. Operations using LVM succeed, so long as only one
> >>>>> operation runs at a time. However, if we attempt to run two operations
> >>>>> (say, one lvcreate on each host) at a time, they both hang, and both
> >>>>> clvmd processes appear to deadlock.
> >>>>>
> >>>>> When they deadlock, it doesn't appear to affect the other clustering
> >>>>> processes - both corosync and pacemaker still report a fully formed
> >>>>> cluster, so it seems the issue is localized to clvmd.
> >>>>>
> >>>>> I've looked at logs from corosync and pacemaker, and I've straced
> >>>>> various processes, but I don't want to blast a bunch of useless
> >>>>> information at the list. What information can I provide to make it
> >>>>> easier to debug and fix this deadlock?
> >>>>>
> >>>>
> >>>> To start with, the best logging to produce is the clvmd logs which can
> >>>> be got with clvmd -d (see the man page for details). Ideally these
> >>>> should be from all nodes in the cluster so they can be correlated. If
> >>>> you're still using DLM then a dlm lock dump from all nodes is often
> >>>> helpful in conjunction with the clvmd logs.
> >>>
> >>> Sure, no problem. I've posted the logs from clvmd on both processes in
> >>> <http://web.mit.edu/broder/Public/clvmd/>. I've annotated them at a
> >>> few points with what I was doing - the annotations all start with
> >>> ">>>>> ", so they should be easy to spot.
> >
> >
> > Ironically it looks like a bug in the clvmd-openais code. I can reproduce
> > it on my systems here. I don't see the problem when using the dlm!
> >
> > Can you try -Icorosync and see if that helps? In the meantime I'll have a
> > look at the openais bits to try and find out what is wrong.
> >
> > Chrissie
> >
>
> I'll see what we can pull together, but the nodes running the clvm
> cluster are also Xen dom0's. They're currently running on (Ubuntu
> Hardy's) 2.6.24, so upgrading them to something new enough to support
> DLM 3 would be...challenging.
>
> It would be much, much better for us if we could get clvmd-openais working.
>
> Is there any chance this would work better if we dropped back to
> openais whitetank instead of corosync + openais wilson?

No. The LCK service in the wilson branch was only partially implemented
and contained a number of bugs. You'll need at least openais 1.0 to have
a functional LCK service.

Ryan

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
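
For anyone trying to reproduce the hang Evan describes and gather the
diagnostics Christine asks for, here is a rough sketch. The volume group
name (vg_shared), the log paths, and the init script name are placeholders,
and the exact clvmd and dlm_tool options vary between versions, so check
the man pages on your own systems before copying these commands.

    # On every node: stop the packaged clvmd (init script name varies by
    # distro), then run it in the foreground with debug output captured
    # to a file. clvmd -d writes its debug log to stderr; see clvmd(8)
    # for the debug options on your build.
    /etc/init.d/clvmd stop
    clvmd -d 2> /tmp/clvmd-$(hostname).log &

    # Reproduce the hang: start one lvcreate on each node at roughly the
    # same time (vg_shared is a placeholder clustered volume group).
    lvcreate -n test_$(hostname -s) -L 100M vg_shared

    # Only relevant for DLM-backed setups (cman or -Icorosync), not for
    # the openais LCK configuration above: dump the clvmd lockspace on
    # every node to go alongside the clvmd logs, e.g. with dlm_tool
    # from dlm3.
    dlm_tool lockdump clvmd > /tmp/dlm-$(hostname).dump

Correlating the per-node clvmd logs around the moment both lvcreate
commands stall is usually what shows which lock request never completes.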
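Christine's -Icorosync suggestion selects a different clvmd cluster
interface; as Evan's follow-up implies, that interface relies on the
kernel DLM for locking rather than the userspace openais LCK service,
which is why it would need a newer kernel on his Hardy dom0s. A minimal
sketch of trying it, assuming the clvmd binary was built with corosync
support and DLM is available:

    # On every node: stop the running clvmd, then restart it with the
    # corosync cluster interface instead of the autodetected one.
    /etc/init.d/clvmd stop
    clvmd -I corosync

    # Then retry the concurrent lvcreate test from both nodes.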