Hi,

On Fri, 2011-12-30 at 21:37 +0100, Stevo Slavić wrote:
> When the cables between the shared storage and foo01 are pulled, foo01 gets
> fenced. Here is some info from foo02 about the shared storage and the dlm
> debug output (the lock file seems to remain locked):
>
> root@foo02:/data/activemq_data# ls -li
> total 276
>  66467 -rw-r--r-- 1 root root 33030144 Dec 30 16:32 db-1.log
>  66468 -rw-r--r-- 1 root root    73728 Dec 30 16:24 db.data
>  66470 -rw-r--r-- 1 root root    53344 Dec 30 16:24 db.redo
> 128014 -rw-r--r-- 1 root root        0 Dec 30 19:49 dummy
>  66466 -rw-r--r-- 1 root root        0 Dec 30 16:23 lock
>
> root@foo02:/data/activemq_data# grep -A 7 -i 103a2 /debug/dlm/activemq
> Resource ffff81090faf96c0 Name (len=24) " 2 103a2"
> Master Copy
> Granted Queue
> 03d10002 PR Remote: 1 00c80001
> 00e00001 PR
> Conversion Queue
> Waiting Queue
> --
> Resource ffff81090faf97c0 Name (len=24) " 5 103a2"
> Master Copy
> Granted Queue
> 03c30003 PR Remote: 1 039a0001
> 03550001 PR
> Conversion Queue
> Waiting Queue
>
> Are there some docs for interpreting this dlm debug output?
>
Not as such, I think. It sounds like the issue is recovery related. Are
there any messages which indicate what might be going on? Once the failed
node has been fenced, recovery should proceed fairly soon afterwards,

Steve.

> Regards,
> Stevo.
>
> On Fri, Dec 30, 2011 at 9:23 PM, Digimer <linux@xxxxxxxxxxx> wrote:
> On 12/30/2011 03:08 PM, Stevo Slavić wrote:
> > Hi Digimer and Yvette,
> >
> > Thanks for the tips! I don't doubt the reliability of the technology,
> > I just want to make sure it is configured well.
> >
> > After fencing a node that held a lock on a file on shared storage, the
> > lock remains, and the non-fenced node cannot take over the lock on that
> > file. I am wondering how one can check which process (from which node,
> > if possible) is holding a lock on a file on shared storage.
> > dlm should have taken care of releasing the lock once the node got
> > fenced, right?
> >
> > Regards,
> > Stevo.
>
> After a successful fence call, DLM will clean up any locks held by the
> lost node. That's why it's so critical that the fence action succeeds
> (ie: test-test-test). If a node doesn't actually die in a fence, but the
> cluster thinks it did, and somehow the lost node returns, the lost node
> will think its locks are still valid and modify shared storage, leading
> to near-certain data corruption.
>
> It's all perfectly safe, provided you've tested your fencing properly. :)
>
> Yvette,
>
> You might be right on 'noatime' implying 'nodiratime'... I add both out
> of habit.
>
> --
> Digimer
> E-Mail:              digimer@xxxxxxxxxxx
> Freenode handle:     digimer
> Papers and Projects: http://alteeve.com
> Node Assassin:       http://nodeassassin.org
> "omg my singularity battery is dead again.
> stupid hawking radiation." - epitron
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
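
For what it's worth, here is a minimal sketch of how to map a file on the GFS2
mount to its entries in the dlm debug output, which is presumably how the grep
above was arrived at: the resource names embed the inode number in hex, and
0x103a2 is inode 66466, the "lock" file in the listing. The path and lockspace
name (/data/activemq_data/lock, /debug/dlm/activemq) are taken from the output
above and will differ on other setups.

    # Find the dlm resources for a given file on the GFS2 mount.
    INODE=$(stat -c %i /data/activemq_data/lock)   # inode number in decimal (66466 above)
    HEX=$(printf '%x' "$INODE")                    # same inode in hex (103a2)
    grep -A 7 -i "$HEX" /debug/dlm/activemq        # dump the matching dlm resources

The "Remote: 1" in the granted queue is the DLM node id of the lock holder, so
on a cman-based cluster it can be matched against the node ids reported by
cman_tool nodes to see which cluster node still holds the PR lock.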