On 22/03/2013 00:34, Digimer wrote:
> On 03/21/2013 02:09 PM, Maurizio Giungato wrote:
>> On 21/03/2013 18:48, Maurizio Giungato wrote:
>>> On 21/03/2013 18:14, Digimer wrote:
>>>> On 03/21/2013 01:11 PM, Maurizio Giungato wrote:
>>>>> Hi guys,
>>>>>
>>>>> my goal is to create a reliable virtualization environment using
>>>>> CentOS 6.4 and KVM. I have three nodes and a clustered GFS2.
>>>>>
>>>>> The environment is up and working, but I'm worried about its
>>>>> reliability. If I turn the network interface down on one node to
>>>>> simulate a crash (for example on the node "node6.blade"):
>>>>>
>>>>> 1) GFS2 hangs (processes go into D state) until node6.blade gets
>>>>> fenced
>>>>> 2) not only node6.blade gets fenced, but also node5.blade!
>>>>>
>>>>> Help me save my last neurons!
>>>>>
>>>>> Thanks
>>>>> Maurizio
>>>>
>>>> DLM, the distributed lock manager provided by the cluster, is
>>>> designed to block when a node goes into an unknown state. It does
>>>> not unblock until that node is confirmed to be fenced. This is by
>>>> design. GFS2, rgmanager and clustered LVM all use DLM, so they will
>>>> all block as well.
>>>>
>>>> As for why two nodes get fenced, you will need to share more about
>>>> your configuration.
>>>>
>>> My configuration is very simple; I attached the cluster.conf and
>>> hosts files. This is the row I added in /etc/fstab:
>>>
>>> /dev/mapper/KVM_IMAGES-VL_KVM_IMAGES /var/lib/libvirt/images gfs2
>>> defaults,noatime,nodiratime 0 0
>>>
>>> I also set fallback_to_local_locking = 0 in lvm.conf (but nothing
>>> changed).
>>>
>>> PS: I had two virtualization environments working like a charm on
>>> OCFS2, but since CentOS 6.x I have not been able to install it. Is
>>> there some way to achieve the same results with GFS2? (With GFS2 I
>>> sometimes get a crash after only a "service network restart" [I have
>>> many interfaces, so this operation takes more than 10 seconds]; with
>>> OCFS2 I never had this problem.)
>>>
>>> Thanks
>> I attached my logs from /var/log/cluster/*
>
> The configuration itself seems ok, though I think you can safely take
> qdisk out to simplify things. That's neither here nor there though.
>
> This concerns me:
>
> Mar 21 19:00:14 fenced fence lama6.blade dev 0.0 agent
> fence_bladecenter result: error from agent
> Mar 21 19:00:14 fenced fence lama6.blade failed
>
> How are you triggering the failure(s)? The failed fence would
> certainly help explain the delays. As I mentioned earlier, DLM is
> designed to block when a node is in an unknown state (failed but not
> yet successfully fenced).
>
> As an aside; I do my HA VMs using clustered LVM LVs as the backing
> storage behind the VMs. GFS2 is an excellent file system, but it is
> expensive. Putting your VMs directly on the LVs takes them out of the
> equation.

I used 'service network stop' to simulate the failure; the node gets
fenced through fence_bladecenter (BladeCenter hardware).

Anyway, I took qdisk out, put GFS2 aside, and now have my VMs on LVM
LVs. I have been trying for many hours to reproduce the issue:

- only the node where I execute 'service network stop' gets fenced
- with fallback_to_local_locking = 0 in lvm.conf, the LVM LVs stay
  writable even while fencing takes place

Everything seems to work like a charm now. I'd like to understand what
was happening; I'll keep testing for some days before trusting it.

Thank you so much.
Maurizio
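
PS, for the archives: this is a sketch of the relevant lvm.conf bits on
my nodes (locking_type = 3 assumes clvmd is running, which is my setup;
check your own locking configuration before copying this):

    # /etc/lvm/lvm.conf (excerpt)
    global {
        # 3 = built-in clustered locking via clvmd
        locking_type = 3
        # never fall back silently to local locking when clvmd is
        # unreachable; failing loudly is safer on shared storage
        fallback_to_local_locking = 0
    }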
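In case it helps anyone who sees the same "error from agent" message:
the fence agent can be tested by hand, outside the cluster, with
something like the line below (the management module address,
credentials and blade number are placeholders, not my real values):

    # ask the BladeCenter management module for the blade's power state
    fence_bladecenter -a mm.example.com -l USERID -p PASSW0RD -n 6 -o status

If this fails from the command line, fenced will fail the same way, and
DLM will stay blocked until the fence finally succeeds.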
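And this is roughly how I moved the VMs onto LVs, following Digimer's
suggestion (the LV name and the shared LUN path below are examples, not
my exact ones):

    # one-time: mark the VG on the shared LUN as clustered, so clvmd
    # coordinates metadata changes across all nodes
    vgcreate -cy KVM_IMAGES /dev/mapper/shared_lun

    # one LV per VM, used directly as the guest's raw block device
    lvcreate -L 20G -n vm_node1 KVM_IMAGES

Each guest then points at /dev/KVM_IMAGES/vm_node1 as a raw disk, so no
cluster file system sits between the VM and the storage.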