Problem with "<emerg> #1: Quorum Dissolved"

Hi, 

During some tests I got errors like "<emerg> #1: Quorum Dissolved" ...

My cluster has 6 nodes that are virtual machines running on 2 physical
nodes. On each physical node there are 3 virtual machines:
  Member Name                 ID   Status
  ------ ----                 ---- ------
  w2.local                    1    Online, Local, rgmanager
  w1.local                    2    Online, rgmanager

  Service Name         Owner (Last)         State
  ------- ----         ----- ------         -----
  vm:VM_Work11_RHEL51  w1.local             started
  vm:VM_Work12_RHEL51  w1.local             started
  vm:VM_Work13_RHEL51  w1.local             started
  vm:VM_Work21_RHEL51  w2.local             started
  vm:VM_Work22_RHEL51  w2.local             started
  vm:VM_Work23_RHEL51  w2.local             started


On the 6-node cluster I am running 2 httpd services (each in a
restricted failover domain).
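For reference, a restricted failover domain of this kind is declared in cluster.conf roughly as below. This is only an illustrative sketch, not my exact configuration; the domain and resource names here are made up:

```xml
<rm>
  <failoverdomains>
    <!-- restricted="1": the service may only run on the listed members -->
    <failoverdomain name="fd_httpd_w11" restricted="1" ordered="0">
      <failoverdomainnode name="w11.local" priority="1"/>
    </failoverdomain>
  </failoverdomains>
  <!-- service bound to the restricted domain above -->
  <service name="httpd_w11" domain="fd_httpd_w11" autostart="1">
    <script ref="httpd"/>
  </service>
</rm>
```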

Member Status: Quorate

  Member Name                 ID   Status
  ------ ----                 ---- ------
  w11.local                   1    Online, rgmanager
  w12.local                   2    Online, rgmanager
  w13.local                   3    Online, rgmanager
  w21.local                   4    Online, Local, rgmanager
  w22.local                   5    Online, rgmanager
  w23.local                   6    Online, rgmanager
  /dev/xvdd1                  0    Online, Quorum Disk

  Service Name         Owner (Last)         State
  ------- ----         ----- ------         -----
  service:httpd_w11    w11.local            started
  service:httpd_w21    w21.local            started

After shutting down the w11.local node this cluster should run normally
because there is still quorum (the quorum device has 5 votes). The
httpd_w11 service should be down but the httpd_w21 service should be up
(the w21.local node is running). That does not happen.
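The vote arithmetic I expect can be sketched as follows (a minimal sketch, assuming each node carries 1 vote, the qdisk carries 5 votes as stated above, and quorum is computed as expected_votes/2 + 1, which is the usual CMAN rule):

```python
# Vote arithmetic for this cluster (assumed values, adjust to cluster.conf):
# 6 nodes with 1 vote each, plus the quorum disk /dev/xvdd1 worth 5 votes.

NODE_VOTES = 1
NUM_NODES = 6
QDISK_VOTES = 5

expected_votes = NUM_NODES * NODE_VOTES + QDISK_VOTES  # 6*1 + 5 = 11
quorum = expected_votes // 2 + 1                       # 11//2 + 1 = 6

def quorate(nodes_online: int) -> bool:
    """Does the cluster keep quorum with this many nodes online plus the qdisk?"""
    return nodes_online * NODE_VOTES + QDISK_VOTES >= quorum

# After fencing w11, w12, w13 only 3 nodes remain, but the qdisk should keep quorum:
print(quorum)        # 6
print(quorate(3))    # True (3 + 5 = 8 >= 6)
```

So by this arithmetic, losing three nodes should not dissolve quorum, which is why the observed behaviour looks wrong to me.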

On w21.local I get an error that quorum is dissolved and the cluster is
not quorate. It takes some time before the cluster is quorate again.
During that time rgmanager is not working and the service httpd_w21 is
down. After regaining quorum I get the error:
Mar 11 16:16:20 w21 clurgmgrd[1946]: <err> #34: Cannot get status for
service service:httpd_w21

When all members of the cluster are online, clustat shows:

1. on w21.local
Member Status: Quorate

  Member Name                 ID   Status
  ------ ----                 ---- ------
  w11.local                   1    Online, rgmanager
  w12.local                   2    Online, rgmanager
  w13.local                   3    Online, rgmanager
  w21.local                   4    Online, Local, rgmanager
  w22.local                   5    Online, rgmanager
  w23.local                   6    Online, rgmanager
  /dev/xvdd1                  0    Online, Quorum Disk

  Service Name         Owner (Last)         State
  ------- ----         ----- ------         -----
  service:httpd_w11    w11.local            started

2. on w11.local, w12.local, w13.local, which were fenced:

Member Status: Quorate

  Member Name                 ID   Status
  ------ ----                 ---- ------
  w11.local                   1    Online, Local
  w12.local                   2    Online
  w13.local                   3    Online
  w21.local                   4    Online
  w22.local                   5    Online
  w23.local                   6    Online
  /dev/xvdd1                  0    Online, Quorum Disk

clustat shows that rgmanager is not running, but the logs contain:
Mar 11 14:56:27 w11 clurgmgrd[1942]: <notice> Resource Group Manager Starting
Mar 11 14:56:37 w11 clurgmgrd[1942]: <err> #34: Cannot get status for
service service:httpd_w11
Mar 11 14:56:37 w11 clurgmgrd[1942]: <err> #34: Cannot get status for
service service:httpd_w21


3. on w23.local:

Member Status: Quorate

  Member Name                 ID   Status
  ------ ----                 ---- ------
  w11.local                   1    Online, rgmanager
  w12.local                   2    Online, rgmanager
  w13.local                   3    Online, rgmanager
  w21.local                   4    Online, rgmanager
  w22.local                   5    Online, rgmanager
  w23.local                   6    Online, Local, rgmanager
  /dev/xvdd1                  0    Online, Quorum Disk

  Service Name         Owner (Last)         State
  ------- ----         ----- ------         -----
  service:httpd_w11    w11.local            started
  service:httpd_w21    w21.local            started

So the state of the cluster differs depending on the node. The problems
are:
1. after fencing nodes w11, w12, w13 the quorum is dissolved
2. services that should keep running on the remaining nodes go down
3. after bringing the fenced nodes back up, rgmanager has a different
view of the services on each node

I can't always reproduce this bug. Sometimes everything goes OK, but
that happens quite rarely.

Cheers
Agnieszka Kukalowicz


--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
