> So, depending on the node, the state of the cluster is different. The problems are:
> 1. after fencing nodes w11, w12, w13 the quorum is dissolved
> 2. services that should run on the remaining working nodes go down.
> 3. after bringing the fenced nodes back up, rgmanager has a different view of the services on each node.
>
> I can't always reproduce this bug. Sometimes everything goes OK, but that happens quite rarely.

Sorry, I forgot to mention that this is RHEL version 5.1 and the packages are:

cman-2.0.73-1.6.el5.test.bz327721
rgmanager-2.0.31-1.el5.bz430272

I did more tests, and the problem with losing quorum always(?) occurs after fencing the node that holds the master role.

Scenario 1:
============
1. w11, w12, w13 are running on physical machine w1
2. w21, w22, w23 are running on physical machine w2
3. w11 is master: "w11 qdiskd[1590]: <info> Assuming master role"
4. shutdown w2
5. w21, w22, w23 are fenced and everything is OK.

The log file:

Mar 12 12:17:36 w11 openais[1547]: [TOTEM] The token was lost in the OPERATIONAL state.
.........
Mar 12 12:17:41 w11 kernel: dlm: closing connection to node 4
Mar 12 12:17:41 w11 clurgmgrd[1954]: <info> State change: w21.local DOWN
Mar 12 12:17:41 w11 kernel: dlm: closing connection to node 5
Mar 12 12:17:41 w11 clurgmgrd[1954]: <info> State change: w22.local DOWN
Mar 12 12:17:41 w11 kernel: dlm: closing connection to node 6
Mar 12 12:17:41 w11 clurgmgrd[1954]: <info> State change: w23.local DOWN
Mar 12 12:17:41 w11 openais[1547]: [CLM ] New Configuration:
.......
Mar 12 12:17:50 w11 fenced[1605]: fencing node "w23.local.polska.pl"
Mar 12 12:17:51 w11 fenced[1605]: fence "w23.local.polska.pl" success
Mar 12 12:17:56 w11 fenced[1605]: fencing node "w21.local.polska.pl"
Mar 12 12:17:58 w11 fenced[1605]: fence "w21.local.polska.pl" success
Mar 12 12:17:58 w11 clurgmgrd[1954]: <info> Node #4 fenced; continuing
Mar 12 12:18:00 w11 qdiskd[1590]: <notice> Writing eviction notice for node 4
Mar 12 12:18:00 w11 qdiskd[1590]: <notice> Writing eviction notice for node 5
Mar 12 12:18:00 w11 qdiskd[1590]: <notice> Writing eviction notice for node 6
Mar 12 12:18:03 w11 fenced[1605]: fencing node "w22.local.polska.pl"
Mar 12 12:18:03 w11 qdiskd[1590]: <notice> Node 4 evicted
Mar 12 12:18:03 w11 qdiskd[1590]: <notice> Node 5 evicted
Mar 12 12:18:03 w11 qdiskd[1590]: <notice> Node 6 evicted
Mar 12 12:18:03 w11 fenced[1605]: fence "w22.local.polska.pl" success

6. w21, w22, w23 come back up and the services httpd_w11 and httpd_w21 are started.

Scenario 2:
===============
1. w11, w12, w13 are running on physical machine w1
2. w21, w22, w23 are running on physical machine w2
3. w11 is master: "w11 qdiskd[1590]: <info> Assuming master role"
4. shutdown w1

The log on w21:

Mar 12 14:00:58 w21 clurgmgrd[1975]: <info> State change: w11.local DOWN
Mar 12 14:00:59 w21 clurgmgrd[1975]: <info> State change: w12.local DOWN
Mar 12 14:00:59 w21 clurgmgrd[1975]: <info> State change: w13.local DOWN
Mar 12 14:00:59 w21 clurgmgrd[1975]: <info> State change: /dev/xvdd1 UP
Mar 12 14:00:59 w21 fenced[1626]: fencing node "w13.local "
Mar 12 14:00:59 w21 clurgmgrd[1975]: <info> State change: /dev/xvdd1 UP
Mar 12 14:01:07 w21 clurgmgrd[1975]: <info> State change: /dev/xvdd1 UP
Mar 12 14:01:08 w21 clurgmgrd[1975]: <info> Waiting for node #1 to be fenced
Mar 12 14:01:10 w21 fenced[1626]: fence "w13.local " success
Mar 12 14:01:15 w21 fenced[1626]: fencing node "w12.local"
Mar 12 14:01:15 w21 fenced[1626]: fence "w12.local" success
Mar 12 14:01:20 w21 fenced[1626]: fencing node "w11.local"
Mar 12 14:01:21 w21 fenced[1626]: fence "w11.local" success
Mar 12 14:01:22 w21 clurgmgrd[1975]: <info> Node #1 fenced; continuing
Mar 12 14:01:30 w21 qdiskd[1611]: <info> Assuming master role
Mar 12 14:01:33 w21 qdiskd[1611]: <notice> Writing eviction notice for node 1
Mar 12 14:01:33 w21 qdiskd[1611]: <notice> Writing eviction notice for node 2
Mar 12 14:01:33 w21 qdiskd[1611]: <notice> Writing eviction notice for node 3
Mar 12 14:01:36 w21 qdiskd[1611]: <notice> Node 1 evicted
Mar 12 14:01:36 w21 qdiskd[1611]: <notice> Node 2 evicted
Mar 12 14:01:36 w21 qdiskd[1611]: <notice> Node 3 evicted

5.
nodes w11, w12, w13 are brought back up.

The log on w21:

Mar 12 14:07:11 w21 clurgmgrd[1975]: <info> State change: /dev/xvdd1 UP
Mar 12 14:07:11 w21 clurgmgrd[1975]: <info> State change: /dev/xvdd1 UP
Mar 12 14:07:24 w21 clurgmgrd[1975]: <info> State change: /dev/xvdd1 UP
Mar 12 14:07:33 w21 clurgmgrd[1975]: <info> State change: w12.local UP
Mar 12 14:07:33 w21 clurgmgrd[1975]: <info> State change: /dev/xvdd1 UP
Mar 12 14:07:35 w21 clurgmgrd[1975]: <info> State change: w11.local UP
Mar 12 14:07:35 w21 clurgmgrd[1975]: <info> State change: /dev/xvdd1 UP
Mar 12 14:07:48 w21 clurgmgrd[1975]: <info> State change: w13.local UP
Mar 12 14:07:48 w21 clurgmgrd[1975]: <info> State change: /dev/xvdd1 UP

6. rgmanager doesn't start the service on w11.

The log on w11:

Mar 12 12:42:16 w11 clurgmgrd[1971]: <notice> Resource Group Manager Starting
Mar 12 12:42:16 w11 clurgmgrd[1971]: <info> Loading Service Data
Mar 12 12:42:19 w11 clurgmgrd[1971]: <info> Initializing Services
Mar 12 12:42:19 w11 clurgmgrd: [1971]: <err> script:httpd_script: stop of /etc/init.d/httpd failed (returned 143)
Mar 12 12:42:19 w11 clurgmgrd[1971]: <notice> stop on script "httpd_script" returned 1 (generic error)
Mar 12 12:42:19 w11 clurgmgrd[1971]: <info> Services Initialized
Mar 12 12:42:19 w11 clurgmgrd[1971]: <info> State change: Local UP
Mar 12 12:42:19 w11 clurgmgrd[1971]: <info> State change: w12.local UP
Mar 12 12:42:19 w11 clurgmgrd[1971]: <info> State change: w21.local UP
Mar 12 12:42:19 w11 clurgmgrd[1971]: <info> State change: w22.local UP
Mar 12 12:42:19 w11 clurgmgrd[1971]: <info> State change: w23.local UP
Mar 12 12:42:24 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w21
Mar 12 12:42:25 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w11
Mar 12 12:42:25 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w21
Mar 12 12:42:25 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w11
Mar 12 12:42:25 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w21
Mar 12 12:42:25 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w11
Mar 12 12:42:25 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w21
Mar 12 12:42:25 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w11
Mar 12 12:42:25 w11 qdiskd[1607]: <info> Initial score 1/1
Mar 12 12:42:25 w11 qdiskd[1607]: <info> Initialization complete
Mar 12 12:42:25 w11 openais[1564]: [CMAN ] quorum device registered
Mar 12 12:42:25 w11 qdiskd[1607]: <notice> Score sufficient for master operation (1/1; required=1); upgrading
Mar 12 12:42:25 w11 clurgmgrd[1971]: <info> State change: /dev/xvdd1 UP
Mar 12 12:42:25 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w21
Mar 12 12:42:25 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w11
Mar 12 12:42:25 w11 clurgmgrd[1971]: <err> #34: Cannot get status for service service:httpd_w21

The same happens on nodes w12 and w13.

Is this a bug, or am I making a mistake somewhere?

Cheers

Agnieszka Kukalowicz

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
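[Editorial note] The symptom above, that quorum dissolves only when the qdisk master node is among those fenced, is consistent with the qdisk votes transiently dropping out between the old master being killed and a new master taking over. The following sketch illustrates the cman vote arithmetic for a six-node cluster with a quorum disk; the vote counts are assumptions for illustration, since the report does not include the actual cluster.conf (`<cman expected_votes>`, `<quorumd votes>`) values.

```python
# Sketch of cman quorum arithmetic for a 6-node cluster with a quorum
# disk. All vote counts here are ASSUMED values for illustration only;
# the real ones come from cluster.conf, which is not shown in the report.

NODE_VOTES = 1    # default vote count per node
NODES = 6         # w11..w13 on machine w1, w21..w23 on machine w2
QDISK_VOTES = 5   # a common choice: nodes - 1, so that either half of
                  # the cluster plus the qdisk stays quorate

expected_votes = NODES * NODE_VOTES + QDISK_VOTES   # 11
quorum = expected_votes // 2 + 1                    # 6

def is_quorate(nodes_alive: int, qdisk_counted: bool) -> bool:
    """Does the surviving partition still hold quorum?"""
    votes = nodes_alive * NODE_VOTES + (QDISK_VOTES if qdisk_counted else 0)
    return votes >= quorum

# Three surviving nodes plus the qdisk are comfortably quorate:
print(is_quorate(3, qdisk_counted=True))    # True  (3 + 5 = 8 >= 6)

# Three nodes without the qdisk votes are not. If the qdisk master is
# fenced and its votes lapse before another node assumes the master
# role, the survivors briefly fall below quorum -- matching the report
# that quorum dissolves only when the master is among the fenced nodes.
print(is_quorate(3, qdisk_counted=False))   # False (3 < 6)
```

Under these assumed settings, the window between "fence "w11.local" success" (14:01:21) and "Assuming master role" on w21 (14:01:30) would be exactly where the qdisk contribution is in doubt.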