Two-node cluster (the nodes are named virtfed and virtfedbis) running F11 x86_64, up to date as of today, and without qdisk:
cman-3.0.2-1.fc11.x86_64
openais-1.0.1-1.fc11.x86_64
corosync-1.0.0-1.fc11.x86_64
and kernel 2.6.30.8-64.fc11.x86_64
I was in a situation where both nodes were up: virtfedbis had just restarted and was starting a service.
Inside one of its resources there is a loop that tests the availability of a file (a rough sketch of that loop is below, after the log), so the node was still starting this service, but the cluster infrastructure was up, as shown by these messages on virtfed:
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] CLM CONFIGURATION CHANGE
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] New Configuration:
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.101)
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] Members Left:
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.102)
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] Members Joined:
Oct 5 11:44:39 virtfed corosync[4684]: [QUORUM] This node is within the primary component and will provide service.
Oct 5 11:44:39 virtfed corosync[4684]: [QUORUM] Members[1]:
Oct 5 11:44:39 virtfed corosync[4684]: [QUORUM] 1
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] CLM CONFIGURATION CHANGE
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] New Configuration:
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.101)
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] Members Left:
Oct 5 11:44:39 virtfed corosync[4684]: [CLM ] Members Joined:
Oct 5 11:44:39 virtfed corosync[4684]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 5 11:44:39 virtfed kernel: dlm: closing connection to node 2
Oct 5 11:44:39 virtfed corosync[4684]: [MAIN ] Completed service synchronization, ready to provide service.
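For reference, the loop inside that resource script looks essentially like this (the flag file name and the timings are invented here, only the structure matters):

# wait for a flag file before completing the start, for a limited number of attempts
i=0
while [ ! -e /drbd/ready.flag ] && [ $i -lt 10 ]; do
    sleep 30
    i=$((i+1))
done
[ -e /drbd/ready.flag ] || exit 1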
So now they are in this condition, as reported on virtfedbis:
[root@virtfedbis ~]# clustat
Cluster Status for kvm @ Mon Oct 5 11:49:27 2009
Member Status: Quorate
 Member Name                  ID   Status
 ------ ----                  ---- ------
 kvm1                            1 Online, rgmanager
 kvm2                            2 Online, Local, rgmanager

 Service Name                 Owner (Last)                 State
 ------- ----                 ----- ------                 -----
 service:DRBDNODE1            kvm1                         started
 service:DRBDNODE2            kvm2                         starting
I realized that I had forgotten something, so that after 10 attempts the DRBDNODE2 service would not come up, and I decided to put
virtfedbis into single user mode by running on it:
shutdown 0
I would expect virtfedbis to leave the cluster cleanly; instead it is fenced and rebooted (via the fence_ilo agent).
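By "leave cleanly" I mean more or less what the init scripts do on a normal reboot, i.e. stopping the cluster stack before the network interface disappears, roughly:

service rgmanager stop    # stop/relocate the services and leave rgmanager
service cman stop         # cman_tool leave; stops fenced, dlm_controld, corosync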
On virtfed these are the messages:
Oct 5 11:49:49 virtfed corosync[4684]: [TOTEM ] A processor failed, forming new configuration.
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] CLM CONFIGURATION CHANGE
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] New Configuration:
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.101)
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] Members Left:
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.102)
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] Members Joined:
Oct 5 11:49:54 virtfed corosync[4684]: [QUORUM] This node is within the primary component and will provide service.
Oct 5 11:49:54 virtfed corosync[4684]: [QUORUM] Members[1]:
Oct 5 11:49:54 virtfed corosync[4684]: [QUORUM] 1
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] CLM CONFIGURATION CHANGE
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] New Configuration:
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] #011r(0) ip(192.168.16.101)
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] Members Left:
Oct 5 11:49:54 virtfed corosync[4684]: [CLM ] Members Joined:
Oct 5 11:49:54 virtfed corosync[4684]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 5 11:49:54 virtfed corosync[4684]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 5 11:49:54 virtfed kernel: dlm: closing connection to node 2
Oct 5 11:49:54 virtfed fenced[4742]: fencing node kvm2
Oct 5 11:49:54 virtfed rgmanager[5496]: State change: kvm2 DOWN
Oct 5 11:50:26 virtfed fenced[4742]: fence kvm2 success
What I find on virtfedbis after the restart, in the /var/log/cluster directory, is this:
corosync.log
Oct 05 11:49:49 corosync [TOTEM ] A processor failed, forming new configuration.
Oct 05 11:49:49 corosync [TOTEM ] The network interface is down.
Oct 05 11:49:54 corosync [CLM ] CLM CONFIGURATION CHANGE
Oct 05 11:49:54 corosync [CLM ] New Configuration:
Oct 05 11:49:54 corosync [CLM ] r(0) ip(127.0.0.1)
Oct 05 11:49:54 corosync [CLM ] Members Left:
Oct 05 11:49:54 corosync [CLM ] r(0) ip(192.168.16.102)
Oct 05 11:49:54 corosync [CLM ] Members Joined:
Oct 05 11:49:54 corosync [QUORUM] This node is within the primary component and will provide service.
Oct 05 11:49:54 corosync [QUORUM] Members[1]:
Oct 05 11:49:54 corosync [QUORUM] 1
Oct 05 11:49:54 corosync [CLM ] CLM CONFIGURATION CHANGE
Oct 05 11:49:54 corosync [CLM ] New Configuration:
Oct 05 11:49:54 corosync [CLM ] r(0) ip(127.0.0.1)
Oct 05 11:49:54 corosync [CLM ] Members Left:
Oct 05 11:49:54 corosync [CLM ] Members Joined:
Oct 05 11:49:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 05 11:49:54 corosync [CMAN ] Killing node kvm2 because it has rejoined the cluster with existing state
I think there is something wrong in this behaviour...
This is a test cluster, so I have no qdisk.
Could the cause be inherent in my config, which has:
<cman expected_votes="1" two_node="1"/>
<fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>
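For context, apart from the service/resource definitions, my /etc/cluster/cluster.conf is basically like the sketch below (only a sketch: the fence device names are placeholders, and the iLO addresses and credentials are omitted):

<?xml version="1.0"?>
<cluster name="kvm" config_version="...">
  <cman expected_votes="1" two_node="1"/>
  <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>
  <clusternodes>
    <clusternode name="kvm1" nodeid="1">
      <fence>
        <method name="1">
          <device name="ilo_kvm1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="kvm2" nodeid="2">
      <fence>
        <method name="1">
          <device name="ilo_kvm2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="ilo_kvm1" agent="fence_ilo" ipaddr="..." login="..." passwd="..."/>
    <fencedevice name="ilo_kvm2" agent="fence_ilo" ipaddr="..." login="..." passwd="..."/>
  </fencedevices>
  <rm>
    <!-- DRBDNODE1 and DRBDNODE2 service definitions here -->
  </rm>
</cluster>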
In general, if I do a shutdown -r now on one of the two nodes, I don't have this kind of problem.
Thanks for any insight,
Gianluca