RE: Freeze with cluster-2.03.11

Kadlecsik Jozsef <kadlec@xxxxxxxxxxxx> · Fri, 27 Mar 2009 07:47:06 +0100 (CET)

On Fri, 27 Mar 2009, Ben Yarwood wrote:

> Replaying a journal as below usually indicates a node has withdrawn from that
> file system, I believe.  You should grep messages on all nodes for 'GFS'; if
> any node is reporting errors with this fs then it will need rebooting/fencing
> before access to that fs can be achieved.

The failing node is fenced off. Here are the steps to reproduce the
freeze of the node:

- all nodes are running and are members of the cluster
- start the mailman queue manager: the node freezes
- the frozen node is fenced off by a member of the cluster
- I can then see the log messages I quoted in my first mail:

Mar 26 23:09:24 lxserv1 kernel: dlm: closing connection to node 1
Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: Trying to acquire journal lock...
[...]

- sometimes (but not always) the fencing machine freezes as well
  and is then fenced off in turn
- the third node has never frozen so far, so the cluster has remained
  in quorum
- the fenced-off machines restart, rejoin the cluster and work until I start
  the mailman queue manager again
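
To apply the suggestion quoted above (grep the messages on all nodes for
'GFS') across several log files at once, something like the following
sketch could group the kernel GFS and dlm messages by the host that logged
them. It is only an illustration: the function name cluster_messages, the
regular expression and the third sample line are assumptions made here,
not part of any cluster tool, and the pattern only covers the classic
syslog layout seen in the excerpt above.

```python
import re
from collections import defaultdict

# Matches classic syslog kernel lines of the form seen in the excerpt:
#   <Mon DD HH:MM:SS> <host> kernel: <GFS|dlm>: <message>
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<subsys>GFS|dlm):\s+(?P<msg>.*)$"
)


def cluster_messages(lines):
    """Group kernel GFS/dlm messages by the host that logged them."""
    by_host = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m:
            by_host[m["host"]].append((m["ts"], m["subsys"], m["msg"]))
    return dict(by_host)


# The first two lines are taken from the log excerpt above; the third is a
# made-up unrelated line, included only to show that it is filtered out.
sample = [
    "Mar 26 23:09:24 lxserv1 kernel: dlm: closing connection to node 1",
    "Mar 26 23:09:25 lxserv1 kernel: GFS: fsid=kfki:home.1: jid=3: "
    "Trying to acquire journal lock...",
    "Mar 26 23:09:26 lxserv1 sshd[123]: unrelated line",
]

result = cluster_messages(sample)
for host, msgs in result.items():
    for ts, subsys, msg in msgs:
        print(host, ts, subsys, msg)
```

In practice the lines would come from the messages file of each node (for
example by concatenating open(path) iterators), so that a withdraw or
journal replay reported by any one node shows up next to the dlm events
from the others.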

The daily backups of the entire GFS file systems still complete, so I assume
this is not file system corruption.

Best regards,
Jozsef
--
E-mail : kadlec@xxxxxxxxxxxx, kadlec@xxxxxxxxxxxxxxxxx
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster