On Fri, 27 Mar 2009, Bob Peterson wrote:
> | Combing through the log files I found the following:
> |
> | Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member
> | after 0 sec post_fail_delay
> | Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node "web1-gfs"
> | Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node
> | e1÷?e1÷?
> | Mar 27 13:31:56 lxserv0 fenced[3833]: fence "web1-gfs" success
> |
> | The line saying "can't get node number for node e1÷?e1÷?" might be
> | innocent, but it looks suspicious. Why couldn't fenced get the
> | victim's name?
>
> This leads me to believe that this is a cluster problem,
> not a GFS problem. If a node is fenced, GFS can't give out
> new locks until the fenced node is properly dealt with by
> the cluster software. Therefore, GFS can appear to hang until
> the dead node is resolved. Did web1-gfs get rebooted and
> brought back into the cluster?
Yes. It's probably worth summarizing what's happening here:
- A full, healthy-looking cluster with all five nodes joined
runs smoothly.
- One node freezes out of the blue; it can reliably be triggered
anytime by starting mailman, which works over GFS.
- The frozen node gets fenced off; I assume the order isn't reversed,
i.e. the node doesn't freeze *because* it got fenced (see the log
filter sketch below).
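
To double-check that ordering, a rough Python sketch of the kind of log
filter I mean, for pulling out the fenced daemon's lines so their
timestamps can be compared with the node's last sign of life. The log
path and line format are only assumptions based on the
"Mar 27 13:31:56 lxserv0 fenced[3833]: ..." lines quoted above, so adjust
as needed:

#!/usr/bin/env python
# Extract fenced[] lines from syslog so the fence timestamps can be
# compared with the last sign of life of the frozen node.
# Assumption: syslog goes to /var/log/messages in the usual
# "Mar 27 13:31:56 host fenced[pid]: message" format.
import re
import sys

LOG = "/var/log/messages"
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]+)\s+\S+\s+fenced\[\d+\]:\s+(?P<msg>.*)$")

def fence_events(path):
    """Yield (timestamp, message) pairs for every fenced[] syslog line."""
    with open(path) as fh:
        for line in fh:
            m = PATTERN.match(line)
            if m:
                yield m.group("stamp"), m.group("msg")

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else LOG
    for stamp, msg in fence_events(path):
        print("%s  %s" % (stamp, msg))
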
As we use AoE, the fencing happens at the AoE level: the node is *not*
rebooted automatically, but its access rights to the AoE devices are
withdrawn. "Freeze" means there's no response at the console; the node still
answers ping, but nothing else. There's not a single error message in
the kernel log or on the console screen.
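
As a quick cross-check that the access withdrawal really took effect, a
read issued from the fenced node itself should fail (or hang until the
aoe driver gives up). Something along these lines, assuming the AoE
devices show up as /dev/etherd/eX.Y; e1.1 below is only a placeholder,
substitute the real shelf.slot numbers:

#!/usr/bin/env python
# Try to read one sector from an AoE device; on a fenced node this is
# expected to fail once its MAC has been dropped from the target's
# allow list.  /dev/etherd/e1.1 is a placeholder device name.
import os
import sys

DEVICE = "/dev/etherd/e1.1"

def can_read_sector(path):
    """Return True if one 512-byte sector can still be read."""
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            return len(os.read(fd, 512)) == 512
        finally:
            os.close(fd)
    except OSError as err:
        sys.stderr.write("read failed: %s\n" % err)
        return False

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else DEVICE
    print("access OK" if can_read_sector(path) else "no access")
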
GFS does not freeze at all. There's a short pause, but then it works fine
until the quorum is lost as more nodes fall out.
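
For reference, the quorum arithmetic with five nodes, assuming the usual
one vote per node and the cman majority rule of expected_votes/2 + 1
(adjust if cluster.conf sets other votes or expected_votes):

# Back-of-the-envelope quorum check for a five-node cluster.
def quorum(expected_votes):
    """Minimum votes needed to stay quorate: simple majority."""
    return expected_votes // 2 + 1

def is_quorate(nodes_alive, votes_per_node=1, expected_votes=5):
    return nodes_alive * votes_per_node >= quorum(expected_votes)

if __name__ == "__main__":
    for alive in range(5, 0, -1):
        print("%d of 5 up -> quorate: %s" % (alive, is_quorate(alive)))

So the cluster survives the first two losses, and once a third node is
gone the remaining two can no longer form quorum.
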
We tried vanilla kernels 2.6.27.14 and 2.6.27.21 with the same results, so
I don't think it's a kernel problem. It >looks< like either a GFS kernel
module problem or an openais problem, if the latter (as the victim machine
is fenced off) can cause a system freeze.
During the daytime (with active users) it was like an infection: within ten
minutes of bringing the machines back, one failed, then shortly afterwards
another one too. Now, since 17:22 (more than three hours), the cluster has
been running smoothly, but it's lightly used. However, a node can be killed
at any time by starting that damned mailman, which needs to run.
Best regards,
Jozsef
--
E-mail : kadlec@xxxxxxxxxxxx, kadlec@xxxxxxxxxxxxxxxxx
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
H-1525 Budapest 114, POB. 49, Hungary
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster