Found the bug report for this:

https://bugzilla.redhat.com/show_bug.cgi?id=444529

It has been fixed, but not in my version. I need to determine whether I can simply fence the affected nodes without compromising the cluster (since the fence daemon itself is affected). Since our production cluster is currently stable, I'll probably try this on a test cluster. Later we'll attempt a rolling upgrade of the cluster to get the bug fix.

From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jeff Sturm

CentOS 5.2, 26-node cluster.

Today
I restarted one node. It left the cluster, rebooted and joined the cluster without incident.

Everything is fine but… fenced has the CPU pegged. No useful log messages. strace says it is spinning on poll/recvfrom:

poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN, revents=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN, revents=POLLNVAL}], 4, -1) = 2
recvfrom(5, 0x7fffb074ab40, 20, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN, revents=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN, revents=POLLNVAL}], 4, -1) = 2
recvfrom(5, 0x7fffb074ab40, 20, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)

Anything else useful I can do to diagnose? What are the chances I can recover this node nicely without making things worse?

Any help/ideas appreciated,

Jeff
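
One possible reading of the strace output above: fd 8 comes back with POLLNVAL (the descriptor is not open), so poll() returns immediately on every call even with an infinite timeout, and the non-blocking recvfrom() on fd 5 has nothing to read. The following is only a minimal Python sketch of that general effect, not fenced's actual code. The function name `count_immediate_wakeups` and the pipe used as a stand-in descriptor are made up for illustration, and a finite timeout replaces the -1 seen in the trace so the sketch cannot hang.

```python
import os
import select
import time


def count_immediate_wakeups(iterations=5):
    """Poll a set that still contains a closed descriptor.

    Returns (wakeups, elapsed_seconds), where wakeups counts how many
    poll() calls reported POLLNVAL for the stale descriptor.
    """
    read_end, write_end = os.pipe()
    poller = select.poll()
    poller.register(read_end, select.POLLIN)

    # Close the descriptor but leave it registered, mimicking a stale
    # entry in the poll set (an assumption about what happened to fd 8).
    os.close(read_end)

    wakeups = 0
    start = time.monotonic()
    for _ in range(iterations):
        # A healthy idle poll would block for the full 1000 ms here.
        events = poller.poll(1000)
        if any(fd == read_end and (ev & select.POLLNVAL) for fd, ev in events):
            wakeups += 1
    elapsed = time.monotonic() - start

    os.close(write_end)
    return wakeups, elapsed


if __name__ == "__main__":
    wakeups, elapsed = count_immediate_wakeups()
    print(f"{wakeups} POLLNVAL wakeups in {elapsed:.3f}s")
```

If something similar is happening inside fenced, the loop never sleeps because the invalid entry is never dropped from the set. The sketch says nothing about how fd 8 became invalid in the first place, or about the readable fd 6 that also appears in each poll() result.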