Re: what should fence_xvm do if dom0 is down

On Wed, Oct 6, 2010 at 6:51 AM, Lon Hohberger <lhh@xxxxxxxxxx> wrote:
> On 10/01/2010 02:11 AM, Joel Heenan wrote:
>> So just further to this I found a Red Hat bug about this exact issue:
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=570373
>>
>> And for me it works perfectly if the dom0 is fenced using fence_node on
>> the command line. However, if the host becomes unavailable then it is
>> not fenced, and from reading the fenced man page it seems this is
>> because there isn't a shared resource like clvm or gfs, so therefore the
>> cluster doesn't see a need to fence the host. This means subsequent
>> fence_xvm commands fail.
>>
>> I guess I need to find a way to force fenced to operate without clvm and
>> fence dom0s?
>>
>> Joel
>
> fence_xvm/fence_xvmd is designed to handle two primary cases:
>
> 1) kill the misbehaving VM, or
> 2) Wait for the last-known owner of misbehaving VM to be dead.
>
> Effectively, (2) occurs when the host cluster node dies and the host is subsequently fenced.
>
> According to 570373, (2) stopped working at some point, but I haven't gotten enough information to adequately debug the problem.
>
> If you have a cluster which exhibits this behavior, please contact me on FreeNode in #linux-cluster.
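
The two cases described above can be sketched as a small decision function. This is only an illustration, not fence_xvmd's actual code: the names `Action`, `decide`, `owner_is_member` and `owner_fenced` are hypothetical, and it assumes case 1 applies whenever the VM's host is still a cluster member. The third outcome, `WAIT`, corresponds to the situation reported in this thread, where the host is gone but has not been fenced.

```python
from enum import Enum


class Action(Enum):
    KILL_VM = "kill the misbehaving VM on its host"
    REPORT_DEAD = "last-known owner was fenced, so the VM is dead"
    WAIT = "keep waiting for the last-known owner to be fenced"


def decide(owner_is_member: bool, owner_fenced: bool) -> Action:
    """Hypothetical model of the two cases, not the real implementation."""
    # Case 1: the host is still in the cluster, so the VM can be killed directly.
    if owner_is_member:
        return Action.KILL_VM
    # Case 2: the host died and was fenced, so the VM cannot still be running.
    if owner_fenced:
        return Action.REPORT_DEAD
    # Host is gone but not fenced: nothing can be confirmed yet.
    return Action.WAIT


if __name__ == "__main__":
    for member, fenced in [(True, False), (False, True), (False, False)]:
        print(member, fenced, decide(member, fenced).name)
```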

Hi Lon,

I was able to re-create this issue and capture the logs as described in the bug. I will send them to your email address.

This is what it looks like from the guest:

"""
2010-10-06T23:26:31.902493+00:00 c013otin01-test fenced[1891]: c013otin07-test not a cluster member after 0 sec post_fail_delay
2010-10-06T23:26:31.902608+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:26:36.858569+00:00 c013otin01-test clurgmgrd[3519]: <info> Waiting for node #7 to be fenced
2010-10-06T23:27:04.440434+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:27:04.440548+00:00 c013otin01-test ccsd[1862]: Attempt to close an unopened CCS descriptor (3035370).
2010-10-06T23:27:04.440595+00:00 c013otin01-test ccsd[1862]: Error while processing disconnect: Invalid request descriptor
2010-10-06T23:27:04.440633+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:27:09.444804+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:27:41.703023+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:27:41.703146+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:27:46.703283+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:28:19.365666+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:28:19.365967+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:28:24.365843+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:28:56.643939+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:28:56.644226+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:29:01.644127+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:29:34.171420+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:29:34.171507+00:00 c013otin01-test ccsd[1862]: Attempt to close an unopened CCS descriptor (3035970).
2010-10-06T23:29:34.171524+00:00 c013otin01-test ccsd[1862]: Error while processing disconnect: Invalid request descriptor
2010-10-06T23:29:34.171578+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:29:39.170656+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:30:01.418667+00:00 c013otin01-test rsync_policy_files: receiving file list ... done
2010-10-06T23:30:01.418699+00:00 c013otin01-test rsync_policy_files:
2010-10-06T23:30:01.418708+00:00 c013otin01-test rsync_policy_files: sent 30 bytes  received 12 bytes  84.00 bytes/sec
2010-10-06T23:30:01.418716+00:00 c013otin01-test rsync_policy_files: total size is 0  speedup is 0.00
2010-10-06T23:30:08.760903+00:00 c013otin01-test kernel: INFO: task clurgmgrd:25022 blocked for more than 120 seconds.
2010-10-06T23:30:08.760918+00:00 c013otin01-test kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2010-10-06T23:30:08.760923+00:00 c013otin01-test kernel: clurgmgrd     D ffff880001064b60     0 25022   3518         25023 25019 (NOTLB)
2010-10-06T23:30:08.760926+00:00 c013otin01-test kernel: ffff88016437fdb8  0000000000000286  0000000000000000  00000000ee8f8108
2010-10-06T23:30:08.760928+00:00 c013otin01-test kernel: 0000000000000008  ffff88098ed37080  ffff88097c9207a0  00000000000087b3
2010-10-06T23:30:08.760930+00:00 c013otin01-test kernel: ffff88098ed37268  ffffffff8029ed82
2010-10-06T23:30:08.760933+00:00 c013otin01-test kernel: Call Trace:
2010-10-06T23:30:08.760937+00:00 c013otin01-test kernel: [<ffffffff8029ed82>] futex_wake+0x50/0xd4
2010-10-06T23:30:08.760940+00:00 c013otin01-test kernel: [<ffffffff8023fe9c>] do_futex+0x2c2/0xcfb
2010-10-06T23:30:08.760942+00:00 c013otin01-test kernel: [<ffffffff802644cb>] __down_read+0x82/0x9a
2010-10-06T23:30:08.760945+00:00 c013otin01-test kernel: [<ffffffff8830b468>] :dlm:dlm_user_request+0x2d/0x175
"""
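
The retry pattern in the log above (six "fencing node" attempts and five "failed" results in this excerpt, each attempt timing out after roughly 32 seconds and retrying about 5 seconds later) can be counted with a short script. This is a minimal sketch; the function name `summarize` and the regular expressions are my own, written only against the line format shown here.

```python
import re
from collections import Counter

# Patterns match the fenced messages exactly as they appear in the log above.
ATTEMPT = re.compile(r'fenced\[\d+\]: fencing node "([^"]+)"')
FAILED = re.compile(r'fenced\[\d+\]: fence "([^"]+)" failed')


def summarize(lines):
    """Count fencing attempts and failures per node name."""
    attempts, failures = Counter(), Counter()
    for line in lines:
        match = ATTEMPT.search(line)
        if match:
            attempts[match.group(1)] += 1
            continue
        match = FAILED.search(line)
        if match:
            failures[match.group(1)] += 1
    return attempts, failures


# A few lines copied from the log above, used as sample input.
sample = [
    '2010-10-06T23:26:31.902608+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"',
    '2010-10-06T23:27:04.440434+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response',
    '2010-10-06T23:27:04.440633+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed',
    '2010-10-06T23:27:09.444804+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"',
]

if __name__ == "__main__":
    attempts, failures = summarize(sample)
    for node in attempts:
        print(node, "attempts:", attempts[node], "failures:", failures[node])
```

To run it against a real log, pass an open file object to `summarize` instead of `sample`.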

Here is what the fence_xvmd log shows on one dom0:

"""
Request to fence: c013otin07-test
Evaluating Domain: c013otin07-test   Last Owner: 7   State 1
Domain                   UUID                                 Owner State
------                   ----                                 ----- -----
c013operations01-test    9654e57b-7bb6-019e-937b-dc009f734a13 00001 00001
c013otin01-test          6fc9063b-5e9f-ef86-5ae2-8faa5fcde84a 00001 00001
c013summary01-test       10432e54-673f-8c61-d08d-591c42adce6e 00001 00002
Domain-0                 00000000-0000-0000-0000-000000000000 00001 00001
Storing c013operations01-test
Storing c013otin01-test
Storing c013summary01-test
Request to fence: c013otin07-test
Evaluating Domain: c013otin07-test   Last Owner: 7   State 1
"""

I did notice that group_tool state looks a bit borked:

"""
[root@dom0-01 ~]# group_tool
type             level name       id       state      
fence            0     default    00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
dlm              1     rgmanager  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
"""
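
To check the same thing on several dom0s, the output above can be scanned for groups that have not settled. This is a rough sketch: the function name `unsettled_groups` is hypothetical, the parsing relies only on the five-column layout shown above, and it assumes a settled group reports the state "none".

```python
def unsettled_groups(output, settled_state="none"):
    """Return (type, name, state) for each group whose state is not settled.

    Assumption: group lines have exactly five whitespace-separated fields
    (type, level, name, id, state) and the level is a number, as in the
    output pasted above. Header and member-list lines are skipped.
    """
    found = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[1].isdigit():
            group_type, _level, name, _group_id, state = fields
            if state != settled_state:
                found.append((group_type, name, state))
    return found


# The group_tool output from this message, used as sample input.
sample_output = """\
type             level name       id       state
fence            0     default    00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
dlm              1     rgmanager  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
"""

if __name__ == "__main__":
    for group_type, name, state in unsettled_groups(sample_output):
        print(group_type, name, state)
```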

Is the JOIN_STOP_WAIT state in the group_tool output the problem here? If so, do you know how to fix it without rebooting all the nodes? I tried "fence_tool leave" and "fence_tool join" on all dom0s, but that didn't resolve the problem.

Thanks

Joel