On Wed, Oct 6, 2010 at 6:51 AM, Lon Hohberger <lhh@xxxxxxxxxx> wrote:
On 10/01/2010 02:11 AM, Joel Heenan wrote:
So just further to this, I found a Red Hat bug about this exact issue:
https://bugzilla.redhat.com/show_bug.cgi?id=570373
And for me it works perfectly if the dom0 is fenced using fence_node on
the command line. However, if the host becomes unavailable it is not
fenced; from reading the fenced man page this seems to be because there
is no shared resource like clvm or gfs, so the cluster doesn't see a
need to fence the host. This means subsequent fence_xvm commands fail.
I guess I need to find a way to force fenced to operate without clvm
and fence the dom0s?
Joel
fence_xvm/fence_xvmd is designed to handle two primary cases:
1) kill the misbehaving VM, or
2) Wait for the last-known owner of misbehaving VM to be dead.
Effectively, (2) occurs when the host cluster node dies and the host is subsequently fenced.
According to 570373, (2) stopped working at some point, but I haven't gotten enough information to adequately debug the problem.
If you have a cluster which exhibits this behavior, please contact me on FreeNode in #linux-cluster.
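To make sure we're talking about the same behavior, here is a minimal sketch of the two cases as I understand them (hypothetical names and state codes, not the real fence_xvmd source):

```python
# Hedged sketch of the fence_xvm/fence_xvmd decision, per the two
# cases above. State codes and names here are made up for illustration.

RUNNING, GONE = 1, 5  # hypothetical domain state codes


def can_complete_fence(domain_state, last_owner, fenced_nodes):
    """Return True if a fence request for a VM can be satisfied now.

    Case 1: the VM is running on this host, so kill it directly.
    Case 2: the VM is gone, but its last-known owner node has already
            been fenced, so the VM is provably dead.
    Otherwise the daemon has to keep waiting (and may time out).
    """
    if domain_state == RUNNING:
        return True  # case 1: destroy the misbehaving VM
    return last_owner in fenced_nodes  # case 2: owner host fenced
```

If neither case holds, the request can never complete, which would match the repeated "Timed out waiting for response" below.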
Hi Lon,
I was able to re-create this issue and capture the logs as per the bug; I will send them to your email address.
This is what it looks like from the guest:
"""
2010-10-06T23:26:31.902493+00:00 c013otin01-test fenced[1891]: c013otin07-test not a cluster member after 0 sec post_fail_delay
2010-10-06T23:26:31.902608+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:26:36.858569+00:00 c013otin01-test clurgmgrd[3519]: <info> Waiting for node #7 to be fenced
2010-10-06T23:27:04.440434+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:27:04.440548+00:00 c013otin01-test ccsd[1862]: Attempt to close an unopened CCS descriptor (3035370).
2010-10-06T23:27:04.440595+00:00 c013otin01-test ccsd[1862]: Error while processing disconnect: Invalid request descriptor
2010-10-06T23:27:04.440633+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:27:09.444804+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:27:41.703023+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:27:41.703146+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:27:46.703283+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:28:19.365666+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:28:19.365967+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:28:24.365843+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:28:56.643939+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:28:56.644226+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:29:01.644127+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:29:34.171420+00:00 c013otin01-test fenced[1891]: agent "fence_xvm" reports: Timed out waiting for response
2010-10-06T23:29:34.171507+00:00 c013otin01-test ccsd[1862]: Attempt to close an unopened CCS descriptor (3035970).
2010-10-06T23:29:34.171524+00:00 c013otin01-test ccsd[1862]: Error while processing disconnect: Invalid request descriptor
2010-10-06T23:29:34.171578+00:00 c013otin01-test fenced[1891]: fence "c013otin07-test" failed
2010-10-06T23:29:39.170656+00:00 c013otin01-test fenced[1891]: fencing node "c013otin07-test"
2010-10-06T23:30:01.418667+00:00 c013otin01-test rsync_policy_files: receiving file list ... done
2010-10-06T23:30:01.418699+00:00 c013otin01-test rsync_policy_files:
2010-10-06T23:30:01.418708+00:00 c013otin01-test rsync_policy_files: sent 30 bytes received 12 bytes 84.00 bytes/sec
2010-10-06T23:30:01.418716+00:00 c013otin01-test rsync_policy_files: total size is 0 speedup is 0.00
2010-10-06T23:30:08.760903+00:00 c013otin01-test kernel: INFO: task clurgmgrd:25022 blocked for more than 120 seconds.
2010-10-06T23:30:08.760918+00:00 c013otin01-test kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2010-10-06T23:30:08.760923+00:00 c013otin01-test kernel: clurgmgrd D ffff880001064b60 0 25022 3518 25023 25019 (NOTLB)
2010-10-06T23:30:08.760926+00:00 c013otin01-test kernel: ffff88016437fdb8 0000000000000286 0000000000000000 00000000ee8f8108
2010-10-06T23:30:08.760928+00:00 c013otin01-test kernel: 0000000000000008 ffff88098ed37080 ffff88097c9207a0 00000000000087b3
2010-10-06T23:30:08.760930+00:00 c013otin01-test kernel: ffff88098ed37268 ffffffff8029ed82
2010-10-06T23:30:08.760933+00:00 c013otin01-test kernel: Call Trace:
2010-10-06T23:30:08.760937+00:00 c013otin01-test kernel: [<ffffffff8029ed82>] futex_wake+0x50/0xd4
2010-10-06T23:30:08.760940+00:00 c013otin01-test kernel: [<ffffffff8023fe9c>] do_futex+0x2c2/0xcfb
2010-10-06T23:30:08.760942+00:00 c013otin01-test kernel: [<ffffffff802644cb>] __down_read+0x82/0x9a
2010-10-06T23:30:08.760945+00:00 c013otin01-test kernel: [<ffffffff8830b468>] :dlm:dlm_user_request+0x2d/0x175
"""
Here is what the fence_xvmd log shows on one dom0:
"""
Request to fence: c013otin07-test
Evaluating Domain: c013otin07-test Last Owner: 7 State 1
Domain UUID Owner State
------ ---- ----- -----
c013operations01-test 9654e57b-7bb6-019e-937b-dc009f734a13 00001 00001
c013otin01-test 6fc9063b-5e9f-ef86-5ae2-8faa5fcde84a 00001 00001
c013summary01-test 10432e54-673f-8c61-d08d-591c42adce6e 00001 00002
Domain-0 00000000-0000-0000-0000-000000000000 00001 00001
Storing c013operations01-test
Storing c013otin01-test
Storing c013summary01-test
Request to fence: c013otin07-test
Evaluating Domain: c013otin07-test Last Owner: 7 State 1
"""
I did notice that group_tool state looks a bit borked:
"""
[root@dom0-01 ~]# group_tool
type level name id state
fence 0 default 00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
dlm 1 rgmanager 00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 8 9 10]
"""
In the group_tool output, is the JOIN_STOP_WAIT state the problem here? If so, do you know how to fix it without rebooting all the nodes? I tried "fence_tool leave" and "fence_tool join" on all dom0s, but that didn't resolve the problem.
Thanks
Joel
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster