Re: clvmd leaving kernel dlm uncontrolled lockspace

On 05.06.13 17:13, David Teigland wrote:

A few different topics wrapped together there:

- With kill -9 clvmd (possibly combined with dlm_tool leave clvmd),
   you can manually clear/remove a userland lockspace like clvmd.
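
A minimal sketch of that manual cleanup, assuming the lockspace carries the usual name "clvmd" and dlm_tool from the dlm userland package is installed:

    # tell dlm_controld to leave the clvmd lockspace
    dlm_tool leave clvmd
    # then remove the stuck daemon itself
    kill -9 $(pidof clvmd)
    # verify the lockspace is gone
    dlm_tool ls

Note the kill only works if clvmd is not stuck in uninterruptible sleep, which is the next point.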

- If clvmd is blocked in the kernel in uninterruptible sleep, then
   the kill above will not work.  To make kill work, you'd locate the
   particular sleep in the kernel and determine if there's a way to
   make it interruptible, and cleanly back it out.

I had clvmd processes blocked in the kernel, so how would I "locate the sleep and make it interruptible"?
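
Locating the sleep is the observable half; making it interruptible would mean changing the kernel code at that spot. A sketch of the locating part, assuming a kernel built with CONFIG_STACKTRACE and sysrq enabled (the PID is the one from the hung-task message below):

    # kernel stack of one blocked task
    cat /proc/19766/stack
    # or dump the stacks of all uninterruptible (D-state) tasks at once;
    # output goes to dmesg
    echo w > /proc/sysrq-trigger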

- If clvmd is blocked in the kernel for >120s, you probably want to
   investigate what is causing that, rather than being too hasty
   killing clvmd.

INFO: task clvmd:19766 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
clvmd           D ffff880058ec4870     0 19766      1 0x00000000
ffff880058ec4870 0000000000000282 0000000000000000 ffff8800698d9590
0000000000013740 ffff880063787fd8 ffff880063787fd8 0000000000013740
ffff880058ec4870 ffff880063786010 0000000000000001 0000000100000000
Call Trace:
[<ffffffff81367f7a>] ? rwsem_down_failed_common+0xda/0x10e
[<ffffffff811c5924>] ? call_rwsem_down_read_failed+0x14/0x30
[<ffffffff813678da>] ? down_read+0x17/0x19
[<ffffffffa059b705>] ? dlm_user_request+0x3a/0x17e [dlm]
[<ffffffffa05a40e4>] ? device_write+0x279/0x5f7 [dlm]
[<ffffffff810f7d7a>] ? __kmalloc+0x104/0x116
[<ffffffffa05a416b>] ? device_write+0x300/0x5f7 [dlm]
[<ffffffff810042c9>] ? xen_mc_flush+0x12b/0x158
[<ffffffff8117489e>] ? security_file_permission+0x18/0x2d
[<ffffffff81106dd5>] ? vfs_write+0xa4/0xff
[<ffffffff81106ee6>] ? sys_write+0x45/0x6e
[<ffffffff8136d652>] ? system_call_fastpath+0x16/0x1b

This is on kernel 3.2.35.
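Before killing anything, the dlm's debugfs files can show what the lockspace is actually waiting on; kernels of this generation expose per-lockspace files under debugfs. A sketch, assuming debugfs support is compiled in and the lockspace is named clvmd:

    mount -t debugfs none /sys/kernel/debug 2>/dev/null
    # requests in the clvmd lockspace still waiting for remote replies
    cat /sys/kernel/debug/dlm/clvmd_waiters
    # complete lock state of the lockspace
    cat /sys/kernel/debug/dlm/clvmd_locks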


- If corosync or dlm_controld are killed while dlm lockspaces exist,
   they become "uncontrolled" and would need to be forcibly cleaned up.
   This cleanup may be possible to implement for userland lockspaces,
   but it's not been clear that the benefits would greatly outweigh
   using reboot for this.
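
Whether lockspaces have been left uncontrolled can at least be detected from userland, since the kernel keeps a sysfs directory per lockspace even after dlm_controld has died. A sketch, assuming sysfs is mounted at /sys:

    # kernel dlm lockspaces that currently exist
    ls /sys/kernel/dlm/
    # lockspaces listed here without a running dlm_controld are uncontrolled
    pidof dlm_controld >/dev/null || echo "no dlm_controld running"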

On a machine serving as a Xen host with 20+ running VMs, I'd clearly prefer to clean up those orphaned lockspaces and move on... I still have 4 hosts to reboot, all serving as Xen hosts and providing their devices from clvmd-controlled (i.e. now uncontrollable) SAN space.

- Killing either corosync or dlm_controld is very unlikely to help
   anything, and more likely to cause further problems, so it should
   be avoided as far as possible.

I understand. One reason to upgrade was that I had infrequent situations where the corosync 1.4.2 instances on all nodes exited simultaneously without any log notice. Having this happen with the new corosync 2.3/dlm infrastructure would mean a whole cluster with uncontrollable SAN space. So either the lockspace should be reclaimed automatically when dlm_controld finds it uncontrolled, or a means to clean it up manually should be available.

Regards,
Andreas

Dave

_______________________________________________
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/



