Re: umount stuck on NFS gateways switch over by using Pacemaker

<WD_Hwang@xxxxxxxxxxx> · Sat, 30 May 2015 14:49:16 +0000

Dear Eric:
  Thanks for the information. The command 'reboot -fn' works well.
  I don't know whether anybody else has run into this 'umount stuck' condition. If possible, I would like to find out why the failover stops working after the gateways have been left alone for about 30 minutes.

WD

-----Original Message-----
From: Eric Eastman [mailto:eric.eastman@xxxxxxxxxxxxxx] 
Sent: Thursday, May 28, 2015 10:56 PM
To: WD Hwang/WHQ/Wistron
Cc: Ceph Users
Subject: Re: umount stuck on NFS gateways switch over by using Pacemaker

On Thu, May 28, 2015 at 1:33 AM, <WD_Hwang@xxxxxxxxxxx> wrote:
>
> Hello,
>
>   I have recently been testing NFS over RBD. I am trying to build an NFS HA environment under Ubuntu 14.04, and the package versions are as follows:
> - Ubuntu 14.04 : 3.13.0-32-generic(Ubuntu 14.04.2 LTS)
> - ceph : 0.80.9-0ubuntu0.14.04.2
> - ceph-common : 0.80.9-0ubuntu0.14.04.2
> - pacemaker (git20130802-1ubuntu2.3)
> - corosync (2.3.3-1ubuntu1)
> PS: I also tried ceph/ceph-common (0.87.1-1trusty and 0.87.2-1trusty) on a 3.13.0-48-generic (Ubuntu 14.04.2) server and got the same result.
>
>   The environment has 5 nodes in the Ceph cluster (3 MONs and 5 OSDs) and two NFS gateways (nfs1 and nfs2) for high availability. I issued the command 'sudo service pacemaker stop' on 'nfs1' to force the resources to stop and move to 'nfs2', and vice versa.
>
> When both nodes are up and I issue 'sudo service pacemaker stop' on
> one node, the other node takes over all resources. Everything
> looks fine. I then wait about 30 minutes without touching the NFS
> gateways and repeat the same steps to test the failover again. This
> time the process state of 'umount' is 'D' (uninterruptible sleep), and
> 'ps' shows the following:
>
> root 21047 0.0 0.0 17412 952 ? D 16:39 0:00 umount /mnt/block1
>
> Does anyone have an idea how to solve or work around this? Because 'umount' is stuck, neither 'reboot' nor 'shutdown' completes. Unless I wait 20 minutes for 'umount' to time out, the only thing I can do is power off the server directly.
>
> Any help would be much appreciated.
>
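
For anyone trying to reproduce this, a minimal sketch for spotting processes stuck in state 'D', assuming procps 'ps' and awk are available. The helper name list_d_state and the "live" argument are made up for illustration and are not part of the original report:

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): read ps-style output on stdin
# and keep the header line plus every row whose STAT column (2nd field)
# starts with D, i.e. uninterruptible sleep.
list_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Query the live process table only when the script is run with "live".
# wchan shows the kernel function the process is sleeping in, if known.
if [ "${1:-}" = "live" ]; then
    ps -eo pid,stat,wchan:32,args | list_d_state
fi
```

Once the PID of the stuck 'umount' is known, reading /proc/<pid>/stack as root may show the kernel call chain it is blocked in, on kernels built with stack tracing support.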

I am not sure how to get out of the stuck umount, but you can skip the shutdown scripts that call the umount during a reboot using:

reboot -fn

This can cause data loss, as it is like a power cycle, so it is best to run sync before running the reboot -fn command to flush out buffers.

Sometimes when a system is really hung, reboot -fn does not work, but this seems to always work if run as root:

echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger
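
A slightly gentler variant, as a sketch only: the SysRq keys 's' (emergency sync) and 'u' (remount filesystems read-only) can be sent before 'b' (immediate reboot) to reduce the chance of data loss. This assumes SysRq has already been enabled as in the commands above. The function name and the DRY_RUN and SYSRQ_TRIGGER variables are hypothetical, added so the sequence can be previewed without touching the kernel:

```shell
#!/bin/sh
# Sketch only: send SysRq s (sync), u (remount read-only), b (reboot).
# DRY_RUN and SYSRQ_TRIGGER are hypothetical knobs added for illustration.
# With DRY_RUN=1 (the default here) nothing is written anywhere.
sysrq_reboot() {
    trigger="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
    for key in s u b; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "would write '$key' to $trigger"
        else
            echo "$key" > "$trigger"
            sleep 2   # give sync / remount a moment before the next key
        fi
    done
}

# Preview the sequence; run as root with DRY_RUN=0 to actually reboot.
sysrq_reboot
```

Whether the sync and remount steps finish while a mount is hung is not something this thread establishes, so the sequence may still behave like a plain power cycle.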

Eric
