Re: Gluster and NFS-Ganesha - cluster is down after reboot

On 05/12/2017 06:27 PM, Adam Ru wrote:
Hi Soumya,

Thank you very much for your last response – very useful.

I apologize for the delay; I had to find time for another round of testing.

I updated the instructions that I provided in the previous e-mail. *** means
that the step was newly added.

Instructions:
 - Clean installation of CentOS 7.3 with all updates, 3x node,
resolvable IPs and VIPs
 - Stopped firewalld (just for testing)
 - *** SELinux in permissive mode (I had to, will explain below)
 - Install "centos-release-gluster" to get "centos-gluster310" repo
and install following (nothing else):
 --- glusterfs-server
 --- glusterfs-ganesha
 - Passwordless SSH between all nodes
(/var/lib/glusterd/nfs/secret.pem and secret.pem.pub on all nodes)
 - systemctl enable and start glusterd
 - gluster peer probe <other nodes>
 - gluster volume set all cluster.enable-shared-storage enable
 - systemctl enable and start pcsd.service
 - systemctl enable pacemaker.service (cannot be started at this moment)
 - Set password for hacluster user on all nodes
 - pcs cluster auth <node 1> <node 2> <node 3> -u hacluster -p blabla
 - mkdir /var/run/gluster/shared_storage/nfs-ganesha/
 - touch /var/run/gluster/shared_storage/nfs-ganesha/ganesha.conf (not
sure if needed)
 - vi /var/run/gluster/shared_storage/nfs-ganesha/ganesha-ha.conf and
insert the configuration (an example is shown after this list)
 - Try to list files on the other nodes: ls
/var/run/gluster/shared_storage/nfs-ganesha/
 - gluster nfs-ganesha enable
 - *** systemctl enable pacemaker.service (again, since pacemaker was
disabled at this point)
 - *** Check owner of "state", "statd", "sm" and "sm.bak" in
/var/lib/nfs/ (I had to: chown rpcuser:rpcuser
/var/lib/nfs/statd/state)
 - Check on other nodes that nfs-ganesha.service is running and "pcs
status" shows started resources
 - gluster volume create mynewshare replica 3 transport tcp
node1:/<dir> node2:/<dir> node3:/<dir>
 - gluster volume start mynewshare
 - gluster vol set mynewshare ganesha.enable on
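
For reference, the configuration I insert there looks roughly like this (the
node names and VIPs are placeholders; the exact key format should follow the
sample ganesha-ha.conf shipped with glusterfs-ganesha and the admin guide):

# name of the HA cluster (arbitrary)
HA_NAME="ganesha-ha-cluster"
# comma-separated list of cluster nodes, matching the peer names
HA_CLUSTER_NODES="node1,node2,node3"
# one resolvable virtual IP per node
VIP_node1="192.168.1.101"
VIP_node2="192.168.1.102"
VIP_node3="192.168.1.103"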

At this moment, this is the status of the important (I think) services:

-- corosync.service             disabled
-- corosync-notifyd.service     disabled
-- glusterd.service             enabled
-- glusterfsd.service           disabled
-- pacemaker.service            enabled
-- pcsd.service                 enabled
-- nfs-ganesha.service          disabled
-- nfs-ganesha-config.service   static
-- nfs-ganesha-lock.service     static

-- corosync.service             active (running)
-- corosync-notifyd.service     inactive (dead)
-- glusterd.service             active (running)
-- glusterfsd.service           inactive (dead)
-- pacemaker.service            active (running)
-- pcsd.service                 active (running)
-- nfs-ganesha.service          active (running)
-- nfs-ganesha-config.service   inactive (dead)
-- nfs-ganesha-lock.service     active (running)

May I ask you a few questions please?

1. Could you please confirm that the services above have the correct status/state?

Looks good to the best of my knowledge.


2. When I restart a node, nfs-ganesha is not running. Of course I
cannot enable it, since it needs to start only after the shared storage is
mounted. What is the best practice to start it automatically so I don’t
have to worry about restarting a node? Should I create a script that
will check whether the shared storage was mounted and then start
nfs-ganesha? How do you do this in production?

That's right. We have plans to address this in the near future (probably by having a new .service which mounts shared_storage before starting nfs-ganesha). But until then, yes, having a custom-defined script to do so is the only way to automate it.
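
For example, a rough sketch of such a script (the mount path and unit name are
the ones used in the steps above; the retry loop itself is just an
illustration, nothing that ships with Gluster):

#!/bin/bash
# Example only: wait until the gluster shared_storage is mounted,
# then start nfs-ganesha.
MOUNTPOINT=/var/run/gluster/shared_storage

for i in $(seq 1 60); do
    if mountpoint -q "$MOUNTPOINT"; then
        exec systemctl start nfs-ganesha.service
    fi
    sleep 5
done

echo "$MOUNTPOINT not mounted after 5 minutes, giving up" >&2
exit 1

Such a script could be hooked in via rc.local or a simple oneshot unit ordered
after glusterd.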


3. SELinux is an issue – is that a known bug?

When I restart a node and start nfs-ganesha.service with SELinux in
permissive mode:

sudo grep 'statd' /var/log/messages
May 12 12:05:46 mynode1 rpc.statd[2415]: Version 1.3.0 starting
May 12 12:05:46 mynode1 rpc.statd[2415]: Flags: TI-RPC
May 12 12:05:46 mynode1 rpc.statd[2415]: Failed to read
/var/lib/nfs/statd/state: Success
May 12 12:05:46 mynode1 rpc.statd[2415]: Initializing NSM state
May 12 12:05:52 mynode1 rpc.statd[2415]: Received SM_UNMON_ALL request
from mynode1.localdomain while not monitoring any hosts

systemctl status nfs-ganesha-lock.service --full
● nfs-ganesha-lock.service - NFS status monitor for NFSv2/3 locking.
   Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha-lock.service;
static; vendor preset: disabled)
   Active: active (running) since Fri 2017-05-12 12:05:46 UTC; 1min 43s ago
  Process: 2414 ExecStart=/usr/sbin/rpc.statd --no-notify $STATDARGS
(code=exited, status=0/SUCCESS)
 Main PID: 2415 (rpc.statd)
   CGroup: /system.slice/nfs-ganesha-lock.service
           └─2415 /usr/sbin/rpc.statd --no-notify

May 12 12:05:46 mynode1.localdomain systemd[1]: Starting NFS status
monitor for NFSv2/3 locking....
May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Version 1.3.0 starting
May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Flags: TI-RPC
May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Failed to read
/var/lib/nfs/statd/state: Success
May 12 12:05:46 mynode1.localdomain rpc.statd[2415]: Initializing NSM state
May 12 12:05:46 mynode1.localdomain systemd[1]: Started NFS status
monitor for NFSv2/3 locking..
May 12 12:05:52 mynode1.localdomain rpc.statd[2415]: Received
SM_UNMON_ALL request from mynode1.localdomain while not monitoring any
hosts


When I restart a node and start nfs-ganesha.service with SELinux in
enforcing mode:


sudo grep 'statd' /var/log/messages
May 12 12:14:01 mynode1 rpc.statd[1743]: Version 1.3.0 starting
May 12 12:14:01 mynode1 rpc.statd[1743]: Flags: TI-RPC
May 12 12:14:01 mynode1 rpc.statd[1743]: Failed to open directory sm:
Permission denied
May 12 12:14:01 mynode1 rpc.statd[1743]: Failed to open
/var/lib/nfs/statd/state: Permission denied

systemctl status nfs-ganesha-lock.service --full
● nfs-ganesha-lock.service - NFS status monitor for NFSv2/3 locking.
   Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha-lock.service;
static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2017-05-12 12:14:01
UTC; 1min 21s ago
  Process: 1742 ExecStart=/usr/sbin/rpc.statd --no-notify $STATDARGS
(code=exited, status=1/FAILURE)

May 12 12:14:01 mynode1.localdomain systemd[1]: Starting NFS status
monitor for NFSv2/3 locking....
May 12 12:14:01 mynode1.localdomain rpc.statd[1743]: Version 1.3.0 starting
May 12 12:14:01 mynode1.localdomain rpc.statd[1743]: Flags: TI-RPC
May 12 12:14:01 mynode1.localdomain rpc.statd[1743]: Failed to open
directory sm: Permission denied
May 12 12:14:01 mynode1.localdomain systemd[1]:
nfs-ganesha-lock.service: control process exited, code=exited status=1
May 12 12:14:01 mynode1.localdomain systemd[1]: Failed to start NFS
status monitor for NFSv2/3 locking..
May 12 12:14:01 mynode1.localdomain systemd[1]: Unit
nfs-ganesha-lock.service entered failed state.
May 12 12:14:01 mynode1.localdomain systemd[1]: nfs-ganesha-lock.service failed.

I can't remember right now. Could you please paste the AVCs you get and the SELinux package versions? Or, preferably, please file a bug. We can get the details verified by the SELinux team members.
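
For example (assuming auditd is running), something along these lines should
capture what we need:

# recent AVC denials
ausearch -m avc -ts today
# package versions
rpm -q selinux-policy selinux-policy-targeted nfs-ganesha glusterfs-ganesha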

Thanks,
Soumya


On Fri, May 5, 2017 at 8:10 PM, Soumya Koduri <skoduri@xxxxxxxxxx> wrote:


On 05/05/2017 08:04 PM, Adam Ru wrote:

Hi Soumya,

Thank you for the answer.

Enabling Pacemaker? Yes, you’re completely right, I didn’t do it. Thank
you.

I spent some time testing and I have some results. This is what I did:

 - Clean installation of CentOS 7.3 with all updates, 3x node,
resolvable IPs and VIPs
 - Stopped firewalld (just for testing)
 - Install "centos-release-gluster" to get "centos-gluster310" repo and
install following (nothing else):
 --- glusterfs-server
 --- glusterfs-ganesha
 - Passwordless SSH between all nodes (/var/lib/glusterd/nfs/secret.pem
and secret.pem.pub on all nodes)
 - systemctl enable and start glusterd
 - gluster peer probe <other nodes>
 - gluster volume set all cluster.enable-shared-storage enable
 - systemctl enable and start pcsd.service
 - systemctl enable pacemaker.service (cannot be started at this moment)
 - Set password for hacluster user on all nodes
 - pcs cluster auth <node 1> <node 2> <node 3> -u hacluster -p blabla
 - mkdir /var/run/gluster/shared_storage/nfs-ganesha/
 - touch /var/run/gluster/shared_storage/nfs-ganesha/ganesha.conf (not
sure if needed)
 - vi /var/run/gluster/shared_storage/nfs-ganesha/ganesha-ha.conf and
insert configuration
 - Try to list files on the other nodes: ls
/var/run/gluster/shared_storage/nfs-ganesha/
 - gluster nfs-ganesha enable
 - Check on other nodes that nfs-ganesha.service is running and "pcs
status" shows started resources
 - gluster volume create mynewshare replica 3 transport tcp node1:/<dir>
node2:/<dir> node3:/<dir>
 - gluster volume start mynewshare
 - gluster vol set mynewshare ganesha.enable on

After these steps, all VIPs are pingable and I can mount node1:/mynewshare

The funny thing is that pacemaker.service is disabled again (something
disabled it). This is the status of the important (I think) services:


Yeah, we too had observed this recently. We guess the 'pcs cluster setup'
command probably first destroys the existing cluster (if any), which may be
disabling pacemaker too.


systemctl list-units --all
# corosync.service             loaded    active   running
# glusterd.service             loaded    active   running
# nfs-config.service           loaded    inactive dead
# nfs-ganesha-config.service   loaded    inactive dead
# nfs-ganesha-lock.service     loaded    active   running
# nfs-ganesha.service          loaded    active   running
# nfs-idmapd.service           loaded    inactive dead
# nfs-mountd.service           loaded    inactive dead
# nfs-server.service           loaded    inactive dead
# nfs-utils.service            loaded    inactive dead
# pacemaker.service            loaded    active   running
# pcsd.service                 loaded    active   running

systemctl list-unit-files --all
# corosync-notifyd.service    disabled
# corosync.service            disabled
# glusterd.service            enabled
# glusterfsd.service          disabled
# nfs-blkmap.service          disabled
# nfs-config.service          static
# nfs-ganesha-config.service  static
# nfs-ganesha-lock.service    static
# nfs-ganesha.service         disabled
# nfs-idmap.service           static
# nfs-idmapd.service          static
# nfs-lock.service            static
# nfs-mountd.service          static
# nfs-rquotad.service         disabled
# nfs-secure-server.service   static
# nfs-secure.service          static
# nfs-server.service          disabled
# nfs-utils.service           static
# nfs.service                 disabled
# nfslock.service             static
# pacemaker.service           disabled
# pcsd.service                enabled

I enabled pacemaker again on all nodes and restarted all nodes one by one.

After the reboot all VIPs are gone and I can see that nfs-ganesha.service
isn’t running. When I start it on at least two nodes, the VIPs are
pingable again and I can mount NFS again. But there is still some issue
in the setup, because when I check nfs-ganesha-lock.service I get:

systemctl -l status nfs-ganesha-lock.service
● nfs-ganesha-lock.service - NFS status monitor for NFSv2/3 locking.
   Loaded: loaded (/usr/lib/systemd/system/nfs-ganesha-lock.service;
static; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2017-05-05 13:43:37 UTC;
31min ago
  Process: 6203 ExecStart=/usr/sbin/rpc.statd --no-notify $STATDARGS
(code=exited, status=1/FAILURE)

May 05 13:43:37 node0.localdomain systemd[1]: Starting NFS status
monitor for NFSv2/3 locking....
May 05 13:43:37 node0.localdomain rpc.statd[6205]: Version 1.3.0 starting
May 05 13:43:37 node0.localdomain rpc.statd[6205]: Flags: TI-RPC
May 05 13:43:37 node0.localdomain rpc.statd[6205]: Failed to open
directory sm: Permission denied


Okay, this issue was fixed and the fix should be present in 3.10 too -
   https://review.gluster.org/#/c/16433/

Please check '/var/log/messages' for statd-related errors and cross-check the
permissions of that directory. You could manually chown owner:group of the
/var/lib/nfs/statd/sm directory for now and then restart the nfs-ganesha*
services.
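
For example (assuming the same rpcuser:rpcuser ownership you used for
/var/lib/nfs/statd/state):

chown rpcuser:rpcuser /var/lib/nfs/statd/sm
systemctl restart nfs-ganesha-lock.service nfs-ganesha.service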

Thanks,
Soumya

May 05 13:43:37 node0.localdomain rpc.statd[6205]: Failed to open
/var/lib/nfs/statd/state: Permission denied
May 05 13:43:37 node0.localdomain systemd[1]: nfs-ganesha-lock.service:
control process exited, code=exited status=1
May 05 13:43:37 node0.localdomain systemd[1]: Failed to start NFS status
monitor for NFSv2/3 locking..
May 05 13:43:37 node0.localdomain systemd[1]: Unit
nfs-ganesha-lock.service entered failed state.
May 05 13:43:37 node0.localdomain systemd[1]: nfs-ganesha-lock.service
failed.

Thank you,

Kind regards,

Adam

On Wed, May 3, 2017 at 10:32 AM, Mahdi Adnan <mahdi.adnan@xxxxxxxxxxx> wrote:

    Hi,


    Same here; when I reboot the node I have to manually execute "pcs
    cluster start gluster01", even though pcsd is already enabled and started.
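
    For reference, a sketch of the workaround and of what might make the
    cluster services come up on their own at boot (the second command is an
    assumption, not something the ganesha setup does for you):

    # manual workaround after a reboot
    pcs cluster start gluster01
    # possible fix: enable corosync/pacemaker at boot on all nodes
    pcs cluster enable --all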

    Gluster 3.8.11

    Centos 7.3 latest

    Installed using CentOS Storage SIG repository



    --

    Respectfully
    Mahdi A. Mahdi


------------------------------------------------------------------------
    From: gluster-users-bounces@xxxxxxxxxxx on behalf of Adam Ru <ad.ruckel@xxxxxxxxx>
    Sent: Wednesday, May 3, 2017 12:09:58 PM
    To: Soumya Koduri
    Cc: gluster-users@xxxxxxxxxxx
    Subject: Re: Gluster and NFS-Ganesha - cluster is down after reboot

    Hi Soumya,

    thank you very much for your reply.

    I enabled pcsd during setup, and after the reboot, during troubleshooting,
    I manually started it and checked the resources (pcs status). They were
    not running. I didn’t find what was wrong, but I’m going to try it again.

    I’ve thoroughly checked

http://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/

    and I can confirm that I followed all steps with one exception. I
    installed the following RPMs:
    glusterfs-server
    glusterfs-fuse
    glusterfs-cli
    glusterfs-ganesha
    nfs-ganesha-xfs

    and the guide referenced above specifies:
    glusterfs-server
    glusterfs-api
    glusterfs-ganesha

    glusterfs-api is a dependency of one of the RPMs that I installed, so
    this is not a problem. But I cannot find any mention of installing
    nfs-ganesha-xfs.

    I’ll try to set up the whole environment again without installing
    nfs-ganesha-xfs (I assume glusterfs-ganesha has all the required
    binaries).

    Again, thank you for your time answering my previous message.

    Kind regards,
    Adam

    On Tue, May 2, 2017 at 8:49 AM, Soumya Koduri <skoduri@xxxxxxxxxx> wrote:

        Hi,

        On 05/02/2017 01:34 AM, Rudolf wrote:

            Hi Gluster users,

            First, I'd like to thank you all for this amazing
            open-source project! Thank you!

            I'm working on a home project – three servers with Gluster and
            NFS-Ganesha. My goal is to create an HA NFS share with three
            copies of each file on each server.

            My systems are CentOS 7.3 Minimal installs with the latest
            updates and the most current RPMs from the "centos-gluster310"
            repository.

            I followed this tutorial:

http://blog.gluster.org/2015/10/linux-scale-out-nfsv4-using-nfs-ganesha-and-glusterfs-one-step-at-a-time/

            (second half that describes multi-node HA setup)

            with a few exceptions:

            1. All RPMs are from "centos-gluster310" repo that is
            installed by "yum
            -y install centos-release-gluster"
            2. I have three nodes (not four) with "replica 3" volume.
            3. I created an empty ganesha.conf and a non-empty ganesha-ha.conf
            in "/var/run/gluster/shared_storage/nfs-ganesha/" (the referenced
            blog post is outdated; this is now a requirement)
            4. ganesha-ha.conf doesn't have "HA_VOL_SERVER" since this
            isn't needed
            anymore.


        Please refer to

http://gluster.readthedocs.io/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/


        It is being updated with the latest changes made to the setup.

            When I finish the configuration, all is good.
            nfs-ganesha.service is active and running, and from a client
            I can ping all three VIPs and I can mount NFS. Copied files
            are replicated to all nodes.

            But when I restart the nodes (one by one, with a 5 min. delay
            between them) then I cannot ping or mount (I assume that all
            VIPs are down). So my setup definitely isn't HA.

            I found that:
            # pcs status
            Error: cluster is not currently running on this node


        This means the pcsd service is not up. Did you enable the pcsd
        service (systemctl enable pcsd) so that it comes up automatically
        post reboot? If not, please start it manually.


            and nfs-ganesha.service is in an inactive state. Btw, I didn't
            run "systemctl enable nfs-ganesha" since I assumed that this
            is something that Gluster does.


        Please check /var/log/ganesha.log for any errors/warnings.

        We recommend not enabling nfs-ganesha.service (by default), as
        the shared storage (where the ganesha.conf file resides now)
        should be up and running before nfs-ganesha gets started.
        So if it were enabled by default, it could happen that the
        shared_storage mount point is not yet up, resulting in an
        nfs-ganesha service failure. If you would like to address this,
        you could have a cron job which keeps checking the mount point
        health and then starts the nfs-ganesha service, for instance
        along the lines of the sketch below.
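
        A minimal sketch of such a cron entry (example only – the mount path
        is the shared_storage location used above, and the one-minute
        interval is arbitrary):

        # /etc/cron.d/nfs-ganesha-watch (example)
        * * * * * root mountpoint -q /var/run/gluster/shared_storage && ! systemctl --quiet is-active nfs-ganesha && systemctl start nfs-ganesha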

        Thanks,
        Soumya


            I assume that my issue is that I followed the instructions in
            the blog post from 2015/10, which are outdated. Unfortunately
            I cannot find anything better – I spent the whole day googling.

            Would you be so kind as to check the instructions in the blog
            post and let me know which steps are wrong / outdated? Or do
            you have more current instructions for a Gluster+Ganesha setup?

            Thank you.

            Kind regards,
            Adam



            _______________________________________________
            Gluster-users mailing list
            Gluster-users@xxxxxxxxxxx
            http://lists.gluster.org/mailman/listinfo/gluster-users




    --
    Adam




--
Adam



_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users



