[SOLVED] [Nfs-ganesha-support] volume start: gv01: failed: Quorum not met. Volume operation not allowed.

Hey All,

It appears I've solved this one, and NFS mounts now work on all my clients. No issues since fixing it a few hours back.

RESOLUTION

SELinux was to blame for the trouble. I noticed the following AVC denials in the audit logs on 2 of the 3 NFS servers (nfs01 and nfs02; nfs03 was fine):

type=AVC msg=audit(1526965320.850:4094): avc: denied { write } for pid=8714 comm="ganesha.nfsd" name="nfs_0" dev="dm-0" ino=201547689 scontext=system_u:system_r:ganesha_t:s0 tcontext=system_u:object_r:krb5_host_rcache_t:s0 tclass=file
type=SYSCALL msg=audit(1526965320.850:4094): arch=c000003e syscall=2 success=no exit=-13 a0=7f23b0003150 a1=2 a2=180 a3=2 items=0 ppid=1 pid=8714 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ganesha.nfsd" exe="/usr/bin/ganesha.nfsd" subj=system_u:system_r:ganesha_t:s0 key=(null)
type=PROCTITLE msg=audit(1526965320.850:4094): proctitle=2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54
type=AVC msg=audit(1526965320.850:4095): avc: denied { unlink } for pid=8714 comm="ganesha.nfsd" name="nfs_0" dev="dm-0" ino=201547689 scontext=system_u:system_r:ganesha_t:s0 tcontext=system_u:object_r:krb5_host_rcache_t:s0 tclass=file
type=SYSCALL msg=audit(1526965320.850:4095): arch=c000003e syscall=87 success=no exit=-13 a0=7f23b0004100 a1=7f23b0000050 a2=7f23b0004100 a3=5 items=0 ppid=1 pid=8714 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ganesha.nfsd" exe="/usr/bin/ganesha.nfsd" subj=system_u:system_r:ganesha_t:s0 key=(null)
type=PROCTITLE msg=audit(1526965320.850:4095): proctitle=2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54

The fix was to adjust the SELinux policy using audit2allow.
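
For anyone hitting the same thing, the workflow was roughly the following (the module name "ganesha_local" is just my label, nothing official):

# Generate and load a local SELinux policy module from the ganesha.nfsd denials:
grep ganesha.nfsd /var/log/audit/audit.log | audit2allow -M ganesha_local
semodule -i ganesha_local.pp

Then re-test the mounts and keep watching /var/log/audit/audit.log for further denials.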

All the errors below, including the ones in the linked posts, were due to that.

It turns out that whenever it worked, the client had hit the only working server in the system, nfs03. Whenever it didn't work, it was hitting one of the non-working servers. So sometimes it worked, and other times it didn't. It looked like an HAProxy / Keepalived issue as well, since I couldn't mount using the VIP but could using the host directly, but that wasn't the case either.
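
To illustrate the test that threw me off (the VIP name "nfs-vip" here is a stand-in, not my real hostname):

# Mount via the VIP vs. directly via a backend host:
mount -t nfs4 nfs-vip:/n /mnt   # hit-and-miss, depending on which server the VIP landed on
mount -t nfs4 nfs03:/n /mnt     # always worked, since nfs03 had no denials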

I had also added a third brick to the GlusterFS volume, on nfs03, to see if the backend FS was to blame, since GlusterFS recommends a minimum of 3 bricks for replication, but that had no effect.
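
For the record, adding the third brick was along these lines (assuming the brick path matches the existing two):

gluster peer probe nfs03
gluster volume add-brick gv01 replica 3 nfs03:/bricks/0/gv01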

In case anyone runs into this, I've added notes here as well:

http://microdevsys.com/wp/kernel-nfs-nfs4_discover_server_trunking-unhandled-error-512-exiting-with-error-eio-and-mount-hangs/

http://microdevsys.com/wp/nfs-reply-xid-3844308326-reply-err-20-auth-rejected-credentials-client-should-begin-new-session/

The errors thrown included:

NFS reply xid 3844308326 reply ERR 20: Auth Rejected Credentials (client should begin new session)

kernel: NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO and mount hangs

Plus the kernel trace below.

--
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


May 21 23:53:13 psql01 kernel: CPU: 3 PID: 2273 Comm: mount.nfs Tainted: G L ------------ 3.10.0-693.21.1.el7.x86_64 #1
.
.
.
May 21 23:53:13 psql01 kernel: task: ffff880136335ee0 ti: ffff8801376b0000 task.ti: ffff8801376b0000
May 21 23:53:13 psql01 kernel: RIP: 0010:[<ffffffff816b6545>] [<ffffffff816b6545>] _raw_spin_unlock_irqrestore+0x15/0x20
May 21 23:53:13 psql01 kernel: RSP: 0018:ffff8801376b3a60  EFLAGS: 00000206
May 21 23:53:13 psql01 kernel: RAX: ffffffffc05ab078 RBX: ffff880036973928 RCX: dead000000000200
May 21 23:53:13 psql01 kernel: RDX: ffffffffc05ab078 RSI: 0000000000000206 RDI: 0000000000000206
May 21 23:53:13 psql01 kernel: RBP: ffff8801376b3a60 R08: ffff8801376b3ab8 R09: ffff880137de1200
May 21 23:53:13 psql01 kernel: R10: ffff880036973928 R11: 0000000000000000 R12: ffff880036973928
May 21 23:53:13 psql01 kernel: R13: ffff8801376b3a58 R14: ffff88013fd98a40 R15: ffff8801376b3a58
May 21 23:53:13 psql01 kernel: FS: 00007fab48f07880(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
May 21 23:53:13 psql01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 21 23:53:13 psql01 kernel: CR2: 00007f99793d93cc CR3: 000000013761e000 CR4: 00000000000007e0
May 21 23:53:13 psql01 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 21 23:53:13 psql01 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 21 23:53:13 psql01 kernel: Call Trace:
May 21 23:53:13 psql01 kernel: [<ffffffff810b4d86>] finish_wait+0x56/0x70
May 21 23:53:13 psql01 kernel: [<ffffffffc0580361>] nfs_wait_client_init_complete+0xa1/0xe0 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
May 21 23:53:13 psql01 kernel: [<ffffffffc0581e9b>] nfs_get_client+0x22b/0x470 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc05eafd8>] nfs4_set_client+0x98/0x130 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05ec77e>] nfs4_create_server+0x13e/0x3b0 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05e391e>] nfs4_remote_mount+0x2e/0x60 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffff81209f1e>] mount_fs+0x3e/0x1b0
May 21 23:53:13 psql01 kernel: [<ffffffff811aa685>] ? __alloc_percpu+0x15/0x20
May 21 23:53:13 psql01 kernel: [<ffffffff81226d57>] vfs_kern_mount+0x67/0x110
May 21 23:53:13 psql01 kernel: [<ffffffffc05e3846>] nfs_do_root_mount+0x86/0xc0 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05e3c44>] nfs4_try_mount+0x44/0xc0 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05826d7>] ? get_nfs_version+0x27/0x90 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc058ec9b>] nfs_fs_mount+0x4cb/0xda0 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc058fbe0>] ? nfs_clone_super+0x140/0x140 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc058daa0>] ? param_set_portnr+0x70/0x70 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffff81209f1e>] mount_fs+0x3e/0x1b0
May 21 23:53:13 psql01 kernel: [<ffffffff811aa685>] ? __alloc_percpu+0x15/0x20
May 21 23:53:13 psql01 kernel: [<ffffffff81226d57>] vfs_kern_mount+0x67/0x110
May 21 23:53:13 psql01 kernel: [<ffffffff81229263>] do_mount+0x233/0xaf0
May 21 23:53:13 psql01 kernel: [<ffffffff81229ea6>] SyS_mount+0x96/0xf0
May 21 23:53:13 psql01 kernel: [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
May 21 23:53:13 psql01 kernel: [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146




On 5/7/2018 10:28 PM, TomK wrote:
On 4/11/2018 11:54 AM, Alex K wrote:

Hey Guys,

Returning to this topic after disabling the quorum:

cluster.quorum-type: none
cluster.server-quorum-type: none
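
(For anyone following along, those two options were set with the usual volume-set commands:)

gluster volume set gv01 cluster.quorum-type none
gluster volume set gv01 cluster.server-quorum-type none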

I've run into a number of gluster errors (see below).

I'm using gluster as the backend for my NFS storage.  I have gluster running on two nodes, nfs01 and nfs02, and the volume is mounted on /n on each host.  The path /n is in turn shared out by NFS Ganesha.  It's a two-node setup with quorum disabled, as noted above:

[root@nfs02 ganesha]# mount|grep gv01
nfs02:/gv01 on /n type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@nfs01 glusterfs]# mount|grep gv01
nfs01:/gv01 on /n type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
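
For context, the Ganesha side is a standard FSAL_GLUSTER export. Mine is roughly like the sketch below; the Export_Id, Pseudo path, and SecType are from memory, so treat them as placeholders rather than a copy of my ganesha.conf:

EXPORT {
    Export_Id = 1;
    Path = "/";
    Pseudo = "/n";
    Access_Type = RW;
    Squash = No_Root_Squash;
    SecType = "krb5";
    FSAL {
        Name = GLUSTER;
        Hostname = "localhost";
        Volume = "gv01";
    }
}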

Gluster always reports the volume as healthy whenever I run the two commands below:

[root@nfs01 glusterfs]# gluster volume info

Volume Name: gv01
Type: Replicate
Volume ID: e5ccc75e-5192-45ac-b410-a34ebd777666
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: nfs01:/bricks/0/gv01
Brick2: nfs02:/bricks/0/gv01
Options Reconfigured:
cluster.server-quorum-type: none
cluster.quorum-type: none
server.event-threads: 8
client.event-threads: 8
performance.readdir-ahead: on
performance.write-behind-window-size: 8MB
performance.io-thread-count: 16
performance.cache-size: 1GB
nfs.trusted-sync: on
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
[root@nfs01 glusterfs]# gluster status
unrecognized word: status (position 0)
[root@nfs01 glusterfs]# gluster volume status
Status of volume: gv01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick nfs01:/bricks/0/gv01                  49152     0          Y       1422
Brick nfs02:/bricks/0/gv01                  49152     0          Y       1422
Self-heal Daemon on localhost               N/A       N/A        Y       1248
Self-heal Daemon on nfs02.nix.my.dom        N/A       N/A        Y       1251

Task Status of Volume gv01
------------------------------------------------------------------------------
There are no active volume tasks

[root@nfs01 glusterfs]#

[root@nfs01 glusterfs]# rpm -aq|grep -Ei gluster
glusterfs-3.13.2-2.el7.x86_64
glusterfs-devel-3.13.2-2.el7.x86_64
glusterfs-fuse-3.13.2-2.el7.x86_64
glusterfs-api-devel-3.13.2-2.el7.x86_64
centos-release-gluster313-1.0-1.el7.centos.noarch
python2-gluster-3.13.2-2.el7.x86_64
glusterfs-client-xlators-3.13.2-2.el7.x86_64
glusterfs-server-3.13.2-2.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.9.x86_64
glusterfs-cli-3.13.2-2.el7.x86_64
centos-release-gluster312-1.0-1.el7.centos.noarch
python2-glusterfs-api-1.1-1.el7.noarch
glusterfs-libs-3.13.2-2.el7.x86_64
glusterfs-extra-xlators-3.13.2-2.el7.x86_64
glusterfs-api-3.13.2-2.el7.x86_64
[root@nfs01 glusterfs]#

The short of it is that everything works, and mounts on guests work, as long as I don't try to write to the NFS share from my clients.  As soon as I write to the share, everything comes apart like this:

-sh-4.2$ pwd
/n/my.dom/tom
-sh-4.2$ ls -altri
total 6258
11715278280495367299 -rw-------. 1 tom@xxxxxx tom@xxxxxx     231 Feb 17 20:15 .bashrc
10937819299152577443 -rw-------. 1 tom@xxxxxx tom@xxxxxx     193 Feb 17 20:15 .bash_profile
10823746994379198104 -rw-------. 1 tom@xxxxxx tom@xxxxxx      18 Feb 17 20:15 .bash_logout
10718721668898812166 drwxr-xr-x. 3 root       root           4096 Mar  5 02:46 ..
12008425472191154054 drwx------. 2 tom@xxxxxx tom@xxxxxx    4096 Mar 18 03:07 .ssh
13763048923429182948 -rw-rw-r--. 1 tom@xxxxxx tom@xxxxxx 6359568 Mar 25 22:38 opennebula-cores.tar.gz
11674701370106210511 -rw-rw-r--. 1 tom@xxxxxx tom@xxxxxx       4 Apr  9 23:25 meh.txt
 9326637590629964475 -rw-r--r--. 1 tom@xxxxxx tom@xxxxxx   24970 May  1 01:30 nfs-trace-working.dat.gz
 9337343577229627320 -rw-------. 1 tom@xxxxxx tom@xxxxxx    3734 May  1 23:38 .bash_history
11438151930727967183 drwx------. 3 tom@xxxxxx tom@xxxxxx    4096 May  1 23:58 .
 9865389421596220499 -rw-r--r--. 1 tom@xxxxxx tom@xxxxxx    4096 May  1 23:58 .meh.txt.swp
-sh-4.2$ touch test.txt
-sh-4.2$ vi test.txt
-sh-4.2$ ls -altri
ls: cannot open directory .: Permission denied
-sh-4.2$ ls -altri
ls: cannot open directory .: Permission denied
-sh-4.2$ ls -altri

This is followed by a slew of other errors in apps using the gluster volume.  These errors include:

02/05/2018 23:10:52 : epoch 5aea7bd5 : nfs02.nix.my.dom : ganesha.nfsd-5891[svc_12] nfs_rpc_process_request :DISP :INFO :Could not authenticate request... rejecting with AUTH_STAT=RPCSEC_GSS_CREDPROBLEM


==> ganesha-gfapi.log <==
[2018-05-03 04:32:18.009245] I [MSGID: 114021] [client.c:2369:notify] 0-gv01-client-0: current graph is no longer active, destroying rpc_client
[2018-05-03 04:32:18.009338] I [MSGID: 114021] [client.c:2369:notify] 0-gv01-client-1: current graph is no longer active, destroying rpc_client
[2018-05-03 04:32:18.009499] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-gv01-client-0: disconnected from gv01-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2018-05-03 04:32:18.009557] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-gv01-client-1: disconnected from gv01-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-05-03 04:32:18.009610] E [MSGID: 108006] [afr-common.c:5164:__afr_handle_child_down_event] 0-gv01-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.


[2018-05-01 22:43:06.412067] E [MSGID: 114058] [client-handshake.c:1571:client_query_portmap_cbk] 0-gv01-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2018-05-01 22:43:55.554833] E [socket.c:2374:socket_connect_finish] 0-gv01-client-0: connection to 192.168.0.131:49152 failed (Connection refused); disconnecting socket
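
In case the RPCSEC_GSS_CREDPROBLEM above is a Kerberos-side issue, a sanity check I can run on each Ganesha node is the following (default keytab path; the nfs/ principal is what IPA creates for the host):

klist -kte /etc/krb5.keytab
kinit -kt /etc/krb5.keytab nfs/nfs02.nix.my.dom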


So I'm wondering: is this due to the two-node gluster setup, as it appears to be, and what do I really need to do here?  Should I go with the recommended 3-node setup, which would give me a proper quorum, to avoid this?  Or does it not actually matter that I have a 2-node gluster cluster without a quorum, and this is due to something else entirely?
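
If the 3-node route is the answer, my understanding is the conversion would be a single add-brick; an arbiter brick would also satisfy quorum without a full third copy (the brick path here is an assumption):

gluster volume add-brick gv01 replica 3 arbiter 1 nfs03:/bricks/0/gv01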

Again, any time I check the gluster volumes, everything checks out.  The results of both 'gluster volume info' and 'gluster volume status' are always as I pasted above, fully working.
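
The one thing I haven't pasted is the heal state; for a replica volume that's worth checking too:

gluster volume heal gv01 info
gluster volume heal gv01 info split-brain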

I'm also using Free IPA as the Linux KDC with this solution.



_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users



