[SOLVED] [Nfs-ganesha-support] volume start: gv01: failed: Quorum not met. Volume operation not allowed.

Hey All,

It appears I've solved this one, and NFS mounts now work on all my clients. No issues since fixing it a few hours back.

RESOLUTION

SELinux was to blame for the trouble. I noticed the following AVC denials in the audit logs on 2 of the 3 NFS servers (nfs01 and nfs02; nfs03 was fine):

type=AVC msg=audit(1526965320.850:4094): avc: denied { write } for pid=8714 comm="ganesha.nfsd" name="nfs_0" dev="dm-0" ino=201547689 scontext=system_u:system_r:ganesha_t:s0 tcontext=system_u:object_r:krb5_host_rcache_t:s0 tclass=file
type=SYSCALL msg=audit(1526965320.850:4094): arch=c000003e syscall=2 success=no exit=-13 a0=7f23b0003150 a1=2 a2=180 a3=2 items=0 ppid=1 pid=8714 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ganesha.nfsd" exe="/usr/bin/ganesha.nfsd" subj=system_u:system_r:ganesha_t:s0 key=(null)
type=PROCTITLE msg=audit(1526965320.850:4094): proctitle=2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54
type=AVC msg=audit(1526965320.850:4095): avc: denied { unlink } for pid=8714 comm="ganesha.nfsd" name="nfs_0" dev="dm-0" ino=201547689 scontext=system_u:system_r:ganesha_t:s0 tcontext=system_u:object_r:krb5_host_rcache_t:s0 tclass=file
type=SYSCALL msg=audit(1526965320.850:4095): arch=c000003e syscall=87 success=no exit=-13 a0=7f23b0004100 a1=7f23b0000050 a2=7f23b0004100 a3=5 items=0 ppid=1 pid=8714 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="ganesha.nfsd" exe="/usr/bin/ganesha.nfsd" subj=system_u:system_r:ganesha_t:s0 key=(null)
type=PROCTITLE msg=audit(1526965320.850:4095): proctitle=2F7573722F62696E2F67616E657368612E6E667364002D4C002F7661722F6C6F672F67616E657368612F67616E657368612E6C6F67002D66002F6574632F67616E657368612F67616E657368612E636F6E66002D4E004E49565F4556454E54

The fix was to adjust the SELinux policy using audit2allow.
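
For anyone hitting the same thing, the workflow was roughly the following (the module name "ganesha_local" is just my label, nothing official):

# Generate and load a local SELinux policy module from the ganesha.nfsd denials:
grep ganesha.nfsd /var/log/audit/audit.log | audit2allow -M ganesha_local
semodule -i ganesha_local.pp

Then re-test the mounts and keep watching /var/log/audit/audit.log for further denials.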

All the errors below, including the ones in the linked posts, were due to that.

It turns out that whenever it worked, the client had hit the only working server in the system, nfs03. Whenever it didn't work, it was hitting one of the non-working servers. So sometimes it worked, and other times it didn't. It looked like an HAProxy / Keepalived issue as well, since I couldn't mount using the VIP but could using the host directly, but that wasn't the case either.
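
To illustrate the test that threw me off (the VIP name "nfs-vip" here is a stand-in, not my real hostname):

# Mount via the VIP vs. directly via a backend host:
mount -t nfs4 nfs-vip:/n /mnt   # hit-and-miss, depending on which server the VIP landed on
mount -t nfs4 nfs03:/n /mnt     # always worked, since nfs03 had no denials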

I had also added a third brick to the GlusterFS volume, on nfs03, to see if the backend FS was to blame, since GlusterFS recommends a minimum of 3 bricks for replication, but that had no effect.
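
For the record, adding the third brick was along these lines (assuming the brick path matches the existing two):

gluster peer probe nfs03
gluster volume add-brick gv01 replica 3 nfs03:/bricks/0/gv01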

In case anyone runs into this, I've added notes here as well:

http://microdevsys.com/wp/kernel-nfs-nfs4_discover_server_trunking-unhandled-error-512-exiting-with-error-eio-and-mount-hangs/

http://microdevsys.com/wp/nfs-reply-xid-3844308326-reply-err-20-auth-rejected-credentials-client-should-begin-new-session/

The errors thrown included:

NFS reply xid 3844308326 reply ERR 20: Auth Rejected Credentials (client should begin new session)

kernel: NFS: nfs4_discover_server_trunking unhandled error -512. Exiting with error EIO and mount hangs

Plus the kernel trace below.

--
Cheers,
Tom K.
-------------------------------------------------------------------------------------

Living on earth is expensive, but it includes a free trip around the sun.


May 21 23:53:13 psql01 kernel: CPU: 3 PID: 2273 Comm: mount.nfs Tainted: G L ------------ 3.10.0-693.21.1.el7.x86_64 #1
.
.
.
May 21 23:53:13 psql01 kernel: task: ffff880136335ee0 ti: ffff8801376b0000 task.ti: ffff8801376b0000
May 21 23:53:13 psql01 kernel: RIP: 0010:[<ffffffff816b6545>] [<ffffffff816b6545>] _raw_spin_unlock_irqrestore+0x15/0x20
May 21 23:53:13 psql01 kernel: RSP: 0018:ffff8801376b3a60  EFLAGS: 00000206
May 21 23:53:13 psql01 kernel: RAX: ffffffffc05ab078 RBX: ffff880036973928 RCX: dead000000000200
May 21 23:53:13 psql01 kernel: RDX: ffffffffc05ab078 RSI: 0000000000000206 RDI: 0000000000000206
May 21 23:53:13 psql01 kernel: RBP: ffff8801376b3a60 R08: ffff8801376b3ab8 R09: ffff880137de1200
May 21 23:53:13 psql01 kernel: R10: ffff880036973928 R11: 0000000000000000 R12: ffff880036973928
May 21 23:53:13 psql01 kernel: R13: ffff8801376b3a58 R14: ffff88013fd98a40 R15: ffff8801376b3a58
May 21 23:53:13 psql01 kernel: FS: 00007fab48f07880(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
May 21 23:53:13 psql01 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
May 21 23:53:13 psql01 kernel: CR2: 00007f99793d93cc CR3: 000000013761e000 CR4: 00000000000007e0
May 21 23:53:13 psql01 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 21 23:53:13 psql01 kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 21 23:53:13 psql01 kernel: Call Trace:
May 21 23:53:13 psql01 kernel: [<ffffffff810b4d86>] finish_wait+0x56/0x70
May 21 23:53:13 psql01 kernel: [<ffffffffc0580361>] nfs_wait_client_init_complete+0xa1/0xe0 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
May 21 23:53:13 psql01 kernel: [<ffffffffc0581e9b>] nfs_get_client+0x22b/0x470 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc05eafd8>] nfs4_set_client+0x98/0x130 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05ec77e>] nfs4_create_server+0x13e/0x3b0 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05e391e>] nfs4_remote_mount+0x2e/0x60 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffff81209f1e>] mount_fs+0x3e/0x1b0
May 21 23:53:13 psql01 kernel: [<ffffffff811aa685>] ? __alloc_percpu+0x15/0x20
May 21 23:53:13 psql01 kernel: [<ffffffff81226d57>] vfs_kern_mount+0x67/0x110
May 21 23:53:13 psql01 kernel: [<ffffffffc05e3846>] nfs_do_root_mount+0x86/0xc0 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05e3c44>] nfs4_try_mount+0x44/0xc0 [nfsv4]
May 21 23:53:13 psql01 kernel: [<ffffffffc05826d7>] ? get_nfs_version+0x27/0x90 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc058ec9b>] nfs_fs_mount+0x4cb/0xda0 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc058fbe0>] ? nfs_clone_super+0x140/0x140 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffffc058daa0>] ? param_set_portnr+0x70/0x70 [nfs]
May 21 23:53:13 psql01 kernel: [<ffffffff81209f1e>] mount_fs+0x3e/0x1b0
May 21 23:53:13 psql01 kernel: [<ffffffff811aa685>] ? __alloc_percpu+0x15/0x20
May 21 23:53:13 psql01 kernel: [<ffffffff81226d57>] vfs_kern_mount+0x67/0x110
May 21 23:53:13 psql01 kernel: [<ffffffff81229263>] do_mount+0x233/0xaf0
May 21 23:53:13 psql01 kernel: [<ffffffff81229ea6>] SyS_mount+0x96/0xf0
May 21 23:53:13 psql01 kernel: [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
May 21 23:53:13 psql01 kernel: [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146




On 5/7/2018 10:28 PM, TomK wrote:
On 4/11/2018 11:54 AM, Alex K wrote:

Hey Guys,

Returning to this topic after disabling the quorum:

cluster.quorum-type: none
cluster.server-quorum-type: none
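
(For anyone following along, those two options were set with the usual volume-set commands:)

gluster volume set gv01 cluster.quorum-type none
gluster volume set gv01 cluster.server-quorum-type none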

I've run into a number of gluster errors (see below).

I'm using gluster as the backend for my NFS storage.  I have gluster running on two nodes, nfs01 and nfs02, and the volume is mounted on /n on each host.  The path /n is in turn shared out by NFS Ganesha.  It's a two-node setup with quorum disabled, as noted above:

[root@nfs02 ganesha]# mount|grep gv01
nfs02:/gv01 on /n type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)

[root@nfs01 glusterfs]# mount|grep gv01
nfs01:/gv01 on /n type fuse.glusterfs (rw,relatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
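
For context, the Ganesha side is a standard FSAL_GLUSTER export. Mine is roughly like the sketch below; the Export_Id, Pseudo path, and SecType are from memory, so treat them as placeholders rather than a copy of my ganesha.conf:

EXPORT {
    Export_Id = 1;
    Path = "/";
    Pseudo = "/n";
    Access_Type = RW;
    Squash = No_Root_Squash;
    SecType = "krb5";
    FSAL {
        Name = GLUSTER;
        Hostname = "localhost";
        Volume = "gv01";
    }
}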

Gluster always reports the volume as healthy whenever I run the two commands below:

[root@nfs01 glusterfs]# gluster volume info

Volume Name: gv01
Type: Replicate
Volume ID: e5ccc75e-5192-45ac-b410-a34ebd777666
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: nfs01:/bricks/0/gv01
Brick2: nfs02:/bricks/0/gv01
Options Reconfigured:
cluster.server-quorum-type: none
cluster.quorum-type: none
server.event-threads: 8
client.event-threads: 8
performance.readdir-ahead: on
performance.write-behind-window-size: 8MB
performance.io-thread-count: 16
performance.cache-size: 1GB
nfs.trusted-sync: on
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
[root@nfs01 glusterfs]# gluster status
unrecognized word: status (position 0)
[root@nfs01 glusterfs]# gluster volume status
Status of volume: gv01
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick nfs01:/bricks/0/gv01                  49152     0          Y       1422
Brick nfs02:/bricks/0/gv01                  49152     0          Y       1422
Self-heal Daemon on localhost               N/A       N/A        Y       1248
Self-heal Daemon on nfs02.nix.my.dom        N/A       N/A        Y       1251

Task Status of Volume gv01
------------------------------------------------------------------------------
There are no active volume tasks

[root@nfs01 glusterfs]#

[root@nfs01 glusterfs]# rpm -aq|grep -Ei gluster
glusterfs-3.13.2-2.el7.x86_64
glusterfs-devel-3.13.2-2.el7.x86_64
glusterfs-fuse-3.13.2-2.el7.x86_64
glusterfs-api-devel-3.13.2-2.el7.x86_64
centos-release-gluster313-1.0-1.el7.centos.noarch
python2-gluster-3.13.2-2.el7.x86_64
glusterfs-client-xlators-3.13.2-2.el7.x86_64
glusterfs-server-3.13.2-2.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.9.x86_64
glusterfs-cli-3.13.2-2.el7.x86_64
centos-release-gluster312-1.0-1.el7.centos.noarch
python2-glusterfs-api-1.1-1.el7.noarch
glusterfs-libs-3.13.2-2.el7.x86_64
glusterfs-extra-xlators-3.13.2-2.el7.x86_64
glusterfs-api-3.13.2-2.el7.x86_64
[root@nfs01 glusterfs]#

The short of it is that everything works, and mounts on guests work, as long as I don't try to write to the NFS share from my clients.  As soon as I write to the share, everything comes apart like this:

-sh-4.2$ pwd
/n/my.dom/tom
-sh-4.2$ ls -altri
total 6258
11715278280495367299 -rw-------. 1 tom@xxxxxx tom@xxxxxx     231 Feb 17 20:15 .bashrc
10937819299152577443 -rw-------. 1 tom@xxxxxx tom@xxxxxx     193 Feb 17 20:15 .bash_profile
10823746994379198104 -rw-------. 1 tom@xxxxxx tom@xxxxxx      18 Feb 17 20:15 .bash_logout
10718721668898812166 drwxr-xr-x. 3 root       root           4096 Mar  5 02:46 ..
12008425472191154054 drwx------. 2 tom@xxxxxx tom@xxxxxx    4096 Mar 18 03:07 .ssh
13763048923429182948 -rw-rw-r--. 1 tom@xxxxxx tom@xxxxxx 6359568 Mar 25 22:38 opennebula-cores.tar.gz
11674701370106210511 -rw-rw-r--. 1 tom@xxxxxx tom@xxxxxx       4 Apr  9 23:25 meh.txt
 9326637590629964475 -rw-r--r--. 1 tom@xxxxxx tom@xxxxxx   24970 May  1 01:30 nfs-trace-working.dat.gz
 9337343577229627320 -rw-------. 1 tom@xxxxxx tom@xxxxxx    3734 May  1 23:38 .bash_history
11438151930727967183 drwx------. 3 tom@xxxxxx tom@xxxxxx    4096 May  1 23:58 .
 9865389421596220499 -rw-r--r--. 1 tom@xxxxxx tom@xxxxxx    4096 May  1 23:58 .meh.txt.swp
-sh-4.2$ touch test.txt
-sh-4.2$ vi test.txt
-sh-4.2$ ls -altri
ls: cannot open directory .: Permission denied
-sh-4.2$ ls -altri
ls: cannot open directory .: Permission denied
-sh-4.2$ ls -altri

This is followed by a slew of other errors in apps using the gluster volume.  These errors include:

02/05/2018 23:10:52 : epoch 5aea7bd5 : nfs02.nix.my.dom : ganesha.nfsd-5891[svc_12] nfs_rpc_process_request :DISP :INFO :Could not authenticate request... rejecting with AUTH_STAT=RPCSEC_GSS_CREDPROBLEM


==> ganesha-gfapi.log <==
[2018-05-03 04:32:18.009245] I [MSGID: 114021] [client.c:2369:notify] 0-gv01-client-0: current graph is no longer active, destroying rpc_client
[2018-05-03 04:32:18.009338] I [MSGID: 114021] [client.c:2369:notify] 0-gv01-client-1: current graph is no longer active, destroying rpc_client
[2018-05-03 04:32:18.009499] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-gv01-client-0: disconnected from gv01-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2018-05-03 04:32:18.009557] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-gv01-client-1: disconnected from gv01-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-05-03 04:32:18.009610] E [MSGID: 108006] [afr-common.c:5164:__afr_handle_child_down_event] 0-gv01-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.


[2018-05-01 22:43:06.412067] E [MSGID: 114058] [client-handshake.c:1571:client_query_portmap_cbk] 0-gv01-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2018-05-01 22:43:55.554833] E [socket.c:2374:socket_connect_finish] 0-gv01-client-0: connection to 192.168.0.131:49152 failed (Connection refused); disconnecting socket
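
In case the RPCSEC_GSS_CREDPROBLEM above is a Kerberos-side issue, a sanity check I can run on each Ganesha node is the following (default keytab path; the nfs/ principal is what IPA creates for the host):

klist -kte /etc/krb5.keytab
kinit -kt /etc/krb5.keytab nfs/nfs02.nix.my.dom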


So I'm wondering: is this due to the two-node gluster setup, as it appears to be, and what do I really need to do here?  Should I go with the recommended 3-node setup, which would give me a proper quorum, to avoid this?  Or does it not actually matter that I have a 2-node gluster cluster without a quorum, and this is due to something else entirely?
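
If the 3-node route is the answer, my understanding is the conversion would be a single add-brick; an arbiter brick would also satisfy quorum without a full third copy (the brick path here is an assumption):

gluster volume add-brick gv01 replica 3 arbiter 1 nfs03:/bricks/0/gv01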

Again, any time I check the gluster volumes, everything checks out.  The results of both 'gluster volume info' and 'gluster volume status' are always as I pasted above, fully working.
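
The one thing I haven't pasted is the heal state; for a replica volume that's worth checking too:

gluster volume heal gv01 info
gluster volume heal gv01 info split-brain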

I'm also using Free IPA as the Linux KDC with this solution.



_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users



