NFS 4.0 bad RIP

Hi
 
We are seeing an issue that I haven't seen reported anywhere else yet.  We have some machines running 4.19.104 (w/ one additional patch, more on that in a moment) that after some period of time always seem to die with an NFS client-related hung-task trace reporting a bad RIP.  The machines are mounting Kerberized NFS 4.0:
 
	e.g.

	(rw,noatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,acregmax=120,acdirmax=120,hard,proto=tcp,timeo=600,retrans=2,sec=krb5p,clientaddr=172.20.21.7,local_lock=none,addr=172.20.21.4)
 
For the avoidance of doubt (although this would very much seem to be a client-side issue), the server-side details are:

Running the same kernel release, the server is exporting a ZFS (0.8.4) volume w/ sharenfs disabled (i.e. using knfsd).  It's running the nfs-utils release included w/ Debian 9, and we run rpc.svcgssd (i.e. instead of gssproxy) because we ran into issues w/ gssproxy and some users (see patch + notes below).
The NFS server itself never seems at all bothered, as it just keeps chugging along, irrespective of client hangs.  When the clients hang, they get rebooted, and after they come back up, go happily on their way until the bug manifests again.
 
The client machines seem to eventually (15-40 days on average) die with a trace that typically looks like this:
 
[978587.033741] INFO: task ApplicationImpl:1123136 blocked for more than 120 seconds.
[978587.033743]       Not tainted 4.19.104-mars #31
[978587.033743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[978587.033744] ApplicationImpl D    0 1123136 1122989 0x00000000
[978587.033746] Call Trace:
[978587.033752]  __schedule+0x3f9/0x880
[978587.033755]  ? __switch_to_asm+0x35/0x70
[978587.033756]  ? __switch_to_asm+0x41/0x70
[978587.033758]  ? bit_wait+0x60/0x60
[978587.033759]  schedule+0x36/0x80
[978587.033760]  io_schedule+0x16/0x40
[978587.033761]  bit_wait_io+0x11/0x60
[978587.033763]  __wait_on_bit+0x64/0x90
[978587.033765]  out_of_line_wait_on_bit+0x8e/0xb0
[978587.033767]  ? init_wait_var_entry+0x50/0x50
[978587.033781]  nfs_wait_on_request+0x46/0x50 [nfs]
[978587.033788]  nfs_lock_and_join_requests+0x121/0x510 [nfs]
[978587.033790]  ? radix_tree_lookup_slot+0x22/0x50
[978587.033796]  nfs_updatepage+0x168/0x970 [nfs]
[978587.033802]  nfs_write_end+0xe1/0x3a0 [nfs]
[978587.033805]  generic_perform_write+0xff/0x1b0
[978587.033810]  nfs_file_write+0xd7/0x250 [nfs]
[978587.033811]  __vfs_write+0x10f/0x1a0
[978587.033812]  vfs_write+0xb5/0x1a0
[978587.033813]  ksys_write+0x5c/0xe0
[978587.033815]  __x64_sys_write+0x1a/0x20
[978587.033817]  do_syscall_64+0x5a/0x100
[978587.033818]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[978587.033820] RIP: 0033:0x7f62feb121cd
[978587.033824] Code: Bad RIP value.
[978587.033825] RSP: 002b:00007f61ac5adb50 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
[978587.033826] RAX: ffffffffffffffda RBX: 0000000000002000 RCX: 00007f62feb121cd
[978587.033827] RDX: 0000000000002000 RSI: 00007f61ac5adbb0 RDI: 0000000000000222
[978587.033827] RBP: 00007f61ac5adb80 R08: 0000000000000010 R09: 000000070950c5a0
[978587.033828] R10: 0000000000000f80 R11: 0000000000000293 R12: 0000000000002000
[978587.033828] R13: 00007f61ac5adbb0 R14: 0000000000000222 R15: 00007f62e333a500
[978587.033857] INFO: task ApplicationImpl:1124088 blocked for more than 120 seconds.
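
The sleep at the top of that stack is nfs_wait_on_request(); paraphrasing the 4.19 sources in fs/nfs/pagelist.c (a sketch for illustration, not a verbatim quote):

	/* nfs_lock_and_join_requests() found a request covering this page
	 * that is already PG_BUSY, and sleeps until the owner clears it: */
	int nfs_wait_on_request(struct nfs_page *req)
	{
		/* uninterruptible I/O wait: if nothing ever clears PG_BUSY,
		 * the writing task stays blocked here indefinitely */
		return wait_on_bit_io(&req->wb_flags, PG_BUSY,
				      TASK_UNINTERRUPTIBLE);
	}

So the writer appears to be waiting for another request on the same page to complete, and that completion never comes.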

The one patch that we have on the machines is in net/sunrpc/cache.c:
 
diff -ruBb net/sunrpc/cache.c ../../linux-4.19.104/net/sunrpc/cache.c
--- net/sunrpc/cache.c  2020-02-14 21:33:28.000000000 +0000
+++ ../../linux-4.19.104/net/sunrpc/cache.c     2020-03-03 21:35:41.330763655 +0000
@@ -37,6 +37,7 @@
 #include "netns.h"
 
 #define  RPCDBG_FACILITY RPCDBG_CACHE
+#define  BIG_PAGE_SIZE (PAGE_SIZE << 2)
 
 static bool cache_defer_req(struct cache_req *req, struct cache_head *item);
 static void cache_revisit_request(struct cache_head *item);
@@ -539,7 +540,7 @@
  * it to be revisited when cache info is available
  */
 
-#define DFR_HASHSIZE    (PAGE_SIZE/sizeof(struct list_head))
+#define DFR_HASHSIZE    (BIG_PAGE_SIZE/sizeof(struct list_head))
 #define DFR_HASH(item)  ((((long)item)>>4 ^ (((long)item)>>13)) % DFR_HASHSIZE)
 
 #define DFR_MAX 300     /* ??? */
@@ -769,12 +770,12 @@
                                struct cache_request *crq)
 {
         char *bp = crq->buf;
-        int len = PAGE_SIZE;
+        int len = BIG_PAGE_SIZE;
 
         detail->cache_request(detail, crq->item, &bp, &len);
         if (len < 0)
                 return -EAGAIN;
-        return PAGE_SIZE - len;
+        return BIG_PAGE_SIZE - len;
 }
 
 static ssize_t cache_read(struct file *filp, char __user *buf, size_t count,
@@ -1188,7 +1189,8 @@
                 /* Too late to make an upcall */
                 return -EAGAIN;
 
-        buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+        buf = kmalloc(BIG_PAGE_SIZE, GFP_KERNEL);
         if (!buf)
                 return -EAGAIN;
We added this patch because some users seemed to belong to too many groups, and the up/downcalls to Kerberos were failing, leaving them unable to access their home directories.
I mention it because I'd like to rule out the possibility that this patch is causing problems somewhere down the line, perhaps related to the issue we're seeing.
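
To put rough numbers on the patch: our understanding is that the rsi upcall to rpc.svcgssd hex-encodes the client's GSS/krb5 token, and that token grows with the user's group list, so the stock PAGE_SIZE buffer tops out around a 2 KiB token.  A back-of-envelope sketch (illustration only, not kernel code; 4096 assumed as the typical x86-64 PAGE_SIZE):

	#include <stdio.h>

	int main(void)
	{
		unsigned long page = 4096;      /* typical x86-64 PAGE_SIZE */
		unsigned long big  = page << 2; /* BIG_PAGE_SIZE from our patch */

		/* hex encoding doubles the token size, so the largest token
		 * that fits in the upcall buffer is roughly half of it */
		printf("max token: ~%lu bytes stock, ~%lu bytes patched\n",
		       page / 2, big / 2);
		return 0;
	}

That raises the ceiling to roughly an 8 KiB token, which has been enough for our many-group users so far.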
 
Has anyone seen anything similar to this, or perhaps has any thoughts on what might be going on here?
 
Thanks
 
--Chris



