[PATCH v2 0/2] NFSD: handling memory shortage problem with Courteous server

Currently the idle timeout for courtesy clients is fixed at 1 day. If
a large number of courtesy clients remain in the system, they can cause
a memory shortage that affects the operation of other modules in the
kernel. This problem can be observed by running the pynfs nfs4.0
CID5 test in a loop. Eventually the system runs out of memory and
rpc.gssd fails to add a new watch:

rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
                No space left on device

and alloc_inode also fails due to lack of memory:

Call Trace:
<TASK>
        dump_stack_lvl+0x33/0x42
        dump_header+0x4a/0x1ed
        oom_kill_process+0x80/0x10d
        out_of_memory+0x237/0x25f
        __alloc_pages_slowpath.constprop.0+0x617/0x7b6
        __alloc_pages+0x132/0x1e3
        alloc_slab_page+0x15/0x33
        allocate_slab+0x78/0x1ab
        ? alloc_inode+0x38/0x8d
        ___slab_alloc+0x2af/0x373
        ? alloc_inode+0x38/0x8d
        ? slab_pre_alloc_hook.constprop.0+0x9f/0x158
        ? alloc_inode+0x38/0x8d
        __slab_alloc.constprop.0+0x1c/0x24
        kmem_cache_alloc_lru+0x8c/0x142
        alloc_inode+0x38/0x8d
        iget_locked+0x60/0x126
        kernfs_get_inode+0x18/0x105
        kernfs_iop_lookup+0x6d/0xbc
        __lookup_slow+0xb7/0xf9
        lookup_slow+0x3a/0x52
        walk_component+0x90/0x100
        ? inode_permission+0x87/0x128
        link_path_walk.part.0.constprop.0+0x266/0x2ea
        ? path_init+0x101/0x2f2
        path_lookupat+0x4c/0xfa
        filename_lookup+0x63/0xd7
        ? getname_flags+0x32/0x17a
        ? kmem_cache_alloc+0x11f/0x144
        ? getname_flags+0x16c/0x17a
        user_path_at_empty+0x37/0x4b
        do_readlinkat+0x61/0x102
        __x64_sys_readlinkat+0x18/0x1b
        do_syscall_64+0x57/0x72
        entry_SYSCALL_64_after_hwframe+0x46/0xb0

This patch addresses this problem by:

   . removing the fixed 1-day idle time limit for courtesy clients.
     A courtesy client is now allowed to remain valid as long as
     available system memory stays above 80%.

   . when available system memory drops below 80%, the laundromat
     starts trimming older courtesy clients. The number of courtesy
     clients to trim is a percentage of the total number of courtesy
     clients in the system, computed from the current percentage of
     available system memory.

   . the percentage of courtesy clients to be trimmed is determined
     by this table (see the sketch after this list):

     ----------------------------------
     |  % memory | % courtesy clients |
     | available |    to trim         |
     ----------------------------------
     |  > 80     |      0             |
     |  > 70     |     10             |
     |  > 60     |     20             |
     |  > 50     |     40             |
     |  > 40     |     60             |
     |  > 30     |     80             |
     |  < 30     |    100             |
     ----------------------------------

   . due to the overhead associated with removing client records,
     at most 128 clients are trimmed in each laundromat run. This
     prevents the laundromat from spending so long destroying clients
     that it fails to perform its other tasks in a timely manner.

   . the laundromat is scheduled to run sooner if there are more
     courtesy clients that need to be destroyed.

The shrinker method was evaluated and found not suitable for this
problem for these reasons (a rough sketch of the rejected shrinker
follows this list):

. destroying the NFSv4 client in the shrinker context can cause a
  deadlock, since nfsd_file_put calls into the underlying FS code and
  we have no control over what it does, as seen in this stack
  trace:

 ======================================================
 WARNING: possible circular locking dependency detected
 5.19.0-rc2_sk+ #1 Not tainted
 ------------------------------------------------------
 lck/31847 is trying to acquire lock:
 ffff88811d268850 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}, at: btrfs_inode_lock+0x38/0x70
 but task is already holding lock:
 ffffffffb41848c0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x506/0x1db0
 which lock already depends on the new lock.
 the existing dependency chain (in reverse order) is:

 -> #1 (fs_reclaim){+.+.}-{0:0}:
       fs_reclaim_acquire+0xc0/0x100
       __kmalloc+0x51/0x320
       btrfs_buffered_write+0x2eb/0xd90
       btrfs_do_write_iter+0x6bf/0x11c0
       do_iter_readv_writev+0x2bb/0x5a0
       do_iter_write+0x131/0x630
       nfsd_vfs_write+0x4da/0x1900 [nfsd]
       nfsd4_write+0x2ac/0x760 [nfsd]
       nfsd4_proc_compound+0xce8/0x23e0 [nfsd]
       nfsd_dispatch+0x4ed/0xc10 [nfsd]
       svc_process_common+0xd3f/0x1b00 [sunrpc]
       svc_process+0x361/0x4f0 [sunrpc]
       nfsd+0x2d6/0x570 [nfsd]
       kthread+0x2a1/0x340
       ret_from_fork+0x22/0x30

 -> #0 (&sb->s_type->i_mutex_key#16){+.+.}-{3:3}:
       __lock_acquire+0x318d/0x7830
       lock_acquire+0x1bb/0x500
       down_write+0x82/0x130
       btrfs_inode_lock+0x38/0x70
       btrfs_sync_file+0x280/0x1010
       nfsd_file_flush.isra.0+0x1b/0x220 [nfsd]
       nfsd_file_put+0xd4/0x110 [nfsd]
       release_all_access+0x13a/0x220 [nfsd]
       nfs4_free_ol_stateid+0x40/0x90 [nfsd]
       free_ol_stateid_reaplist+0x131/0x210 [nfsd]
       release_openowner+0xf7/0x160 [nfsd]
       __destroy_client+0x3cc/0x740 [nfsd]
       nfsd_cc_lru_scan+0x271/0x410 [nfsd]
       shrink_slab.constprop.0+0x31e/0x7d0
       shrink_node+0x54b/0xe50
       try_to_free_pages+0x394/0xba0
       __alloc_pages_slowpath.constprop.0+0x5d2/0x1db0
       __alloc_pages+0x4d6/0x580
       __handle_mm_fault+0xc25/0x2810
       handle_mm_fault+0x136/0x480
       do_user_addr_fault+0x3d8/0xec0
       exc_page_fault+0x5d/0xc0
       asm_exc_page_fault+0x27/0x30
 other info that might help us debug this:
 Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(&sb->s_type->i_mutex_key#16);
                               lock(fs_reclaim);
  lock(&sb->s_type->i_mutex_key#16);
  *** DEADLOCK ***

. the shrinker kicks in only when available memory drops very low,
  roughly below 5%. By that time, other components in the system have
  already run into problems due to the memory shortage. For example,
  rpc.gssd starts failing to add watches in
  /var/lib/nfs/rpc_pipefs/nfsd4_cb once the memory consumed by these
  watches reaches about 1% of available system memory.
        
. destroying an NFSv4 client has significant overhead due to the
  upcall to user space to remove the client record, which might access
  the storage device. There is a potential deadlock if the storage
  subsystem needs to allocate memory.
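
For reference, the shrinker-based approach that was evaluated looked
roughly like the sketch below. nfsd_cc_lru_scan is the name that shows
up in the lockdep splat above; everything else here (the counter, the
expire_courtesy_clients helper, and the details) is illustrative only
and not part of the actual patches:

	#include <linux/shrinker.h>
	#include <linux/atomic.h>

	/* Hypothetical counter of courtesy clients currently on the LRU. */
	static atomic_t nfsd_courtesy_client_count = ATOMIC_INIT(0);

	/* Hypothetical helper that destroys up to @nr courtesy clients. */
	static unsigned long expire_courtesy_clients(unsigned long nr);

	static unsigned long nfsd_cc_lru_count(struct shrinker *shrink,
					       struct shrink_control *sc)
	{
		/* Report how many courtesy clients could be reclaimed. */
		return atomic_read(&nfsd_courtesy_client_count);
	}

	static unsigned long nfsd_cc_lru_scan(struct shrinker *shrink,
					      struct shrink_control *sc)
	{
		/*
		 * Destroying clients here runs __destroy_client() -> ... ->
		 * nfsd_file_put() while already under fs_reclaim, which is
		 * exactly the circular locking dependency shown above.
		 */
		return expire_courtesy_clients(sc->nr_to_scan);
	}

	static struct shrinker nfsd_cc_shrinker = {
		.count_objects	= nfsd_cc_lru_count,
		.scan_objects	= nfsd_cc_lru_scan,
		.seeks		= DEFAULT_SEEKS,
	};

The shrinker would be registered with register_shrinker() at nfsd
startup; destroying clients from its scan callback is what runs into
the fs_reclaim deadlock and the late-trigger problem described above.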


