Re: [PATCH v2 0/2] NFSD: handling memory shortage problem with Courteous server

On 7/5/22 7:50 AM, Chuck Lever III wrote:
Hello Dai -

I agree that resource management is indeed an appropriate next
step for the courteous server. Thanks for tackling this!

More comments are inline.


On Jul 4, 2022, at 3:05 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:

Currently the idle timeout for a courtesy client is fixed at 1 day. If
lots of courtesy clients remain in the system, they can cause a memory
shortage that affects the operation of other modules in the kernel.
This problem can be observed by running the pynfs nfs4.0 CID5 test in
a loop. Eventually the system runs out of memory and rpc.gssd fails
to add new watches:

rpc.gssd[3851]: ERROR: inotify_add_watch failed for nfsd4_cb/clnt6c2e:
                No space left on device

and alloc_inode also fails with an out-of-memory error:

Call Trace:
<TASK>
        dump_stack_lvl+0x33/0x42
        dump_header+0x4a/0x1ed
        oom_kill_process+0x80/0x10d
        out_of_memory+0x237/0x25f
        __alloc_pages_slowpath.constprop.0+0x617/0x7b6
        __alloc_pages+0x132/0x1e3
        alloc_slab_page+0x15/0x33
        allocate_slab+0x78/0x1ab
        ? alloc_inode+0x38/0x8d
        ___slab_alloc+0x2af/0x373
        ? alloc_inode+0x38/0x8d
        ? slab_pre_alloc_hook.constprop.0+0x9f/0x158
        ? alloc_inode+0x38/0x8d
        __slab_alloc.constprop.0+0x1c/0x24
        kmem_cache_alloc_lru+0x8c/0x142
        alloc_inode+0x38/0x8d
        iget_locked+0x60/0x126
        kernfs_get_inode+0x18/0x105
        kernfs_iop_lookup+0x6d/0xbc
        __lookup_slow+0xb7/0xf9
        lookup_slow+0x3a/0x52
        walk_component+0x90/0x100
        ? inode_permission+0x87/0x128
        link_path_walk.part.0.constprop.0+0x266/0x2ea
        ? path_init+0x101/0x2f2
        path_lookupat+0x4c/0xfa
        filename_lookup+0x63/0xd7
        ? getname_flags+0x32/0x17a
        ? kmem_cache_alloc+0x11f/0x144
        ? getname_flags+0x16c/0x17a
        user_path_at_empty+0x37/0x4b
        do_readlinkat+0x61/0x102
        __x64_sys_readlinkat+0x18/0x1b
        do_syscall_64+0x57/0x72
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
These details are a little distracting. IMO you can summarize
the above with just this:

Currently the idle timeout for a courtesy client is fixed at 1 day. If
lots of courtesy clients remain in the system, they can cause a memory
shortage. This problem can be observed by running the pynfs nfs4.0
CID5 test in a loop.


Now I'm going to comment in reverse order here. To add context
for others on-list, when we designed courteous server, we had
assumed that eventually a shrinker would be used to garbage
collect courtesy clients. Dai has found some issues with that
approach:


The shrinker method was evaluated and found not suitable for this
problem for these reasons:

. destroying the NFSv4 client in the shrinker context can cause a
  deadlock, since nfsd_file_put calls into the underlying FS code
  and we have no control over what it will do, as seen in this
  stack trace:
[ ... stack trace snipped ... ]

I think I always had in mind that only the laundromat would be
responsible for harvesting courtesy clients. A shrinker might
trigger that activity, but as you point out, a deadlock is pretty
likely if the shrinker itself had to do the harvesting.
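
For illustration, here is a minimal userspace model of that division
of labor: the shrinker callback only records memory pressure and kicks
the laundromat, and all destruction happens in the laundromat's own
context. The names here are hypothetical, not the actual NFSD symbols.

/*
 * Model: the "shrinker" never destroys clients itself (which could
 * deadlock via nfsd_file_put -> filesystem code); it only records
 * pressure and kicks the "laundromat", which harvests from a safe
 * context.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_bool laundromat_kicked;

/* Modeled shrinker callback: cheap, never blocks, never frees. */
static unsigned long courtesy_shrink_scan(unsigned long nr_to_scan)
{
	(void)nr_to_scan;                       /* hint unused in this model */
	atomic_store(&laundromat_kicked, true); /* defer the real work */
	return 0;                               /* nothing freed here */
}

/* Modeled laundromat work item: the only place harvesting happens. */
static void laundromat_run(void)
{
	if (atomic_exchange(&laundromat_kicked, false))
		printf("laundromat: harvesting courtesy clients\n");
	else
		printf("laundromat: routine pass, no pressure signaled\n");
}

int main(void)
{
	courtesy_shrink_scan(128); /* memory pressure arrives */
	laundromat_run();          /* harvest happens here, not in reclaim */
	return 0;
}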


. destroying the NFSv4 client has significant overhead due to
  the upcall to user space to remove the client records, which
  might access the storage device. There is a potential deadlock
  if the storage subsystem needs to allocate memory.
The issue is that harvesting a courtesy client will involve
an upcall to nfsdcltracker, and that will result in I/O that
updates the tracker's database. Very likely this will require
further allocation of memory and thus it could deadlock the
system.

Now this might also be all the demonstration that we need
that managing courtesy resources cannot be done using the
system's shrinker facility -- expiring a client can never
be done when there is a direct reclaim waiting on it. I'm
interested in other opinions on that. Neil? Bruce? Trond?


. the shrinker kicks in only when available memory drops really
  low, below ~5%. By that time, other components in the system have
  already run into trouble with the memory shortage. For example,
  rpc.gssd starts failing to add watches in
  /var/lib/nfs/rpc_pipefs/nfsd4_cb once the memory consumed by those
  watches reaches about 1% of available system memory.
Your claim is that a courtesy client shrinker would be invoked
too late. That might be true on a server with 2GB of RAM, but
on a big system (say, a server with 64GB of RAM), 5% is still
more than 3GB -- wouldn't that be enough to harvest safely?

We can't optimize for tiny server systems because that almost
always hobbles the scalability of larger systems for no good
reason. Can you test with a large-memory server as well as a
small-memory server?

I don't have a system with a large memory configuration; my VM has
only 6GB of memory.

I think the shrinker is not an option due to the deadlock problem,
so we should just concentrate on the laundromat route.


I think the central question here is why is 5% not enough on
all systems. I would like to understand that better. It seems
like a primary scalability question that needs an answer so
a good harvesting heuristic can be derived.

One question in my mind is what is the maximum rate at which
the server converts active clients to courtesy clients, and
can the current laundromat scheme keep up with harvesting them
at that rate? The destructive scenario seems to be when courtesy
clients are manufactured faster than they can be harvested and
expunged.

That seems to be the case. Currently the laundromat destroys idle
courtesy clients after 1 day, and running CID5 in a loop generates
a ton of courtesy clients. Before the 1-day expiration occurs,
available memory already drops below about 1%, and the problems with
rpc.gssd and memory allocation mentioned above appear.


(Also, I recall Bruce recently fixed a problem in nfsdcltracker
where it was doing three fsyncs for every database update, which
significantly slowed it down. You should look for that fix in
nfs-utils and ensure the above rate measurement is done with the
fix applied.)

will do.



This patch addresses this problem by:

   . removing the fixed 1-day idle time limit for courtesy clients.
     A courtesy client is now allowed to remain valid as long as
     available system memory is above 80%.

   . when available system memory drops below 80%, the laundromat
     starts trimming older courtesy clients. The number of courtesy
     clients to trim is a percentage of the total number of courtesy
     clients in the system, computed from the current percentage of
     available system memory.

   . the percentage of courtesy clients to be trimmed is based on
     this table (see the sketch after the table):

     ----------------------------------
     |  % memory | % courtesy clients |
     | available |    to trim         |
     ----------------------------------
     |  > 80     |      0             |
     |  > 70     |     10             |
     |  > 60     |     20             |
     |  > 50     |     40             |
     |  > 40     |     60             |
     |  > 30     |     80             |
     |  <= 30    |    100             |
     ----------------------------------
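
For illustration, the ladder above maps naturally to a small lookup
function. This is a userspace sketch of the proposed policy only; the
function and variable names are hypothetical.

/*
 * Map the percentage of available system memory to the percentage
 * of courtesy clients to trim, per the table above.
 */
#include <stdio.h>

static unsigned int courtesy_trim_pct(unsigned int mem_avail_pct)
{
	if (mem_avail_pct > 80)
		return 0;
	if (mem_avail_pct > 70)
		return 10;
	if (mem_avail_pct > 60)
		return 20;
	if (mem_avail_pct > 50)
		return 40;
	if (mem_avail_pct > 40)
		return 60;
	if (mem_avail_pct > 30)
		return 80;
	return 100;	/* <= 30% available: trim all courtesy clients */
}

int main(void)
{
	unsigned int pcts[] = { 90, 75, 55, 35, 20 };

	for (unsigned int i = 0; i < sizeof(pcts) / sizeof(pcts[0]); i++)
		printf("%u%% available -> trim %u%%\n",
		       pcts[i], courtesy_trim_pct(pcts[i]));
	return 0;
}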
"80% available memory" on a big system means there's still an
enormous amount of free memory on that system. It will be
surprising to administrators on those systems if the laundromat
is harvesting courtesy clients at that point.

At 80% and above, there is no harvesting going on.


Also, if a server is at 60-70% free memory all the time due to
non-NFSD-related memory consumption, would that mean that the
laundromat would always trim courtesy clients, even though doing
so would not be needed or beneficial?

It's true that there is no benefit in harvesting courtesy clients
at 60-70% if available memory stays in this range. But we don't
know whether available memory will stay in this range or continue
to drop (as in my test case with CID5). Shouldn't we start
harvesting some of the courtesy clients at this point to be on
the safe side?


I don't think we can use a fixed percentage ladder like this;
it might make sense for the CID5 test (or to stop other types of
inadvertent or malicious DoS attacks), but the common-case
steady-state behavior doesn't seem very good.

I'm looking for suggestions for a better solution to this problem.


I don't recall, are courtesy clients maintained on an LRU so
that the oldest ones would be harvested first?

Courtesy clients and 'normal' clients are on the same LRU list,
so the oldest ones would be harvested first.

This mechanism seems to harvest at random?

I'm not sure what you mean here?
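
For illustration, here is a minimal userspace model of that selection
order: walk the LRU from its head (the least recently used entry),
skip active clients, and harvest courtesy clients oldest-first up to
a per-run cap, so the choice is age-ordered rather than random. The
structures and names are hypothetical.

#include <stdbool.h>
#include <stdio.h>

struct client {
	int id;
	bool courtesy;	/* true once the client's lease has expired */
};

int main(void)
{
	/* Index 0 is the LRU head, i.e. the oldest client. */
	struct client lru[] = {
		{ 1, true }, { 2, false }, { 3, true },
		{ 4, true }, { 5, false },
	};
	const unsigned int cap = 2;	/* per-run harvest limit */
	unsigned int harvested = 0;

	for (unsigned int i = 0;
	     i < sizeof(lru) / sizeof(lru[0]) && harvested < cap; i++) {
		if (!lru[i].courtesy)
			continue;	/* never touch active clients */
		printf("harvest courtesy client %d (oldest first)\n",
		       lru[i].id);
		harvested++;
	}
	return 0;
}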



   . due to the overhead associated with removing client records,
     at most 128 clients are trimmed in each laundromat run. This
     prevents the laundromat from spending so long destroying
     clients that it misses performing its other tasks in a timely
     manner.

   . the laundromat is scheduled to run sooner if there are more
     courtesy clients that need to be destroyed.
Both of these last two changes seem sensible. Can they be
broken out so they can be applied immediately?

Yes. Do you want me to rework the patch just to have these two
changes for now while we continue to look for a better solution
than the proposed fixed percentage?
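
For illustration, those two changes (the 128-client cap per run and
rescheduling the laundromat sooner when a backlog remains) could be
modeled in userspace like this. The names and timeout values are
hypothetical, not the patch's actual constants.

#include <stdio.h>

#define MAX_REAP_PER_RUN	128
#define NORMAL_TIMEOUT		90	/* seconds, routine cadence */
#define SHORT_TIMEOUT		5	/* seconds, backlog remains */

/* One laundromat pass; returns the delay before the next run. */
static unsigned int laundromat_pass(unsigned int *backlog)
{
	unsigned int reap = *backlog < MAX_REAP_PER_RUN ?
			    *backlog : MAX_REAP_PER_RUN;

	*backlog -= reap;
	printf("destroyed %u clients, %u left\n", reap, *backlog);
	return *backlog ? SHORT_TIMEOUT : NORMAL_TIMEOUT;
}

int main(void)
{
	unsigned int backlog = 300;	/* clients awaiting destruction */

	while (backlog)
		printf("next run in %us\n", laundromat_pass(&backlog));
	return 0;
}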

Thanks,
-Dai


--
Chuck Lever





