On 5 Dec 2022, at 21:18, Theodor Mittermair wrote:

> Hello,

Hi Theodor,

.. snip ..

> From what i gathered around the internet and understood, there seem to be
> heuristics involved when the client decides what operations to transmit to
> the server. Also, the timed-out cache seems to be creating what some
> called a "getattr storm", which i understand in theory.

When `du` gathers information, it does so by switching between two
syscalls: getdents() and stat() (or their equivalents). The getdents()
syscall causes the NFS client to perform either READDIR or READDIRPLUS,
and the choice between them is governed by a heuristic. The heuristic can
only make an intelligent choice by detecting whether the program is
performing this pattern: getdents(), stat(), stat(), stat(), getdents(),
stat(), stat(), stat(). The way it can tell is by checking whether each
inode's attributes have been cached, so the cache timeouts end up coming
into play.

> But why does the first request manage to be smarter about it, since it
> gathers the same information about the exact same files?

It's not smarter, it just optimistically uses READDIRPLUS on the very
first call of getdents() for a directory, but it can only do so if the
directory's dentries have not yet been cached. If they /are/ cached, but
each dentry's individual attributes have timed out, then the client must
send an individual GETATTR for each entry.

What is happening for you is that your attribute caches for each inode
are timing out, but the overall directory's dentry list is not changing.
There's no need to send /any/ readdir operations - so the heuristic
doesn't send READDIRPLUS, and you end up with a full pile of individual
GETATTRs for every entry in the getdents() results. If your server is
returning a large dtpref (the preferred data transfer size for readdir),
and there's some latency on round-trip operations, you'll see this stack
up quickly into exactly the results you've presented.

There's a patch that may go into v6.2 to help this:

https://lore.kernel.org/linux-nfs/20220920170021.1391560-1-bcodding@xxxxxxxxxx/

.. if you have the ability to test it in your setup, I'd be interested in
the results.

This heuristic's behavior is becoming harder to change, because over time
a lot of setups have come to depend on certain performance
characteristics, and changes in this area create unexpected performance
regressions.

> I would be happy if i could maintain the initial-non-cached time (in the
> examples above 1.5 seconds) but none of
> "noac","lookupcache=none","actimeo=0" would let me achieve that seemingly.
>
> Is there a way to improve that situation, and if so, how?

Hopefully, the above patch will help. We've all had wild ideas: maybe we
should only do uncached readdir if lookupcache=none? It's a bit
surprising that you'd opt to forego all caching just to optimize this
`du` case. I don't think that's what you want, as it will negatively
impact other workloads.

I also think that if you were to dump all the directories' page caches in
between your calls to `du`, you'd get performance consistent with your
first pass.. something like posix_fadvise() with POSIX_FADV_DONTNEED, but
I'd be leery of depending on that behavior, since it's only a hint.

I also wonder if glibc might be willing to check a hint (like an
environment variable?) about how big a buffer to send to getdents(),
since I suspect that might also be nice for some fuse filesystems.

Ben
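
For illustration, here's roughly the syscall pattern described above as a
standalone sketch (not du itself; the path is a placeholder and error
handling is pared down):

/* Sketch of the getdents()/stat() pattern the readdir heuristic watches
 * for: one getdents64() batch, then a stat of each returned entry. */
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct linux_dirent64 {
        ino64_t        d_ino;
        off64_t        d_off;
        unsigned short d_reclen;
        unsigned char  d_type;
        char           d_name[];
};

int main(void)
{
        char buf[32768];        /* arbitrary; glibc picks its own size */
        int fd = open("/mnt/nfs/dir", O_RDONLY | O_DIRECTORY);
        if (fd < 0)
                return 1;

        for (;;) {
                /* getdents64() -> READDIR or READDIRPLUS on the wire */
                long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
                if (n <= 0)
                        break;

                for (long off = 0; off < n; ) {
                        struct linux_dirent64 *d = (void *)(buf + off);
                        struct stat st;

                        /* stat of each entry -> GETATTR, unless the
                         * entry's attributes are still cached */
                        fstatat(fd, d->d_name, &st, AT_SYMLINK_NOFOLLOW);
                        off += d->d_reclen;
                }
        }
        close(fd);
        return 0;
}

Run against an NFS mount under a packet capture, something like this
should show the first pass using READDIRPLUS and later passes degrading
into the per-entry GETATTRs once the attribute caches time out.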
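
And a rough sketch of the fadvise() idea, with the same caveat that it's
only advice the kernel is free to ignore (the helper name and path
handling are just for illustration):

/* Ask the kernel to drop a directory's cached pages so the next
 * getdents() starts uncached and the client can use READDIRPLUS again.
 * Untested sketch; POSIX_FADV_DONTNEED is only a hint. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int drop_dir_cache(const char *path)
{
        int fd = open(path, O_RDONLY | O_DIRECTORY);
        if (fd < 0)
                return -1;

        /* advise that the cached pages for this directory (which hold
         * the NFS readdir results) are no longer needed */
        int ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        close(fd);
        return ret;
}

Calling something like that on each directory between your `du` runs
would, in principle, get you back to the uncached first-pass behavior,
but again I wouldn't build anything that depends on it.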