On 5 Dec 2022, at 21:18, Theodor Mittermair wrote:

> Hello,

Hi Theodor,

.. snip ..

> From what i gathered around the internet and understood, there seem to be
> heuristics involved when the client decides what operations to transmit to
> the server. Also, the timed-out cache seems to be creating what some
> called a "getattr storm", which i understand in theory.

When `du` gathers information, it does so by switching between two
syscalls: getdents() and stat() (or their equivalents). The getdents()
syscall causes the NFS client to perform either READDIR or READDIRPLUS,
and the choice between them is governed by a heuristic. The heuristic can
only make an intelligent choice by detecting whether the program is
performing this pattern: getdents(), stat(), stat(), stat(), getdents(),
stat(), stat(), stat(). The way it can tell is by checking whether each
inode's attributes have been cached, so the cache timeouts end up coming
into play.

> But why does the first request manage to be smarter about it, since it
> gathers the same information about the exact same files?

It's not smarter, it just optimistically uses READDIRPLUS on the very
first call of getdents() for a directory, but it can only do so if the
directory's dentries have not yet been cached. If they /are/ cached, but
each dentry's individual attributes have timed out, then the client must
send an individual GETATTR for each entry.

What is happening for you is that your attribute caches for each inode
are timing out, but the overall directory's dentry list is not changing.
There's no need to send /any/ readdir operations - so the heuristic
doesn't send READDIRPLUS, and you end up with a full pile of individual
GETATTRs for every entry in the getdents() results. If your server is
returning a large dtpref (the preferred data transfer size for readdir),
and there's some latency on round-trip operations, you'll see this stack
up quickly into exactly the results you've presented.

There's a patch that may go into v6.2 to help this:

https://lore.kernel.org/linux-nfs/20220920170021.1391560-1-bcodding@xxxxxxxxxx/

.. if you have the ability to test it in your setup, I'd be interested in
the results.

This heuristic's behavior is becoming harder to change, because over time
a lot of setups have come to depend on certain performance
characteristics, and changes in this area create unexpected performance
regressions.

> I would be happy if i could maintain the initial-non-cached time (in the
> examples above 1.5 seconds) but none of
> "noac","lookupcache=none","actimeo=0" would let me achieve that seemingly.
>
> Is there a way to improve that situation, and if so, how?

Hopefully, the above patch will help. We've all had wild ideas: maybe we
should only do uncached readdir if lookupcache=none? It's a bit
surprising that you'd opt to forego all caching just to optimize this
`du` case. I don't think that's what you want, as it will negatively
impact other workloads.

I also think that if you were to dump all the directories' page caches in
between your calls to `du`, you'd get performance consistent with your
first pass.. something like posix_fadvise() with POSIX_FADV_DONTNEED, but
I'd be leery of depending on that behavior, since it's only a hint.

I also wonder if glibc might be willing to check a hint (like an
environment variable?) about how big a buffer to send to getdents(),
since I suspect that might also be nice for some fuse filesystems.

Ben
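
For illustration, here's roughly the syscall pattern described above as a
standalone sketch (not du itself; the path is a placeholder and error
handling is pared down):

/* Sketch of the getdents()/stat() pattern the readdir heuristic watches
 * for: one getdents64() batch, then a stat of each returned entry. */
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct linux_dirent64 {
        ino64_t        d_ino;
        off64_t        d_off;
        unsigned short d_reclen;
        unsigned char  d_type;
        char           d_name[];
};

int main(void)
{
        char buf[32768];        /* arbitrary; glibc picks its own size */
        int fd = open("/mnt/nfs/dir", O_RDONLY | O_DIRECTORY);
        if (fd < 0)
                return 1;

        for (;;) {
                /* getdents64() -> READDIR or READDIRPLUS on the wire */
                long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
                if (n <= 0)
                        break;

                for (long off = 0; off < n; ) {
                        struct linux_dirent64 *d = (void *)(buf + off);
                        struct stat st;

                        /* stat of each entry -> GETATTR, unless the
                         * entry's attributes are still cached */
                        fstatat(fd, d->d_name, &st, AT_SYMLINK_NOFOLLOW);
                        off += d->d_reclen;
                }
        }
        close(fd);
        return 0;
}

Run against an NFS mount under a packet capture, something like this
should show the first pass using READDIRPLUS and later passes degrading
into the per-entry GETATTRs once the attribute caches time out.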
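
And a rough sketch of the fadvise() idea, with the same caveat that it's
only advice the kernel is free to ignore (the helper name and path
handling are just for illustration):

/* Ask the kernel to drop a directory's cached pages so the next
 * getdents() starts uncached and the client can use READDIRPLUS again.
 * Untested sketch; POSIX_FADV_DONTNEED is only a hint. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int drop_dir_cache(const char *path)
{
        int fd = open(path, O_RDONLY | O_DIRECTORY);
        if (fd < 0)
                return -1;

        /* advise that the cached pages for this directory (which hold
         * the NFS readdir results) are no longer needed */
        int ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        close(fd);
        return ret;
}

Calling something like that on each directory between your `du` runs
would, in principle, get you back to the uncached first-pass behavior,
but again I wouldn't build anything that depends on it.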