Re: overlayfs: NFS lowerdir changes & opaque negative lookups

On Sun, 14 Jul 2024 at 05:12, Mike Baynton <mike@xxxxxxxxxxxx> wrote:
>
> On 7/12/24 07:04, Daire Byrne wrote:
> > Yea, so I have also toyed with the "composefs" idea
>
> Yeah, I'm doing what they're doing but making the EROFS in-house and
> hoping the kinda-writable NFS twist isn't an issue. I only need to
> satisfy dependencies for a container's worth of software at a time and I
> can determine all the dependencies I need by virtue of tooling in the
> software ecosystems I need to support.

Yea, I need to check out EROFS at some point. But many of our desktop
kernels are just too old atm.

Overall the idea of hand crafting metadata-only overlays is compelling
because you can avoid the complexity (and confusion) of using symlinks
and it's extremely lightweight (maybe even more so with EROFS).

> > I guess the difference is that I'm not trying to replicate the
> > entirety of the metadata, I just want to tweak bits of it and still
> > avail of the overlay merged directories to fall through to the
> > directory tree and data underneath for everything else.
>
> Yeah I understand your objective now. I'm mildly curious why NFS +
> fscache doesn't solve the negative lookups case for you given that you
> want a dynamically generated local cache. Is fscache just unable to
> cache negative lookups, and you want it to persist for weeks?

Well, fscache is for caching the data contained in (existing) files
only, right? It makes no attempt to deal with the metadata (e.g.
directories)?

Or at least I don't know how effective a disk-based cache of metadata
could be compared to the VFS page/dentry caches when you still need to
revalidate fairly often. The client has to revalidate a file's
attributes (actimeo) before it can serve the cached copy, so the
remote metadata lookups happen anyway?

I have seen some talk (from David Howells) about giving network
filesystems like NFS the ability to do "disconnected" access via
netfs/fscache (a la AFS), but I don't know if that is still on the
cards.

The issue we see is that not only do our batch systems cycle through
lots of different software versions per hour, but we also have many
thousands of clients doing the same against a single (Netapp) software
volume. Even if each NFS client managed to cache 80% of the negative
lookups between runs, the 20% that still hits the Netapp adds up to
significant load when multiplied across that many clients.

And even setting aside the server load implications, a Netapp on the
LAN (0.2ms RTT) can add real delay when you deal with many "pointless"
negative lookups. I have seen some of our software do 100,000 negative
lookups across 250 lib dirs, which on paper should only cost 100,000 *
0.2ms = 20 seconds, but actually adds almost a minute to the startup
time of the software. Certainly when we use a local filesystem overlay
the time drops by about a minute (most likely because the actual file
opens benefit too). So if the software's real work only takes 2
minutes, then roughly a third of its wall time is spent doing negative
lookups/path walking at startup.
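
For what it's worth, the pattern is easy to reproduce with a loop that
stats names that don't exist across a few hundred lib dirs. A rough
sketch (the mount point, directory layout and counts are made up for
illustration, not our real setup):

  import os
  import time

  # Hypothetical NFS-mounted library search path: 250 lib dirs, each
  # probed for 400 names that don't exist = 100,000 negative lookups.
  LIB_DIRS = [f"/mnt/software/libs/dir{i:03d}" for i in range(250)]
  MISSING = [f"libdep{i}.so" for i in range(400)]

  start = time.monotonic()
  misses = 0
  for d in LIB_DIRS:
      for name in MISSING:
          try:
              os.stat(os.path.join(d, name))
          except FileNotFoundError:
              misses += 1
  elapsed = time.monotonic() - start

  # On paper 100,000 * 0.2ms is ~20s of round trips; the serial path
  # walking in practice makes the observed wall time noticeably worse.
  print(f"{misses} negative lookups in {elapsed:.1f}s")

Run it once against the NFS mount and once against a local overlay of
the same tree and the difference is roughly the startup penalty
described above.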

Yes, I am aware that our software is not well optimised, but our
build system and environment are what they are at this point.

I have seen many other novel solutions to this general problem - some examples:

https://guix.gnu.org/en/blog/2021/taming-the-stat-storm-with-a-loader-cache
https://computing.llnl.gov/projects/spindle

> Also (only semi-related) since you have a large NFS deployment similar
> to the one I'm putting together in terms of read-only to normal clients
> and most files/paths being immutable after they first appear, I'd be
> interested in any experiences you've had in practice with performance of
> fscache and NFS mount options that relax its cache coherence / atomicity
> semantics. I've found it impossible to avoid roundtrips to the server on
> each fopen for locally cached files (unless using NFS4 delegation which
> is overkill and not available in my environment.) These RPC roundtrips
> provide no real benefit to our use case but can add seconds of delay to
> initializing a process if it accesses thousands of little interpreted
> language files.

In my experience actimeo>3600 can help for these kinds of read-only
filesystems, but you probably need "nocto" to really get it down to
almost no repeat network traffic at all (once cached). Setting
vm.vfs_cache_pressure=1 might also help keep the NFS inode/dentry data
in memory for longer too?

But nocto will also cache the "ls -l" case, so you won't see new
entries in a cached directory listing. However, if you know a new
dir/file is there and access it by its full path, the client will
still do a fresh lookup and find it (that dirent isn't in the cache
yet). That might work for your case by the sounds of it?
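
A trivial way to picture that distinction (the mount point and path
below are hypothetical):

  import os

  MOUNT = "/mnt/software"                  # nocto,actimeo=3600 mount
  NEW = os.path.join(MOUNT, "tool-1.2.3")  # known to exist on server

  # A cached directory listing may not show the new entry yet...
  print("tool-1.2.3" in os.listdir(MOUNT))

  # ...but an explicit lookup of the full path is not satisfied by the
  # cached listing, so the client asks the server and finds it.
  print(os.path.exists(NEW))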

I'm not too sure how that affects opens specifically though. In fact,
using NFSv3 might be more "relaxed" in this regard than NFSv4?
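
If you want to see what the opens actually cost in RPCs, one way is to
diff the per-op counters in /proc/self/mountstats around a batch of
opens, once with the default options and once with nocto/actimeo. A
rough sketch of that (the mount point and file list are placeholders,
not a real setup):

  import os

  def rpc_counts(mountpoint, stats="/proc/self/mountstats"):
      """Per-op NFS RPC counts for one mount, parsed from mountstats."""
      counts, in_section = {}, False
      with open(stats) as f:
          for line in f:
              if line.startswith("device "):
                  in_section = f" mounted on {mountpoint} with " in line
              elif in_section and ":" in line:
                  op, _, rest = line.strip().partition(":")
                  fields = rest.split()
                  # per-op lines look like "GETATTR: 211 211 0 ..."
                  if op.isupper() and fields and fields[0].isdigit():
                      counts[op] = int(fields[0])
      return counts

  MOUNT = "/mnt/software"   # placeholder mount point
  # Placeholder set of small files opened at startup:
  FILES = [f"{MOUNT}/app/module{i}.py" for i in range(1000)]

  before = rpc_counts(MOUNT)
  for path in FILES:
      with open(path, "rb") as f:
          f.read(1)
  after = rpc_counts(MOUNT)

  for op in ("LOOKUP", "GETATTR", "ACCESS", "OPEN", "READ"):
      print(op, after.get(op, 0) - before.get(op, 0))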

In general, our entire pipeline deals with unique versioned files.
Apart from home directories, I can't think of many places where we
overwrite or append to existing files for production workloads or
reuse file paths in any way.

> Not an overlayfs concern in any way though so perhaps no need to pollute
> the mailing list further; if you are interested in responding to me on
> these things continuing off list would be fine with me too.
>
> >> I think Daire and I are basically only adding new files to the NFS
> >> filesystem, and both the all-opaque approach and the data-only approach
> >> could prevent accidental access to things on the NFS filesystem through
> >> the overlayfs (or at least portion of it meant for end-user consumption)
> >> while they are still being birthed and might be experiencing changes.
> >> At some point in the NFS tree, directories must be modified, but since
> >> both approaches have overlayfs sourcing all directory entries from local
> >> metadata-only layers, it seems plausible that the directories that
> >> change aren't really "accessed by a overlayfs prior to the change."
> >
> > Yes, I think your case has a good chance of being safe and becoming
> > well defined behaviour.
> >
> > But my idea was still very much relying on using the majority of the
> > lower layer as is. And for all the reasons given, I suspect my use
> > case is still a no-no.
>
> I dunno, your thing might end up working out fine, based on your latest
> testing of when clients see changes and Amir's observation that all fds
> need to be closed but then a readdir through an overlayfs will observe
> changes. Seems "unlikely" that clients would hold open fds to the first
> few levels of directories at all, never mind for long enough for someone
> to call you and ask where the new version is :)

Yea, I think it is probably fine. Maybe that is worth another
clarification in the docs that others might find useful too?

Daire



