On 7/15/24 05:50, Daire Byrne wrote:
> On Sun, 14 Jul 2024 at 05:12, Mike Baynton <mike@xxxxxxxxxxxx> wrote:
>>
>> On 7/12/24 07:04, Daire Byrne wrote:
>>> Yea, so I have also toyed with the "composefs" idea
>>
>> Yeah, I'm doing what they're doing but making the EROFS in-house and hoping the kinda-writable NFS twist isn't an issue. I only need to satisfy dependencies for a container's worth of software at a time, and I can determine all the dependencies I need by virtue of tooling in the software ecosystems I need to support.
>
> Yea, I need to check out EROFS at some point. But many of our desktop kernels are just too old atm.
>
> Overall the idea of hand crafting metadata-only overlays is compelling because you can avoid the complexity (and confusion) of using symlinks, and it's extremely lightweight (maybe even more so with EROFS).
>
>>> I guess the difference is that I'm not trying to replicate the entirety of the metadata, I just want to tweak bits of it and still avail of the overlay merged directories to fall through to the directory tree and data underneath for everything else.
>>
>> Yeah, I understand your objective now. I'm mildly curious why NFS + fscache doesn't solve the negative lookups case for you, given that you want a dynamically generated local cache. Is fscache just unable to cache negative lookups, and you want it to persist for weeks?
>
> Well, fscache is for caching the data contained in (existing) files only, right? It makes no attempt to deal with the metadata (e.g. directories)?
>
> Or at least I don't know how effective a disk based cache of metadata could be compared to the vfs page cache when you still need to revalidate fairly often. I mean, it needs to revalidate the file often (actimeo?) before it can serve the cached copy, so it needs the remote metadata lookups anyway?

Yeah, never mind my question. I'm not sure whether NFS uses fscache to cache negative lookups, but with search PATHs as long as yours I think you'd get a combinatorial explosion of files * paths over long periods and it would get out of hand. I've been setting a super long actimeo since I know my NFS files "by design" aren't changing. (We even write them out to locations the clients aren't traversing and then rename them into place.)

>
> I have seen some talk (David Howells) about giving network filesystems like NFS the ability to have "disconnected" access via netfs/fscache (a la AFS), but I don't know if that is still on the cards.
>
> The issue we see is that not only do our batch systems cycle through lots of different software versions per hour, but we have many thousands of clients doing the same to a single (Netapp) software volume. Even if each NFS client managed to cache 80% of the negative lookups between runs, the 20% that hits the Netapp is still quite significant from many clients.
>
> And even forgiving the server load implications, a Netapp on the LAN (0.2ms) can add delay when you deal with many "pointless" negative lookups. I have seen some of our software do 100,000 negative lookups across 250 lib dirs, which, although on paper should only be 100000 * 0.2ms = 20 seconds, actually adds almost a minute to the startup time of the software. Certainly when we use a local filesystem overlay the time drops by a minute anyway (most likely because the actual file opens benefit too). Now if the software only runs for 2 minutes, then 1/3 of its time is spent doing negative lookups/path walking at startup.
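That miss storm is easy to put rough numbers on. Here's a small sketch (the dirs, library names, and counts below are made up, not your real layout) of how ~400 dependencies probed across 250 search dirs gets you to the 100,000 negative lookups and the ~20 second naive estimate; the rest of your minute is presumably the per-component revalidation traffic plus the actual opens, as you say:

```python
#!/usr/bin/env python3
# Rough sketch, hypothetical paths/counts: estimate the cost of loader-style
# search-path probing where almost every probe is a negative lookup.
import os
import time

SEARCH_DIRS = [f"/mnt/nfs/sw/lib{i}" for i in range(250)]  # hypothetical lib dirs
NEEDED_LIBS = [f"libdep{i}.so" for i in range(400)]        # hypothetical dependencies
RTT = 0.0002                                               # 0.2ms per lookup on the LAN

hits = misses = 0
start = time.monotonic()
for lib in NEEDED_LIBS:
    for d in SEARCH_DIRS:
        # A loader walks the search path in order; every miss is an ENOENT
        # the NFS client may have to ask the server about.
        if os.path.exists(os.path.join(d, lib)):
            hits += 1
            break
        misses += 1
wall = time.monotonic() - start

print(f"{hits} found, {misses} negative lookups")
print(f"naive estimate: {misses * RTT:.0f}s at {RTT * 1000:.1f}ms each; measured here: {wall:.1f}s")
```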
>
> Yes, I am aware that our software is not well optimised, but our build system and environment is what it is at this point.
>
> I have seen many other novel solutions to this general problem - some examples:
>
> https://guix.gnu.org/en/blog/2021/taming-the-stat-storm-with-a-loader-cache
> https://computing.llnl.gov/projects/spindle
>
>> Also (only semi-related), since you have a large NFS deployment similar to the one I'm putting together, in terms of being read-only to normal clients and most files/paths being immutable after they first appear, I'd be interested in any experiences you've had in practice with the performance of fscache and the NFS mount options that relax its cache coherence / atomicity semantics. I've found it impossible to avoid round trips to the server on each fopen for locally cached files (unless using NFSv4 delegation, which is overkill and not available in my environment). These RPC round trips provide no real benefit to our use case but can add seconds of delay to initializing a process if it accesses thousands of little interpreted language files.
>
> In my experience actimeo>3600 can help for these kinds of read-only filesystems, but you probably need "nocto" to really get it down to almost no repeat network traffic at all (when cached). Setting vm.vfs_cache_pressure=1 might also help keep the nfs inode data in memory longer too?
>
> But nocto will also cache the "ls -l" case, whereby you won't see new entries. However, if you know a new dir/file is there and access it, it will do the new lookup and find it (dirent not in cache yet). That might work for your case by the sounds of it?

My issue has been that I can set all the options there are to relax cache coherency, on a test machine with plenty of memory to cache, including actimeo and nocto, and I still get some RPCs per open(). Our cloud provider also has worse latency than your 0.2ms.

>
> I'm not too sure about how that affects opens specifically though. In fact, using NFSv3 might be more "relaxed" in this regard than NFSv4?

Brilliant! I had only tried 4.0 and 4.1. I just tried with NFSv3 and can get down to zero packets over the network easily. Thanks! :)

I think the 4.x versions really want you to use delegation, and if you do, you can get to zero packets over the network for locally cached files, but if you don't, you get an OPEN and CLOSE RPC per file open()ed no matter what. I don't really want to use delegation because it's an excessively complex mechanism for "treat this filesystem as read-only": I fear it would give slow individual client machines too much authority to temporarily limit availability of delegated files, and it's not available in my cloud provider's hosted NFS offering anyway.
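For anyone who wants to double check this sort of thing on their own mounts, this is roughly how I've been convincing myself it really is zero packets: diff the per-op counters in /proc/self/mountstats around an open of an already-cached file. A minimal sketch (the mount point and test file below are hypothetical):

```python
#!/usr/bin/env python3
# Minimal sketch: diff the per-op NFS RPC counters in /proc/self/mountstats
# around an open()+read() of an already-cached file, to see whether the open
# generated any traffic. The mount point and test file are hypothetical.
import re

MOUNTPOINT = "/mnt/nfs/sw"                 # hypothetical NFS mount
TESTFILE = MOUNTPOINT + "/lib/libfoo.so"   # hypothetical file already in cache

def op_counts(mountpoint):
    """Return {op_name: call_count} for the given NFS mount."""
    counts, in_mount = {}, False
    with open("/proc/self/mountstats") as f:
        for line in f:
            if line.startswith("device "):
                in_mount = f" mounted on {mountpoint} " in line
            elif in_mount:
                m = re.match(r"\s*([A-Z_]+):\s+(\d+)", line)
                if m:
                    counts[m.group(1)] = int(m.group(2))
    return counts

before = op_counts(MOUNTPOINT)
with open(TESTFILE, "rb") as f:
    f.read(4096)
after = op_counts(MOUNTPOINT)

changed = {op: after[op] - before.get(op, 0) for op in after if after[op] != before.get(op, 0)}
print(changed or "no RPCs issued for that open")
```

In my tests, v3 with nocto and a long actimeo comes back empty for warm files, while 4.0/4.1 without a delegation always shows at least the OPEN/CLOSE pair.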
>
> In general, our entire pipeline deals with unique versioned files. Apart from home directories, I can't think of many places where we overwrite or append to existing files for production workloads, or reuse file paths in any way.
>
>> Not an overlayfs concern in any way though, so perhaps no need to pollute the mailing list further; if you are interested in responding to me on these things, continuing off list would be fine with me too.
>>
>>>> I think Daire and I are basically only adding new files to the NFS filesystem, and both the all-opaque approach and the data-only approach could prevent accidental access to things on the NFS filesystem through the overlayfs (or at least the portion of it meant for end-user consumption) while they are still being birthed and might be experiencing changes. At some point in the NFS tree, directories must be modified, but since both approaches have overlayfs sourcing all directory entries from local metadata-only layers, it seems plausible that the directories that change aren't really "accessed by an overlayfs prior to the change."
>>>
>>> Yes, I think your case has a good chance of being safe and becoming well defined behaviour.
>>>
>>> But my idea was still very much relying on using the majority of the lower layer as is. And for all the reasons given, I suspect my use case is still a no-no.
>>
>> I dunno, your thing might end up working out fine, based on your latest testing of when clients see changes and Amir's observation that all fds need to be closed, but then a readdir through an overlayfs will observe changes. Seems "unlikely" that clients would hold open fds to the first few levels of directories at all, never mind for long enough for someone to call you and ask where the new version is :)
>
> Yea, I think it is probably fine. Maybe another clarification for the docs that others might find useful too?
>
> Daire
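For completeness (and maybe as fodder for that docs clarification), the arrangement I keep describing looks roughly like the sketch below: a small local metadata-only layer (in my case a loop-mounted EROFS image) supplies every name and directory entry, and the NFS volume is attached purely as a data-only lower layer. The paths are made up and it assumes a kernel with overlayfs data-only lowerdir support, so please treat it as a sketch of the idea rather than authoritative syntax (Documentation/filesystems/overlayfs.rst has the details):

```python
#!/usr/bin/env python3
# Sketch only, hypothetical paths: mount an overlayfs where a local
# metadata-only layer provides all names and metadata, and the NFS export is a
# data-only lower layer (the "::" separated entry), so clients never walk the
# NFS directory tree directly. Assumes data-only lowerdir support and
# metacopy=on; check Documentation/filesystems/overlayfs.rst for the exact
# option spelling on your kernel.
import subprocess

META = "/srv/overlay/meta"   # hypothetical: loop-mounted EROFS metadata-only image
DATA = "/mnt/nfs/sw"         # hypothetical: the read-only NFS software volume
MERGED = "/run/sw"           # hypothetical: the merged tree handed to containers

# Files in META are expected to carry overlay metacopy/redirect xattrs
# pointing at their payload paths under DATA (what composefs-style tooling
# generates when it builds the metadata image).
opts = f"ro,metacopy=on,lowerdir={META}::{DATA}"
subprocess.run(
    ["mount", "-t", "overlay", "overlay", "-o", opts, MERGED],
    check=True,
)
```

With that shape, readdirs and lookups are always answered by the local layer, and the NFS side only ever sees opens and reads of the payload files.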