Re: overlayfs: NFS lowerdir changes & opaque negative lookups

Mike Baynton <mike@xxxxxxxxxxxx> · Thu, 11 Jul 2024 22:24:54 -0500

On 7/11/24 18:30, Amir Goldstein wrote:
> On Thu, Jul 11, 2024 at 6:59 PM Daire Byrne <daire@xxxxxxxx> wrote:
>> Basically I have a read-only NFS filesystem with software releases
>> that are versioned such that no files are ever overwritten or changed.
>> New uniquely named directory trees and files are added from time to
>> time and older ones are cleaned up.
>>
> 
> Sounds like a common use case that many people are interested in.

I can vouch that that's accurate, I'm doing nearly the same thing. The
properties of the NFS filesystem in terms of what is and is not expected
to change is identical for me, though my approach to incorporating
overlayfs has been a little different.

My confidence in the reliability of what I'm doing is still far from
absolute, so I will be interested in efforts to validate/officially
sanction/support/document related techniques.

The way I am doing it is with NFS as a data-only layer. Basically my use
case calls for presenting different views of NFS-backed data (it's
software libraries) to different applications. No application wants or
needs to have the entire NFS tree exposed to it, but each application
wants to use some data available on NFS and wants it to be presented in
some particular local place. So I actually wanted a method where I
author a metadata-only layer external to overlayfs, built to spec.

Essentially, I'm using overlayfs redirects as my symlinks, so that code
which doesn't follow symlinks, or is otherwise influenced by them, is
none the wiser.
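A minimal sketch of building such a metadata-only layer, in Python. It assumes the default "trusted.overlay." xattr namespace (no userxattr), that each stub carries an empty metacopy xattr plus an absolute redirect into the data-only layer, and that each stub is a sparse file with the real file's size and mode. MetaStub, plan_stubs and write_stubs are hypothetical names, and writing trusted.* xattrs needs CAP_SYS_ADMIN:

```python
import os
from dataclasses import dataclass

# overlayfs xattr names in the default "trusted." namespace (assumes the
# overlay is NOT mounted with "userxattr"; setting these needs CAP_SYS_ADMIN)
METACOPY_XATTR = "trusted.overlay.metacopy"
REDIRECT_XATTR = "trusted.overlay.redirect"


@dataclass(frozen=True)
class MetaStub:
    view_path: str  # where the file should appear, relative to the meta layer root
    redirect: str   # absolute path of the real file, relative to the data-only layer root
    size: int       # st_size copied from the real file
    mode: int       # permission bits copied from the real file


def plan_stubs(data_root, view):
    """view maps presentation paths to paths inside the data-only layer."""
    stubs = []
    for view_path, data_path in sorted(view.items()):
        rel = data_path.strip("/")
        st = os.stat(os.path.join(data_root, rel))  # only metadata is read from NFS
        stubs.append(MetaStub(view_path.strip("/"), "/" + rel,
                              st.st_size, st.st_mode & 0o7777))
    return stubs


def write_stubs(meta_root, stubs):
    """Materialize the plan as metadata-only inodes (requires root)."""
    for s in stubs:
        dst = os.path.join(meta_root, s.view_path)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        with open(dst, "wb") as f:
            f.truncate(s.size)  # sparse file: right size, no data blocks
        os.chmod(dst, s.mode)
        os.setxattr(dst, METACOPY_XATTR, b"")  # empty value: metacopy without digest
        os.setxattr(dst, REDIRECT_XATTR, s.redirect.encode())
```

Keeping the unprivileged planning step separate from the privileged write step makes it easy to review a view before materializing it.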

>> My first question is how bad can the "undefined behaviour" be in this
>> kind of setup?
> 
> The behavior is "undefined" because nobody tried to define it,
> document it and test it.
> I don't think it would be that "bad", but it will be unpredictable
> and is not very nice for a software product.
> 
> One of the current problems is that overlayfs uses a readdir cache,
> and the readdir cache is not automatically invalidated when a lower
> dir changes, so whether or not new subdirs are observed in the overlay
> depends on whether the merged overlay directory is kept in cache or not.
>

My approach doesn't support adding new files from the data-only NFS
layer after the overlayfs is created, of course, since the metadata-only
layer is itself the first lower layer and so would presumably get into
undefined-land if added to. But this arrangement probably does
mitigate this problem. Creating metadata inodes for a fixed set of
libraries for a specific application is cheap enough (and considerably
faster than copying it all locally) that the immutability limitation
works for me.
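For reference, a hedged sketch of how such a stack could be mounted: the metadata-only layer as the only regular lower layer, with the NFS tree as a data-only lower layer after a "::" separator. This assumes a kernel with data-only lower layer support (6.5 or later) and that metacopy=on is required for it; overlay_mount_argv and the example paths are hypothetical, and paths containing ':' or ',' are simply rejected rather than escaped:

```python
def overlay_mount_argv(meta_layer, data_layers, target):
    """Build mount(8) argv for a read-only overlay: one metadata-only lower
    layer stacked on one or more data-only lower layers ("::" separator)."""
    for p in [meta_layer, *data_layers]:
        # ':' and ',' are special in the options string; refuse rather than escape
        if ":" in p or "," in p:
            raise ValueError(f"unsupported character in layer path: {p!r}")
    lowerdir = meta_layer + "".join("::" + d for d in data_layers)
    opts = f"lowerdir={lowerdir},metacopy=on"  # no upperdir: read-only overlay
    return ["mount", "-t", "overlay", "overlay", "-o", opts, target]
```

Since there is no upperdir, each application's view is immutable, which matches the limitation described above.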

>> Any files that get copied up to the upper layer are
>> guaranteed to never change in the lower NFS filesystem (by its
>> design), but new directories and files that have not yet been copied
>> up can randomly appear over time. Deletions are not so important
>> because if it has been deleted in the lower level, then the upper
>> level copy failing has similar results (but we should cleanup the
>> upper layer too).
>>
>> If it's possible to get over this first difficult hurdle, then I have
>> another extra bit of complexity to throw on top: now manually make an
>> entire directory tree (of metadata) that we have recursively copied up
>> "opaque" in the upper layer (currently this needs to be done outside of
>> overlayfs). Over time, or after dropping caches, I have found that this
>> (seamlessly?) takes effect for new lookups.
>>
>> I also noticed that in the current implementation, this "opaque"
>> transition actually breaks access to the file because the metadata
>> copy-up sets "trusted.overlay.metacopy" but does not currently add an
>> explicit "trusted.overlay.redirect" to the corresponding lower layer
>> file. But if it did (or we do it manually with setfattr), then it is
>> possible to have an upper level directory that is opaque, contains
>> file metadata only, and redirects to the data in the real files on the
>> lower NFS filesystem.

So once you use opaque dirs and redirects on an upper layer, it's
sounding very similar to redirects into a data-only layer. In either
case you're responsible for producing metadata inodes for each NFS file
you want presented to the application/user.

This way seems interesting and more promising for adding NFS-backed
files "online" though.
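A rough sketch of that manual step, assuming the default "trusted.overlay." namespace, that opaque directories are marked with trusted.overlay.opaque="y", and that each metacopy file's absolute redirect is simply its own path from the upper root (i.e. nothing was renamed). plan_opaque_tree and apply_ops are hypothetical helpers; reading and writing trusted.* xattrs needs CAP_SYS_ADMIN, so the metacopy check is injectable:

```python
import os

OPAQUE_XATTR = "trusted.overlay.opaque"
REDIRECT_XATTR = "trusted.overlay.redirect"
METACOPY_XATTR = "trusted.overlay.metacopy"


def has_metacopy(path):
    """True if the upper-layer file carries the metacopy xattr (root only)."""
    try:
        os.getxattr(path, METACOPY_XATTR, follow_symlinks=False)
        return True
    except OSError:
        return False


def plan_opaque_tree(upper_subtree, upper_root, is_metacopy=has_metacopy):
    """List (path, xattr, value) ops that make a copied-up subtree opaque
    while keeping each metacopy file pointed at its lower data."""
    ops = []
    for dirpath, dirnames, filenames in os.walk(upper_subtree):
        dirnames.sort()  # deterministic, top-down order
        ops.append((dirpath, OPAQUE_XATTR, b"y"))
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if is_metacopy(path):
                # absolute redirect: same relative path in the lower layer
                rel = os.path.relpath(path, upper_root)
                ops.append((path, REDIRECT_XATTR, ("/" + rel).encode()))
    return ops


def apply_ops(ops):
    for path, name, value in ops:
        os.setxattr(path, name, value, follow_symlinks=False)
```

Listing the operations first means the tree can be checked before anything in the upper layer is changed.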

> how can we document it to make the behavior "defined"?
> 
> My thinking is:
> 
> "Changes to the underlying filesystems while part of a mounted overlay
> filesystem are not allowed.  If the underlying filesystem is changed,
> the behavior of the overlay is undefined, though it will not result in
> a crash or deadlock.
> 
> One exception to this rule is changes to underlying filesystem objects
> that were not accessed by an overlayfs prior to the change.
> In other words, once accessed from a mounted overlay filesystem,
> changes to the underlying filesystem objects are not allowed."
> 
> But this claim needs to be proved and tested (write tests),
> before the documentation defines this behavior.
> I am not even sure if the claim is correct.

I've been blissfully and naively assuming that it is based on intuition
:).

I think Daire and I are basically only adding new files to the NFS
filesystem, and both the all-opaque approach and the data-only approach
could prevent accidental access to things on the NFS filesystem through
the overlayfs (or at least portion of it meant for end-user consumption)
while they are still being birthed and might be experiencing changes.
At some point in the NFS tree, directories must be modified, but since
both approaches have overlayfs sourcing all directory entries from local
metadata-only layers, it seems plausible that the directories that
change aren't really "accessed by an overlayfs prior to the change."

How much proving/testing would you want to see before documenting this
and supporting someone in future who finds a way to prove the claim
wrong?

> 
> One more thing that could help said service is if overlayfs
> supported a hybrid mode of redirect_dir=follow,metacopy=on,
> where redirect is enabled for regular files for metacopy, but NOT
> enabled for directories (which was redirect_dir's original use case).
> 
> This way, the service could run the command line:
> $ mv /ovl/blah/thing /ovl/local
> then "mv" will get EXDEV for moving directories and will create
> opaque directories in their place and it will recursively move all
> the files to the opaque directories.

Clever.
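A sketch of the fallback Amir describes, assuming the proposed hybrid mode where rename(2) of regular files succeeds (via metacopy redirects) and rename of directories fails with EXDEV. This is not coreutils' actual code; move_tree is a hypothetical helper, and the rename function is a parameter so the EXDEV path can be exercised outside overlayfs:

```python
import errno
import os


def move_tree(src, dst, rename=os.rename):
    """Try rename(2) first; if a directory rename fails with EXDEV, create a
    fresh directory at dst and move the children into it one by one."""
    try:
        rename(src, dst)
        return
    except OSError as e:
        if e.errno != errno.EXDEV or not os.path.isdir(src):
            raise
    os.mkdir(dst)  # new directory created in place of the un-renamable one
    for name in sorted(os.listdir(src)):
        move_tree(os.path.join(src, name), os.path.join(dst, name), rename)
    os.rmdir(src)  # src is empty once every child has been moved
```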

Thanks,
Mike