We aren't working on this at the moment, but I did have some off-list discuss discussion with Amir Goldstein. I wanted to include our correspondence on this list for posterity --- e.g. us working on this later, someone else working on this later, etc., and he said that is fine, so here it is. It took me a while to find the time splice together all the replies and their quotes, but now I have it. One thing I'll add is that while I still think this "resetting" operation is a good feature for overlay fs in general (and not just our use-case), the FUSE passthrough work (from Android most recently, not yet merged AFIAK) would be an even better fit for our use-case than overlay fs. I don't know if upstreaming that is still being pursued, but if it is, it seams reasonable to just wait for it. Indeed, FUSE passthrough ought to be a good replacement for the whole gamut of exotic bind/union/overlay mounting without out adding endless more functionality and code to the kernel. Cheers, John > On Mon, Jul 24, 2023, 9:29 PM John Ericson <john.ericson@obsidian.systems> wrote: >> On 7/24/23 03:51, Amir Goldstein wrote: >>> On Sun, Jul 23, 2023, 6:38 PM John Ericson >>>> Thanks for this information, Amir. It's very useful. >>>> >>>> On Thu, Jul 20, 2023, at 8:51 PM, Amir Goldstein wrote: >>>>> Hi >>>>> >>>>> Writing offlist because I'm on mobile >>>>> The problem is hard and you are not the first one to ask about it - need to narrow it down to exact requirements to be able to solve it. >>>> Yes, not surprised others have asked about this. If you can point me to the previous times it has been discussed (which I failed to find before), I am happy to read that correspondence rather than ask questions which may already be answered :). >>> Afaik it never got passed "we want to do that"... "It's complicated " maybe you'll be the first ;) >> OK :) >>>> Also, if there was something describing the overall approach of the in-memory data structures, (https://docs.kernel.org/filesystems/overlayfs.html is more user focused, though one can get an inkling of what might be going on from squinting at it), that would be tremendously useful. (Based on the rest of your comments, I think this is the main thing I am missing.) >>>>> On Thu, Jul 20, 2023, 7:37 PM John Ericson <john.ericson@obsidian.systems> wrote: >>>>>> We would like to be able to "reset" an overlay-fs directory entry, i.e. >>>>>> remove whatever might exist for this entry in the upper layer and revert >>>>>> back to whatever is in lower layer. This operation would be akin to a >>>>>> regular removal, except without creating whiteouts to cover up anything >>>>>> in the lower layer. >>>>> >>>>> Do you actually need to get rid of the upper entry or is it ok to just reset the upper entry to the metadata of the lower entry and make it a transparent "metacopy" ? >>>> Today we are modifying the upper layer and then remounting. This has the semantics we want, but of course side-steps these issues by rebuilding the in-memory data overlayfs data structures from scratch (I presume). I am basically open to whatever approach is >>>> easiest that roughly corresponds to those semantics; not trying to put the cart before the horse here demanding extra requirements when I do not know the details of the current implementation well :). >>> No documentation that I know of. Vfs is not very well documented. >> Gotcha. Well, at least good to know I wasn't missing something. >>>>> What if lower entry does not exist? >>>> So not sure how you envisioned this API. >>>>> What if upper was renamed after copy up? And then >>>> For what it is worth, we are not using redict_dir or the inode index. I am afraid I do not understand the significance of renaming outside extensions like those. I can imagine renames/hardlinks can cause inodes to be reused across layers in perhaps-surprising >>>> ways , but I didn't think overlay fs would care about this much. >>>>> What if lower entry is another file or even a directory with same name? >>>> It should be exposed regardless. I was hoping this would still be an O(1) changing some references in the in-memory VFS data structures, but if it is O(n) because the overlayfs has a its own separate copies to a greater degree than I thought, I could see that >>>> being a problem. >>> So not sure how you envisioned this API. >> My interpretation of your idea was similar to `unlinkat`, basically: >> >> int overlayfs_reset_at(int overlay_dirfd, int lower_dirfd, const char pathname, int flags); >> >> The idea is that the path is relative *both* directory file descriptors. If the path has a /, and *both subdirs exist*, this: >> >> overlayfs_reset_at(foo, bar, "asdf/qwer", AT_REMOVEDIR); >> >> >> is shorthand for: >> >> overlayfs_reset_at(openat(foo, "asdf", O_PATH), openat(bar, "asdf", O_PATH), "qwer", AT_REMOVEDIR); >> >> the interesting case is if both subdirs do not exist: >> >> 1. If the lower one doesn't exist, it means the entry in question is within an upper-/overly-only directory. We cannot reset the entry because the overlay/upper parent directory (and possibly some ancestors also) is itself covering it up. The operation can fail in this case, or we can just default to doing a plain old removal instead. >> 2. If the overlay/upper one doesn't exist, it means their must be a file that is covering up an ancestor directory in the directory in the lower layer. The operation fails. >> So the only case where the operation succeeds with a bonafide reset is when we can "cinch up" to both (immediate) parent directories. And to implement the permission check we just need check the read+execute permissions on the lower one. (Conversely, I *don't* think it matters what the permissions on the target of the entry revealed by the restore is, because those will be carried through to the (modified) overlayfs and respected. It is just the existence of the entry which we are guarding against leaking. >> >> All that said, to go back and walk through your scenario >> >>> After rename, the target is covering no entry and the source is a whiteout. >> Great, my understanding matches that. >>> You cannot open "the whiteout" for ioctl so you would not be able to uncover the lower original file with the suggested ioctl method. >>> Same goes for "undoing a delete". >> You should not need to ever open a whiteout, but just the directory that contains the whiteout. >>> In this case maybe using link() with a special sort of tmpfile and using an ioctl with lower real fd as an input argument to overlayfs as a way to get a special sort of lower tmpfile to link in place of the whiteout entry. >> If we are just opening the directories, I don't think these extra hoops are needed? Or am I missing something? > > All I can say is it looks like a very big maybe. > I can't see big flaws right now, but I think there are more details to find out yet > > Also with all the special cases that are not handled you will need to argue your case that the limited functionally is useful for an interesting use case. > >>>>>> As far as our team could discern, the kernel currently does not support >>>>>> this operation. Thus, we are considering what would be necessary to >>>>>> implement this ourselves. Our initial exploration led us to >>>>>> `ovl_do_remove` within `fs/overlayfs/dir.c` and in particular this >>>>>> conditional: >>>>>> >>>>>> if (!lower_positive) >>>>>> err = ovl_remove_upper(dentry, is_dir, &list); >>>>>> else >>>>>> err = ovl_remove_and_whiteout(dentry, &list); >>>>>> >>>>>> That seemed like a good place to begin --- if one were to force the >>>>>> first case no new whiteouts would be created, correct? >>>>> >>>>> I don't think remove is the right way. >>>>> Hard for me to explain why. >>>>> The implementation would be more complicated than this if every to metacopynis not enough and there is no good reference code for doing something like this. >>>> Fair enough. Yes, it is not at all clear what the ramifications of *not* doing a whiteout here are; I wouldn't want to make the underlying layers out of sync with the VFS! >>> The easiest case is to "punch out" the modified data using an ioctl. It does not cover undoing a delete or rename. >> Hmm, I am not sure I follow. Do you mean evict the information on this part of the from the VFS so it must be rebuilt on demand from the underlying layers? If so, that sounds very promising; much nicer to do that and let things be rebuilt on demand by existing code. >>>>> >>>>>> Assuming that is indeed the right place to start, I have two follow-up >>>>>> questions. >>>>>> >>>>>> 1. Since the desired end result of the operation is strictly closer to >>>>>> the lower layer, should we possibly eliminate some of the other >>>>>> operations in a fresh copy of this function? For instance, might >>>>>> `ovl_copy_up` be unnecessary because if the upper layer already doesn't >>>>>> "contribute" to this dir entry, no action would need to be taken? >>>>>> Additionally, what is the significance of `nlink`? We have not found >>>>>> much documentation for it; from what we understand, it's an `xattr` used >>>>>> so some information for the overlay-fs is persisted on disk. >>>>> >>>>> Hard to explain - if you keep upper metacopy and punch our the data all of the above is not relevant. >>>>> >>>>>> 2. What is the recommended approach to expose this functionality? We >>>>>> assume it would be through a new `ioctl`, but with no existing >>>>>> overlay-fs-specific `ioctl` as a reference, we are unsure if that would >>>>>> be the correct choice. We presume there are best practices on this >>>>>> matter that we are not currently aware of. >>>>> >>>>> Not sure. I think there was an ioctl for getflags and it was remove. You can look at git history. >>>> Thanks, I see it was removed in c4fe8aef2f07c8a41169bcb2c925f6a3a6818ca3. I can work from that. (However, I'll set this question aside until I know more about the fundamentals of what we would be doing.) >>>>>> Our intention is to upstream this patch if we write it. It would be >>>>>> therefore beneficial to discuss any objections or concerns beforehand. >>>>>> For instance, one possible issue could be overlay-fs usage which >>>>>> presumes that covered up lower layer data is private and inaccessible. >>>>>> To make it possible to preserve that invariant, permissions for this >>>>>> operation would have to be distinct from write permissions. This concern >>>>>> can thus be addressed, but it would increase the scope of the patch. >>>>> >>>>> I think it is best if the API can prove user has access to lower object (do you have direct access to lower layer?) If ioctl passes open fd of the lower object that caller can read from, it can be used as a proof that there is no security concern with exposing the lower entry data because same user can rewrite the same data. >>>> Oh this is very elegant. Great idea! >>> Good luck. >> Thanks, we'll need it