On Thu, 17 Oct 2024 at 10:18, Amir Goldstein <amir73il@xxxxxxxxx> wrote: > It has been like that since the first upstream version. > My guess is that it is an attempt to avoid turning wdentry > into a negative dentry, which is not expected to be useful in > ovl_clear_empty() situations, but this is just a guess. Yes. This was discussed in a private thread before merging overlayfs upstream. Copying relevant parts here: ---------- From: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Date: Sat, 18 Oct 2014 at 10:18 To: Miklos Szeredi <miklos@xxxxxxxxxx> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>, <mszeredi@xxxxxxx>, David Howells <dhowells@xxxxxxxxxx>, Sage Weil <sage@xxxxxxxxxxx> [Cc to Sage due to interesting ceph bit that has shown up from grepping - see the very end] On Sat, Oct 18, 2014 at 06:01:53AM +0100, Al Viro wrote: First of all, I've just fixed a dumb braino in ovl_clear_empty(); assignment to upper needed to be moved up to the added test. Force-pushed to the same branch - vfs.git#ovl-experimental. > As for the "what filesystems are we OK with", I wonder if looking into the > sucker's ->s_d_op (or ->d_op of root of lower tree, for that matter) would > be a good approximation. I really think that ->d_{weak_,}revalidate() in > there is complete no-go, ditto for ->d_manage() and ->d_automount() and > I would consider ->d_compare() or ->d_hash() as a cause to be _very_ cautious. > > Alternatively, we could just go ahead and add FS_OK_FOR_OVERLAY_LOWER into > fs type flags and mark the obvious good ones. It's not _that_ much work. > > I'd still like to hear details on the plans re d_path(); I don't consider > that a deal-breaker, but we'd better have some clear idea of what we are > getting into. BTW, why on the Earth are you pinning that ->__upperdentry twice? The comment about d_delete() makes no sense whatsoever - anything other than overlayfs itself would have to grab a reference to call that d_delete(), which would give you refcount greater than 1 automatically. So it would have to be overlayfs passing that thing to d_delete() or something that would call it, right? Now, d_delete() itself isn't called there at all. Which leaves passing the sucker to something outside that would call d_delete(). Now, what would it be? Here's the full list of d_delete() callers: arch/s390/hypfs/inode.c:82: d_delete(dentry); drivers/infiniband/hw/ipath/ipath_fs.c:318: d_delete(dir); drivers/infiniband/hw/qib/qib_fs.c:512: d_delete(dir); drivers/usb/gadget/function/f_fs.c:1560: d_delete(epfile->dentry); drivers/usb/gadget/legacy/inode.c:1611: d_delete (dentry); fs/btrfs/ioctl.c:2507: d_delete(dentry); fs/ceph/dir.c:893: d_delete(dentry); fs/ceph/inode.c:1114: d_delete(dn); fs/ceph/inode.c:1223: d_delete(dn); fs/ceph/inode.c:1395: d_delete(dn); fs/configfs/dir.c:643: d_delete(child); fs/configfs/dir.c:834: d_delete(dentry); fs/configfs/dir.c:880: d_delete(dentry); fs/configfs/dir.c:1475: d_delete(new_dentry); fs/configfs/dir.c:1721: d_delete(dentry); fs/debugfs/inode.c:483: d_delete(dentry); fs/devpts/inode.c:666: d_delete(dentry); fs/efivarfs/file.c:50: d_delete(file->f_dentry); fs/fuse/dir.c:1061: d_delete(entry); fs/namei.c:3586: d_delete(dentry); fs/namei.c:3702: d_delete(dentry); fs/nfs/dir.c:1760: d_delete(dentry); fs/nfs/nfs4proc.c:2231: d_delete(dentry); fs/ocfs2/dlmglue.c:3752: d_delete(dentry); fs/reiserfs/xattr.c:95: d_delete(dentry); fs/reiserfs/xattr.c:111: d_delete(dentry); net/sunrpc/rpc_pipe.c:607: d_delete(dentry); net/sunrpc/rpc_pipe.c:634: d_delete(dentry); security/selinux/selinuxfs.c:1212: d_delete(d); We are talking about the *upper* layer; that excludes most of those guys. At the very least, you want that fs to support rename and xattrs. So hypfs, infinibarf ones, gadgetfs, configfs, debugfs, devpts, efivarfs, sunrpc and selinuxfs are right out. Moreover, all of those are not in the codepaths reachable from overlayfs - all of that is removal of object driven by external event. And we end up using a reference other than what overlayfs would be holding. The same goes for reiserfs xattr code (it calls d_delete() for references it has acquired itself) and for ocfs2. NFS is also not an option for upper layer, according to overlayfs docs. FUSE is in the same boat as ocfs2 and reiserfs - we acquire the reference by d_lookup() in the same function. The same goes for btrfs caller (s/d_lookup/lookup_one_len/), not to mention that this code won't be called by overlayfs. What's left? fs/ceph/dir.c:893: d_delete(dentry); ceph_unlink(). fs/ceph/inode.c:1114: d_delete(dn); ceph_fill_trace(), dn comes from d_lookup(). Not an issue. fs/ceph/inode.c:1395: d_delete(dn); ceph_readdir_prepopulate(), dn comes from d_lookup(). Not an issue. fs/ceph/inode.c:1223: d_delete(dn); ceph_fill_trace(), again. Hell knows - it's really hard to read ;-/ fs/namei.c:3586: d_delete(dentry); vfs_rmdir() fs/namei.c:3702: d_delete(dentry); vfs_unlink() Now, ceph_unlink() can come only from vfs_unlink(). So we are down to the following: victim of vfs_unlink(), victim of vfs_rmdir(), _maybe_ something strange coming from that ceph_fill_trace() callsite. We definitely do have vfs_unlink() and vfs_rmdir() calls in overlayfs. Not many, though - there's ovl_remove_upper(), calling them directly, and there's ovl_cleanup(), calling them via ovl_do_{unlink,rmdir}() wrappers. For one thing, we could do dget()/dput() in both of those guys. However, looking at the callers of ovl_cleanup() shows that with two exceptions we have already grabbed/dropped dentry around the callsite. Exceptions are ovl_clear_empty() and ovl_remove_and_whiteout(). Could as well put that dget()/dput() in those two, rather than in ovl_clear_empty()... IOW, modulo that ceph thing we could trivially avoid that double reference to ->__upperdentry, just by doing a temporary dget()/dput() in a few places in fs/overlayfs/dir.c. Objections? A bunch of code becomes simpler that way, IMO... Question to Sage: what's that d_delete() in ceph_fill_trace() about? It's this bit: /* null dentry? */ if (!rinfo->head->is_target) { dout("fill_trace null dentry\n"); if (dn->d_inode) { dout("d_delete %p\n", dn); d_delete(dn); } else { dout("d_instantiate %p NULL\n", dn); d_instantiate(dn, NULL); if (have_lease && d_unhashed(dn)) d_rehash(dn); update_dentry_lease(dn, rinfo->dlease, session, req->r_request_started); } goto done; } What codepaths could lead us there and where could that dentry have come from? Overlayfs aside, the things can get rather interesting if it could, e.g. turn out to be an existing mountpoint... ---------- From: David Howells <dhowells@xxxxxxxxxx> Date: Sun, 19 Oct 2014 at 11:32 To: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Cc: <dhowells@xxxxxxxxxx>, Miklos Szeredi <miklos@xxxxxxxxxx>, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>, <mszeredi@xxxxxxx> Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > David, could you give a braindump on selinux issues? I hadn't watched your > conversation with Ian closely, so I'd rather avoid doing second-hand > retelling... The problem boils down to this: When they fall through from the top layer to a lower source layer, Overlayfs and unionmount both set up struct file to point directly to the file in the lower layer. This is then passed to various security_xxx() functions. For labelled-inode-based LSM's this means they see the label on the lower inode, not the label on the overlay/union inode - indeed in unionmount, there *is* no upper/union inode. Overlayfs is more complicated than unionmount in that there are three layers and it falls through from the overlay layer to the upper layer (ie. the change stash) too. In this case also the overlay inode is unavailable to the LSM. Unionmount mitigates the lower-layer label problem by pointing file->f_path at the union dentry and file->f_inode at the lower inode (and file->f_path->dentry->d_fallthru at the lower dentry). Further, unionmount has no upper layer problem since the changes are stored in the union layer itself. Having discussed this with various people in the context of docker, a tentative consensus has been reached: (1) The docker source tree (ie. the lower layer) will all be under a single label. (2) The docker root (ie. the overlay/union layer) will all be under a single, but different label set on the overlay mount (and each docker root may be under its own label). (3) Inodes in the overlayfs upper layer will be given the overlay label. (4) A security_copy_up() operation will be provided to set the label on the upper inode when it is created. (5) A security_copy_up_xattr() operation will be provided to vet (and maybe modify) each xattr as it is copied up. (6) An extra label slot will be placed in struct file_security_struct to hold the overlay label. (7) security_file_open() will need to be given both the overlay and lower dentries. For overlayfs, the way this probably should be done is file->f_path should be set to point to the overlay dentry (thus getting /proc right) and file->f_inode to the lower file and make use of d_fallthru in the overlay dentry in common with unionmount. (8) When the lower file is accessed, both the lower and overlay labels should be checked and audited. (9) When the upper file is accessed, only the overlay label needs to be checked and audited. David ---------- From: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Date: Mon, 20 Oct 2014 at 17:47 To: David Howells <dhowells@xxxxxxxxxx> Cc: Miklos Szeredi <miklos@xxxxxxxxxx>, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>, <mszeredi@xxxxxxx> On Sun, Oct 19, 2014 at 10:31:52AM +0100, David Howells wrote: > For overlayfs, the way this probably should be done is file->f_path should > be set to point to the overlay dentry (thus getting /proc right) and > file->f_inode to the lower file and make use of d_fallthru in the overlay > dentry in common with unionmount. To elaborate a bit: a _lot_ of places in filesystems that used to use ->f_path.dentry->d_inode had been eliminated in favour of file_inode(...) and all the remaining ones ought to follow. With that done (I was actually planning to do whack-a-mole session on those guys after most of this cycle merges would be done - Linus, would you accept that in -rc2?) we get surprisingly few places that even look at ->f_path.dentry. Some of those are refering to ->d_name; we need to review those for other reasons (potential rename() races), but for unionmount/overlayfs purposes we couldn't care less which of dentries is used - both overlayfs and underlying fs dentry have the same name. FWIW, a bunch of uses are in printks, and those should become %pD... A bunch of places uses ->f_path.dentry->d_sb to get the superblock by file; file_inode()->i_sb would do just fine in filesystems. And places like that *outside* of filesystems need a bit of review - the question is which superblock do we want? That of overlayfs or that of the layer? The latter would be file_inode()->i_sb, again. The former would be a problem with overlayfs in its current form; with leaving f_path to point to overlayfs it would work fine. dir_emit_dot() and dir_emit_dotdot() use ->f_path.dentry, but those are not problem - overlayfs explicitly opens directory in a layer. AFAICS, nothing of what remains is on paths hot enough to really care about an extra dereference. So I think that after the dust from cleanups settles, we'll be able to add an inlined helper usable for accesses to file's dentries, and ban open-coded ->f_path.dentry in filesystems. ---------- From: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Date: Wed, 22 Oct 2014 at 05:36 To: Miklos Szeredi <miklos@xxxxxxxxxx> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>, <mszeredi@xxxxxxx> On Fri, Oct 17, 2014 at 05:30:52PM +0200, Miklos Szeredi wrote: > Will do patches ASAP (which probably means next week) for all but having proper > d_path() on the leaves. Ping? d_path() will obviously have to wait (my preference would be to leave proper overlayfs dentry in ->f_path.dentry, set ->f_inode to one from the layer and make sure that all filesystem code will DTRT; we are *very* close to that already), but that's a) doable after the thing went in (it's not that much rewrite and VFS-side it will probably just mean death to ->dentry_open() - sure, it's not nice to put the method in just for one cycle, but it's not particulary tragic, especially if it's clearly marked as "it's going away very soon, do not rely on it") and b) is next cycle fodder The same goes for the weirdness with double dget() on __upperdentry - it's absolutely self-contained and we can deal with it later. I would obviously prefer fewer odd warts when it goes in, but this one isn't a bug per se. rmdir() failure, OTOH, is one. So's the memory footprint of cached union of directories, seeing that every struct file over a directory gets a copy of its own. So's accepting layers with non-trivial ->d_revalidate, ->d_compare, ->d_hash, ->d_automount and ->d_manage. The interim branch is in vfs.git#ovl-experimental; do you want me to post it as-is same way you posted previous iterations? Another thing: one more mail in that thread had bounced from google mailsewers. This time it claimed that delivery to Linus has failed. I've no idea if they report all addresses; I've put the bounce message on ftp.linux.org.uk/pub/viro/bounced2. I really wonder what's going on with those filters... ---------- From: Miklos Szeredi <miklos@xxxxxxxxxx> Date: Wed, 22 Oct 2014 at 10:12 To: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>, <mszeredi@xxxxxxx> On Wed, Oct 22, 2014 at 04:35:58AM +0100, Al Viro wrote: > On Fri, Oct 17, 2014 at 05:30:52PM +0200, Miklos Szeredi wrote: > > > Will do patches ASAP (which probably means next week) for all but having proper > > d_path() on the leaves. > > Ping? d_path() will obviously have to wait (my preference would be to leave > proper overlayfs dentry in ->f_path.dentry, set ->f_inode to one from the > layer and make sure that all filesystem code will DTRT; we are *very* close > to that already), but that's > a) doable after the thing went in (it's not that much rewrite and > VFS-side it will probably just mean death to ->dentry_open() - sure, it's > not nice to put the method in just for one cycle, but it's not particulary > tragic, especially if it's clearly marked as "it's going away very soon, > do not rely on it") and > b) is next cycle fodder > > The same goes for the weirdness with double dget() on __upperdentry - it's > absolutely self-contained and we can deal with it later. Done, but not terribly urgent for me either. > I would obviously > prefer fewer odd warts when it goes in, but this one isn't a bug per se. > > rmdir() failure, OTOH, is one. I looked at this, and found no bug: it does raise CAP_DAC_OVERRIDE in both callers (ovl_do_remove() and ovl_rename2()). > So's the memory footprint of cached > union of directories, seeing that every struct file over a directory > gets a copy of its own. Done, at the cost of 100 or so extra lines *and* is still DoS-able if the directory is changed between reading it from the different file instances. I've been poring over this last night, and the proper solution would be to update the cache as it changes, so we have only one cache. Not hard to do conceptually, but not a small change either. I'm wondering if it's OK to get i_mutex in ->release(). It would simplify the locking... > So's accepting layers with non-trivial ->d_revalidate, > ->d_compare, ->d_hash, ->d_automount and ->d_manage. Done. Also moved notify_change() into copy-up for setattr, so the attribute update is atomic. And a few other cleanups and fixes. Updated the overlayfs.current branch here: git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.current Thanks, Miklos