Re: overlayfs: issue with a replaced lower squashfs with export-table

Amir Goldstein <amir73il@xxxxxxxxx> · Tue, 7 Jul 2020 08:51:53 +0300

On Mon, Jul 6, 2020 at 6:14 PM Fabian <godi.beat@xxxxxxx> wrote:
>
> Hi Amir,
>
> thanks for your mail and the quick reply!
>
> Am Montag, 6. Juli 2020, 16:29:51 CEST schrieb Amir Goldstein:
> > > We are seeing problems using an read-writeable overlayfs (upper) on a
> > > readonly squashfs (lower). The squashfs gets an update from time to time
> > > while we keep the upper overlayfs.
> >
> > It gets updated while the overlay is offline (not mounted) correct?
>
> Yes. We boot into a recovery system outside the rootfs and its overlayfs,
> replace the lower squashfs, and then reboot into the new system.
>
> > > On replaced files we then see -ESTALE ("overlayfs: failed to get inode
> > > (-116)") messages if the lower squashfs was created _without_ using the
> > > "-no-exports" switch.
> > > The -ESTALE comes from ovl_get_inode() which in turn calls
> > > ovl_verify_inode() and returns on the line where the upperdentry inode
> > > gets compared
> > > ( if (upperdentry && ovl_inode_upper(inode) != d_inode(upperdentry)) ).
> > >
> > > A little debugging shows, that the upper files dentry name does not fit to
> > > the dentry name of the new lower dentry as it seems to look for the inode
> > > on the squashfs "export"-lookup-table which has changed as we replaced
> > > the lower fs.
> > >
> > > Building the lower squashfs with the "-no-exports"-mksquashfs option, so
> > > without the export-lookup-table, seems to work, but it might be no longer
> > > exportable using nfs (which is ok and we can keep with it).

Miklos,

At first glance I did not understand how changing lower file handles causes
failure to ovl_verify_inode().
To complete the picture, here is the explanation.

Upper file A was copied up from lower file with inode 10 in old squashfs
and the "origin" file handle composed of the inode number 10 is recorded
in upper file A.

With newly formatted lower, lower A has inode 11 and lower B has inode 10.
Upper file B is copied from lower file B with inode 10 in new squashfs and
the "origin" file handle composed of the inode number 10 is recorded
in upper file B.
Now we have two upper files with the same "origin" that are not hardlinks.

On lookup of both overlay files A and B, ovl_check_origin() decodes lower
file B (inode 10) as the lower inode.
This lower inode is used to get the overlay inode number (10) and as
the key to hash overlay inode in inode cache.

Suppose A is looked up first and it's inode is hashed.
Then B is looked up and in ovl_get_inode() it finds the inode hashed
by the same lower inode in inode cache, but fails ovl_verify_inode()
because:
d_inode(upperdentry) /* B */ != ovl_inode_upper(inode) /* A */

This can also happen when copying overlay layers to a new
fs tree and carrying over the old "origin" xattr.
In practice, the UUID part of the stored "origin" xattr is meant to
protect against decoding lower fh when migrating to another
filesystem, but layers could be migrated inside the same filesystem.
Since squashfs does not have a UUID, re-creating sqhashfs is similar
to migrating layers inside the same filesystem.

We were aware of the "layer migration" case when designing the
index/nfs_export feature, which is one of the reasons they are
opt-in features.

But we enabled the functionality of following non-dir origin
unconditionally because we *thought* it is harmless, as the comment
in ovl_lookup() says:

         /*
         * Lookup copy up origin by decoding origin file handle.
         * We may get a disconnected dentry, which is fine,
         * because we only need to hold the origin inode in
         * cache and use its inode number.  We may even get a
         * connected dentry, that is not under any of the lower
         * layers root.  That is also fine for using it's inode
         * number - it's the same as if we held a reference
         * to a dentry in lower layer that was moved under us.
         */

The patch I posted disabled decoding of non-dir origin for the special
case of lower null uuid.

I think we can also warn and auto-disable decoding non-dir origin in
case index is disabled and we detect this upper inode conflict in
ovl_verify_inode().

The problem is if A is not metacopy and looked up first, and B is
metacopy and looked up second, then conflict will be deleted after
the wrong inode has been hashed.

Perhaps we should disable decoding non-dir origin with in case
metacopy=on,index=off?
Maybe also provide a user option to disable decoding non-dir origin
at the price of losing persistent inode number for copied up non-dir?
Something like 'index=nofollow'.

Thoughts?
Am I overthinking this?

Thanks,
Amir.