Re: moving affs + RDB partition support to staging?

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Sun, 6 May 2018 08:40:00 +0100

On Sun, May 06, 2018 at 01:59:51AM +0100, Al Viro wrote:

> > There is nothing at the moment that needs fixing.
> 
> Funny, that...  I'd been going through the damn thing for the
> last week or so; open-by-fhandle/nfs export support is completely
> buggered.  And as for the rest... the least said about the error
> handling, the better - something like rename() hitting an IO
> error (read one) can not only screw the on-disk data into the
> ground, it can do seriously bad things to kernel data structures.

... and while we are at it, consider the following:

mkdir a
mkdir b
touch a/x
ln a/x b

<umount and mount again, to get cold dcache, or just wait for memory
pressure to evict those suckers>

process A: unlink("a/x");
process B: open("b/x");

We had two entries - one in a, another in b; the one in a is ST_FILE one,
the one in b - ST_LINKFILE, with ->original refering to the ST_FILE one.

unlink() needs to kick the entry out of a; since it can't leave an
ST_FILE not included into any directory (and since the file should live on
due to b/x still being there) it ends up removing ST_LINKFILE entry from
b and moving ST_FILE one in its place.  That happens in affs_remove_link().

However, open() gets there just as this surgery is about to begin.
It gets to affs_lookup(), which finds the entry for b/x, reads it,
stashes block number of that entry into dentry->d_fsdata, notices
that it's ST_LINKFILE one, picks the block number of ST_FILE one
out of it and proceeds to look up the inode; that doesn't end up
doing any work (the same inode is in icache due to a/x having been
looked up by unlink(2)).

In the meanwhile, since the hash table of b has been unlocked once
we'd done hash lookup, affs_remove_link() can proceed with the
surgery.  It *does* try to catch dentries with ->d_fsdata containing
the block number of sacrificed ST_LINKFILE and reset that to block
number of ST_FILE that gets moved in its place.  However, it does
so by
static void
affs_fix_dcache(struct inode *inode, u32 entry_ino)
{
        struct dentry *dentry;
        spin_lock(&inode->i_lock);
        hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
                if (entry_ino == (u32)(long)dentry->d_fsdata) {
                        dentry->d_fsdata = (void *)inode->i_ino;
                        break;
                }
        }
        spin_unlock(&inode->i_lock);
}
i.e. looping through dentries that point to our inode.  Except that
*our* dentry isn't attached to inode yet, so we are SOL - it's
left with ->d_fsdata pointing to (destroyed) ST_LINKFILE.

After a while process B closes the file and unlinks it.  Take a look
at affs_remove_header() and you'll see how much fun we are in for -
it uses ->d_fsdata to pick the entry to find and remove...

That's an fs corruptor, plain and simple.  As far as I had been able
to reconstruct the history, leftover after Roman's locking changes
17 years ago.  AFAICS, I'd missed it back then - the races I'd spotted
had been dealt with about a year later (when we started to lock the
victim in vfs_unlink/vfs_rmdir/vfs_rename), but this one went unnoticed...