Re: topics for the file system mini-summit

Matthew Wilcox <matthew@xxxxxx> · Thu, 1 Jun 2006 06:45:17 -0600

On Wed, May 31, 2006 at 08:24:18PM -0700, Valerie Henson wrote:
> Actually, the continuation inode is in B.  When we create a link in
> directory A to file C, a continuation inode for directory A is created
> in domain B, and a block containing the link to file C is allocated
> inside domain B as well.  So there is no continuation inode in domain
> A.
> 
> That being said, this idea is at the hand-waving stage and probably
> has many other (hopefully non-fatal) flaws.  Thanks for taking a look!

OK, so we really have two kinds of continuation inodes, and it might be
sensible to name them differently.  We have "here's some extra data for
that inode over there" and "here's a hardlink from another domain".  I
dub the first one a 'continuation inode' and the second a 'shadow inode'.

Continuation inodes and shadow inodes both suffer from the problem
that they might be unwittingly orphaned, unless they have some kind of
back-link to their referrer.  That seems more soluble, though.  The domain
B minifsck can check to see if the backlinked inode or directory is
still there.  If the domain A minifsck prunes something which has a link
to domain B, it should be able to just remove the continuation/shadow
inode there, without fscking domain B.

Another advantage of this is that inodes never refer to blocks outside
their domain, so we can forget about all this '64-bit block number' crap.
We don't even need 64-bit inode numbers -- we can use special direntries
for shadow inodes, and inodes which refer to continuation inodes need
a new encoding scheme anyway.  Normal inodes would remain 32-bit and
refer to the local domain, and shadow/continuation inode numbers would
be 32 bits of domain number plus 32 bits of inode number within that domain.

So I like this ;-)

> > Surely XFS must have a more elegant solution than this?
> 
> val@goober:/usr/src/linux-2.6.16.19$ wc -l `find fs/xfs/ -type f`
> [snip]
>  109083 total

Well, yes.  I think that inside the Linux XFS implementation there's a
small and neat filesystem struggling to get out.  Once SGI finally dies,
perhaps we can rip out all the CXFS stubs and IRIX compatibility.  Then
we might be able to see it.

For fun, if you're a masochist, try to follow the code flow for
something easy like fsync().

const struct file_operations xfs_file_operations = {
        /* ... */
        .fsync          = xfs_file_fsync,
};

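static int
xfs_file_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
        struct inode    *inode = dentry->d_inode;
        vnode_t         *vp = vn_from_inode(inode);    /* Linux inode -> XFS vnode */
        int             error;
        int             flags = FSYNC_WAIT;

        if (datasync)
                flags |= FSYNC_DATA;
        /* VOP_FSYNC is a macro: it dispatches through the vnode's ops
         * table and assigns the result to 'error' */
        VOP_FSYNC(vp, flags, NULL, (xfs_off_t)0, (xfs_off_t)-1, error);
        return -error;  /* XFS errors are positive; the VFS wants negative */
}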

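/* Look up 'op' in the vnode's operations table */
#define _VOP_(op, vp)   (*((vnodeops_t *)(vp)->v_fops)->op)

/* Call vop_fsync on the vnode's first behavior; store the result in rv */
#define VOP_FSYNC(vp,f,cr,b,e,rv)                                       \
        rv = _VOP_(vop_fsync, vp)((vp)->v_fbhv,f,cr,b,e)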

vnodeops_t xfs_vnodeops = {
        /* ... */
        .vop_fsync              = xfs_fsync,
};

Finally, xfs_fsync actually does the work.  The best bit about all this
abstraction is that there's only one xfs_vnodeops defined!  So this could
all be done with an xfs_file_fsync() that munged its parameters and called
xfs_fsync() directly.  That wouldn't even affect IRIX compatibility,
but it would make life difficult for CXFS, apparently.

http://oss.sgi.com/projects/xfs/mail_archive/200308/msg00214.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html