Re: topics for the file system mini-summit

Matthew Wilcox wrote:

On Wed, May 31, 2006 at 08:24:18PM -0700, Valerie Henson wrote:
Actually, the continuation inode is in B.  When we create a link in
directory A to file C, a continuation inode for directory A is created
in domain B, and a block containing the link to file C is allocated
inside domain B as well.  So there is no continuation inode in domain
A.

That being said, this idea is at the hand-waving stage and probably
has many other (hopefully non-fatal) flaws.  Thanks for taking a look!
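
Concretely, I'm imagining something along these lines (all of the names here are
invented, purely to illustrate the layout; this is still hand-waving):

/* Directory A (in domain A) holds a reference to a continuation inode
 * that lives in domain B; that continuation inode owns dirent blocks
 * allocated inside domain B, and those dirents point at B-local inodes
 * such as file C.
 */
struct dir_cont_ref {           /* stored in directory A's inode */
        __u32   cr_domain;      /* domain B */
        __u32   cr_ino;         /* continuation inode within domain B */
};

struct cont_dirent {            /* lives in a block inside domain B */
        __u32   cd_ino;         /* B-local inode number (file C) */
        __u16   cd_namelen;     /* length of cd_name */
        char    cd_name[];      /* entry name */
};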

OK, so we really have two kinds of continuation inodes, and it might be
sensible to name them differently.  We have "here's some extra data for
that inode over there" and "here's a hardlink from another domain".  I
dub the first one a 'continuation inode' and the second a 'shadow inode'.

Continuation inodes and shadow inodes both suffer from the problem
that they might be unwittingly orphaned, unless they have some kind of
back-link to their referrer.  That seems more soluble though.  The domain
B minifsck can check to see if the backlinked inode or directory is
still there.  If the domain A minifsck prunes something which has a link
to domain B, it should be able to just remove the continuation/shadow
inode there, without fscking domain B.
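
The back-link needn't be much more than this (a minimal sketch, field names made up):

/* Hypothetical back-link stored in every continuation/shadow inode.
 * It names the (domain, inode, generation) of the referrer, so the
 * domain B minifsck can verify the owner still exists, and the
 * domain A minifsck knows exactly which remote inode to remove when
 * it prunes the referrer.
 */
struct cont_backlink {
        __u32   bl_owner_domain;        /* domain holding the referrer */
        __u32   bl_owner_ino;           /* referrer's inode in that domain */
        __u32   bl_owner_gen;           /* generation, to catch inode reuse */
};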

Another advantage to this is that inodes never refer to blocks outside
their zone, so we can forget about all this '64-bit block number' crap.
We don't even need 64-bit inode numbers -- we can use special direntries
for shadow inodes, and inodes which refer to continuation inodes need
a new encoding scheme anyway.  Normal inodes would remain 32-bit and
refer to the local domain, and shadow/continuation inode numbers would
be 32-bits of domain, plus 32-bits of inode within that domain.
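
i.e. something like this (illustrative macros only, not a proposed on-disk format):

/* A normal inode number stays 32 bits and is implicitly local; a
 * shadow/continuation reference packs 32 bits of domain plus 32 bits
 * of inode within that domain into one 64-bit value stored in the
 * special direntry or continuation pointer.
 */
typedef __u64   xdomain_ino_t;

#define XDOM_INO(domain, ino)   (((__u64)(domain) << 32) | (__u32)(ino))
#define XDOM_DOMAIN(xino)       ((__u32)((xino) >> 32))
#define XDOM_LOCAL_INO(xino)    ((__u32)((xino) & 0xffffffffULL))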

So I like this ;-)

Surely XFS must have a more elegant solution than this?
XFS may be a bit better suited to this "encapsulated" form of inode/directory management,
since its AGs already try to keep metadata close to the file data.
So it would be quite feasible to take a particular AG offline and run a consistency check on it.

But yes, hard links pose the same problem as the one being discussed here.
File data can also span AGs, and thus creates interdependencies between AGs in terms of
both the file data and the metadata blocks that manage the extents.
Still, the idea of creating continuation inodes seems like a good one.
For XFS it might be better to do this at the AG level, so that as soon as a hard link
in one AG refers to an inode in another AG, the AGs are flagged as
being linked.
This would allow any form of interdependent data to be grouped
(quotas, extended attributes, etc.)
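
Just to hand-wave what I mean (none of this exists in the AGF today; the struct and
flag are made up):

/* Hypothetical per-AG bookkeeping: mark an AG as entangled with another
 * AG the moment a hard link (or quota/EA dependency) crosses the
 * boundary, so a per-AG check knows it cannot be run in isolation.
 */
struct ag_xlink_info {
        __u32   xl_flags;               /* XFS_AGX_LINKED, etc. */
        __u32   xl_linked_agno;         /* last AG we got entangled with */
};

#define XFS_AGX_LINKED  (1 << 0)        /* made-up flag */

static inline void
ag_mark_linked(struct ag_xlink_info *xl, __u32 other_agno)
{
        xl->xl_flags |= XFS_AGX_LINKED;
        xl->xl_linked_agno = other_agno;
        /* one could also keep a bitmap of all the AGs it is linked to */
}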



val@goober:/usr/src/linux-2.6.16.19$ wc -l `find fs/xfs/ -type f`
[snip]
109083 total

Well, yes.  I think that inside the Linux XFS implementation there's a
small and neat filesystem struggling to get out.  Once SGI finally dies,
perhaps we can rip out all the CXFS stubs and IRIX compatibility.  Then
we might be able to see it.

For fun, if you're a masochist, try to follow the code flow for
something easy like fsync().

const struct file_operations xfs_file_operations = {
       .fsync          = xfs_file_fsync,
};

xfs_file_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
       struct inode    *inode = dentry->d_inode;
       vnode_t         *vp = vn_from_inode(inode);
       int             error;
       int             flags = FSYNC_WAIT;

       if (datasync)
               flags |= FSYNC_DATA;
       VOP_FSYNC(vp, flags, NULL, (xfs_off_t)0, (xfs_off_t)-1, error);
       return -error;
}

#define _VOP_(op, vp)   (*((vnodeops_t *)(vp)->v_fops)->op)
Don't forget the extremely hard to untangle behaviors.
#define VNHEAD(vp)    ((vp)->v_bh.bh_first)
#define VOP(op, vp)    (*((bhv_vnodeops_t *)VNHEAD(vp)->bd_ops)->op)

Which I won't even try to explain, because they confuse the crap out of me.
But that is what CXFS uses to create the different call chains.

Oh, and to make things even more evil, the call chains are changed dynamically
based on whether an inode has a CXFS client or not.
With no CXFS client the call chain is about the same as local XFS, but when a client
comes in, CXFS inserts more behaviors/VOPs that hook up all the cluster management
stuff for that inode.
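
A grossly simplified model of what that looks like (toy types, nothing like the real
bhv code, just to show why the dispatch changes when a layer is inserted):

/* Each vnode carries a chain of behaviors, each with its own ops
 * vector, and the VOP macros always dispatch through the head of the
 * chain.  CXFS "comes in" by pushing a new behavior onto the head, so
 * the same VOP_FSYNC() call suddenly runs the cluster code first,
 * which may then call down the chain to the local XFS ops.
 */
struct toy_vnodeops;

struct toy_behavior {
        struct toy_behavior             *bd_next;       /* next (lower) behavior */
        const struct toy_vnodeops       *bd_ops;        /* ops for this layer */
};

struct toy_vnode {
        struct toy_behavior     *v_head;        /* head of the chain */
};

struct toy_vnodeops {
        int     (*vop_fsync)(struct toy_behavior *bdp, int flags);
};

static void
toy_insert_behavior(struct toy_vnode *vp, struct toy_behavior *new)
{
        new->bd_next = vp->v_head;      /* the new layer now sits on top */
        vp->v_head = new;
}

static int
toy_vop_fsync(struct toy_vnode *vp, int flags)
{
        /* always dispatch through whatever is at the head right now */
        return vp->v_head->bd_ops->vop_fsync(vp->v_head, flags);
}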

#define VOP_FSYNC(vp,f,cr,b,e,rv)                                       \
       rv = _VOP_(vop_fsync, vp)((vp)->v_fbhv,f,cr,b,e)

vnodeops_t xfs_vnodeops = {
       .vop_fsync              = xfs_fsync,
};

Finally, xfs_fsync actually does the work.  The best bit about all this
abstraction is that there's only one xfs_vnodeops defined!  So this could
all be done with an xfs_file_fsync() that munged its parameters and called
xfs_fsync() directly.  That wouldn't even affect IRIX compatibility,
but it would make life difficult for CXFS, apparently.
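
Something along these lines, roughly (hypothetical: XFS_I() and xfs_fsync_direct()
are stand-ins here for a reworked local interface, not existing functions):

/* Flattened version: skip the vnode/behavior layer and call into XFS
 * directly.  xfs_fsync_direct() stands in for an xfs_fsync() that no
 * longer takes a bhv_desc_t, and XFS_I() for an inode-to-xfs_inode
 * helper.
 */
STATIC int
xfs_file_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
        struct inode            *inode = dentry->d_inode;
        struct xfs_inode        *ip = XFS_I(inode);
        int                     flags = FSYNC_WAIT;

        if (datasync)
                flags |= FSYNC_DATA;

        return -xfs_fsync_direct(ip, flags, (xfs_off_t)0, (xfs_off_t)-1);
}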

So some of my ex-coworkers at SGI will disagree with the following, but...

The VOPs that are left in XFS are completely pointless at this point; since XFS never
has anything other than one call chain, it shouldn't have to deal with all that stuff
in local mode.

All the behavior call chaining should be handled by CXFS, and thus all the VOP code
should be pushed into that code base.  I think there are four VOP calls that are used
internally by XFS, and the callers of those VOPs may need something else that provides
a way of re-entering the call chain at the top.

I have done some of the work of just replacing the VOP calls with straight calls
to the final functions, in the hope of tossing the vnodeops out of XFS.
And I spec'd out a way of fixing CXFS to deal with the VOPs internally, but unfortunately
that kind of work will always fall under the ENORESOURCES category.


I know SGI will never take it as long as CXFS lives, but maybe someday when
SGI finally fizzles... :-)

Oh, and the whole IRIX compat argument is moot at this point, since many of the VOP
call params have already been changed to match the Linux params.

http://oss.sgi.com/projects/xfs/mail_archive/200308/msg00214.html


