On Wed, 17 Oct 2012, Adam C. Emerson wrote:
> Mr. Farnum,
>
> At Wed, 17 Oct 2012 15:04:23 -0700, Gregory Farnum wrote:
> >
> > I still don't get it. Putting every inode's primary link in a lookup
> > directory and then patching the lookup code to go there makes sense
> > to me. But if you have to go the other way (from the inode
> > directory's secondary link to some other location as the primary
> > link), you need an up-to-date path for that primary link, right?
> > How do you handle it when the path changes? Do you have a two-phase
> > commit on the lookup directory attributes?
>
> Our idea isn't to have the inode directory contain links back to the
> primary. Our idea is to have a structure managed by MDSs that is
> looked up by inode number and spread across the MDSs in a cluster
> similarly to the way CRUSH maps files across OSDs. This structure
> contains all the information currently in the inode that's now
> incorporated into the dirent.
>
> The dirents would then contain mappings from names to inodes and
> possibly cache (but not be the primary for) inode content. We were
> also planning to change directory fragmentation to distribute
> fragments across MDSs based on a function of the filename, also
> similarly to how CRUSH maps objects to OSDs.

I think this basic approach is viable. However, I'm hesitant to give up
on embedded inodes because of the huge performance wins in the common
cases; I'd rather have a more expensive lookup-by-ino in the rare cases
where NFS filehandles are out of cache and be smokin' fast the rest of
the time.

Are there reasons you're attached to your current approach? Do you see
problems with a generalized "find this ino" function based on the file
objects (sketched at the end of this mail)? I like the latter because
it

 - means we can scrap the anchor table, which needs additional work
   anyway if it is going to scale
 - is generally useful for fsck
 - solves the NFS fh issue

The only real downsides I see to this approach are:

 - more OSD ops (setxattrs... if we're smart, they'll be cheap)
 - lookup-by-ino for resolving hard links may be slower than the anchor
   table, which gives you *all* ancestors in one lookup, vs this, which
   may range from 1 lookup to (depth of tree) lookups (or possibly
   more, in rare cases)

For all the reasons that the anchor table was acceptable for hard links
(hard link rarity, parallel link patterns), though, I can live with it.
There are also lots of people who seem to be putting BackupPC (or
whatever it is) on Ceph, which creates huge messes of hard links, so it
will be really good to solve/avoid the current anchor table scaling
problems.

sage
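
P.S. To make the "find this ino" idea a little more concrete, here is a
rough sketch of what it might look like from userspace, via the librados
Python bindings. This is purely illustrative: the "parent" xattr name,
its encoding (one "ancestor-ino dentry-name" pair per line, immediate
parent first), and the "data" pool are assumptions made up for the
example, not anything that exists today.

    import rados

    # Illustrative sketch only: the "parent" xattr and its encoding are
    # hypothetical; only the "<ino hex>.<block>" object naming matches
    # what we do now.

    def store_backtrace(ioctx, ino, ancestors):
        # ancestors: list of (parent ino, dentry name) pairs, immediate
        # parent first. One setxattr on the file's first object -- the
        # "more OSD ops" cost, paid at create/rename time.
        obj = "%x.%08x" % (ino, 0)
        val = "\n".join("%x %s" % (p, n) for p, n in ancestors)
        ioctx.set_xattr(obj, "parent", val.encode())

    def lookup_by_ino(ioctx, ino):
        # Common case: a single OSD op recovers the whole ancestry. If
        # the stored backtrace turns out to be stale (a rename raced
        # with us), the MDS would walk upward verifying each ancestor --
        # the 1..(depth of tree) lookups mentioned above.
        obj = "%x.%08x" % (ino, 0)
        raw = ioctx.get_xattr(obj, "parent").decode()
        names = [line.split(" ", 1)[1] for line in raw.splitlines()]
        return "/" + "/".join(reversed(names))

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("data")
    try:
        # /dir/file, where "dir" hangs off the root (ino 1)
        store_backtrace(ioctx, 0x10000000001,
                        [(0x10000000000, "file"), (0x1, "dir")])
        print(lookup_by_ino(ioctx, 0x10000000001))  # -> /dir/file
    finally:
        ioctx.close()
        cluster.shutdown()

The appealing part is that the write side piggybacks on objects we are
already touching anyway, and a stale backtrace degrades into a slower
re-walk rather than requiring a globally consistent table.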