Re: [PATCH] xfs: reserve fields in inode for parent ptr and alloc policy

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 12 Apr 2013 10:24:38 +1000

On Thu, Apr 11, 2013 at 08:54:20AM -0500, Mark Tinguely wrote:
> On 04/10/13 22:00, Dave Chinner wrote:
> >On Wed, Apr 10, 2013 at 01:24:24PM -0500, Mark Tinguely wrote:
> >>Reserve fields in new inode layout for parent pointer and
> >>allocation policy.
> >
> >Not without a design review - there is no way we are going to
> >blindly reserve space in any on-disk structure without first having
> >a solid design, published proof-of-concept code and agreement that
> >the format changes being proposed are the right way to proceed.
> >
> >Where are the design docs/code for these features?
> 
> Then you might want to bump the pad size.

Not without good reason. Padding is not designed to accomodate new
features that require 16 bytes of core inode space. If you need that
amount of space in the inode, we have EAs for that....

The plan of record for the past 5-6 years has been that parent
pointers require an arbitrary amount of space in the inode so they
need to be entirely EA based. Hence I'm not prepared to agree to an
inode core change for a shiny new parent pointer implementation I
first heard about 2 days ago and know absolutely nothing about.

You need to show everyone the code so we can get some idea of what
benefit putting this information in the inode core will bring us
over a purely EA based implementation.

FWIW, inodes are versioned and it's trivial to bump the inode
version and add feature bit if an inode core format change is
absolutely necessary. From that perspective, I don't see any need to
rush a change at this point in time.

> >>		----
> >>The inode will hold the parent information for the first
> >>link to a file. Information for the other links will be
> >>held in extended attribute entries.
> >>
> >>The "di_parino" is the inode of the parent directory. The
> >>directory information for this entry is located the parent
> >>directory with "di_paroff" offset.
> >>
> >>The di_parino/di_paroff concept code is running.
> >
> >How does it handle hard links? (i.e. multiple parents)
> 
> As I stated, the hard links will use extended attribute entries.

That will add significant complexity of the implementation.

> >Also, inode number is not enough for unique identification of the
> >parent - generation number is also required.
> 
> Per the 2005 XFS features meeting,

I was at those meetings and wasn't allowed to keep a copy of those
notes when I left SGI. If you want to refer to them, you'll need to
post the notes so everyone knows what was discussed. ;)

> the inode and directory offset
> will uniquely describe a link

It doesn't, though. An inode can only be uniquely identified by
the combination of the inode+generation number. The inode number
identifies an inode, but it doesn't identify the lifecycle instance
of the inode that the reference is valid for.

If this parent pointer link is going to be in any way useful for
metadata scrubbing and repair, then it has to identify the instance
of the parent inode that it points to.

Note that I'm not saying that the EA format in the 2009 patch set is
perfect - I think it needs a couple of significant tweaks. What I am
saying is that there is good reasons for the format of information
in that EA.

> - (thank-you to Glen Overby for that
> observation). The concept code just using one link verifies this
> fact. Using the inode/offset as the identifier is compact and also
> gives information to the file name.

Yup, using d_off was also considered, but has problems, too.  The
problem with pointing to the directory offset is that if the
directory is rebuilt for any reason (e.g. by xfs_repair), then the
directory offset in the child inodes is now invalid and needs to
also be fixed.

And then you have the problem of needing multiple parent links for a
single parent directory as you have to instantiate every single hard
link to the same directory. In the current EA patchset, this does
not occur as the parent structure simply counts the number of
references to a given parent directory.

The EA approach just points to the parent directory and keeps a
count of the number of links back to the directory. Hence as long as
repair doesn't reallocate the directory inode, the parent pointer
information remains intact. i.e. There is less dependency between
the objects and so the format is much more robust against unexpected
changes to directory structures.

This concedes the fact that inode number to name resolution needs to
be a userspace problem. This is something we have to conceed because
the presence of namespaces, security models, bind mounts and chroot
environments use top-down path traversal to limit what part of the
filesystem a userspace process can actually see. In-kernel
resolution of reverse path names is a potential vector for
information leakage between namespaces and at it's worst, a provide
a method for escape. Hence there are security concerns about parent
pointers that we have to take into account as well.

The EA approach is also allows more flexibility for potential future
directory features. For example: online defragmentation. If we want defrag
to be able to compact directories into a contiguous address space,
we need to be able to change the offsets of the data entries. We can
do this if there are no open references to the directory at the time
of the defrag. However, if we fix the offsets of directory entries
in the child inode parent pointers, then this form of defrag becomes
impossible to implement as we cannot update the directory and all
the child inodes atomically from a userspace or transactional
POV. IOWs, the parent pointer format containing an exact directory
offset also limits what we can do with directories in future....

> >That uses xattrs to store parent information - inode #, generation
> >#, and a counter - for each parent. It requires a counter because
> >you can have the same inode hard linked into the same parent
> >directory multiple times:
> >
> >typedef struct xfs_parent_eaname_rec {
> >        __be64  p_ino;
> >        __be32  p_gen;
> >        __be32  p_cnt;
> >} xfs_parent_eaname_rec_t;
> >
> >And a transaction appended to the the inode create, link, rename and
> >remove operations to manage the xattrs so all cases involving hard
> >links work just fine.
> >
> >Indeed, the single di_parino/di_paroff concept was one of the
> >original designs considered because of it's simplicity, but it was
> >rejected because of that very simplicity - it cannot handle common
> >use cases involving hardlinks.
> 
> According to Geoffrey's notes, the hybrid approach was discussed
> too. The single link case will save all the EA operations for the
> majority of the inodes.

Sure, those meetings were brain storming about how we *might* solve
a problem, but keep in mind the context of those meetings was that
all the first and second generation XFS gurus had left SGI and in
that vaccuum nobody left at SGI fully understood the problem domains
being discussed.

As such, there were some really crazy ideas discussed and
documented, but that doesn't mean they were the best solution or
even a good idea.  We came up with several possible ways to
implement parent pointers, but the real world sorted them out pretty
quickly after the brain storming sessions.

Indeed, the real world broke the initial implementation of parent
pointers pretty quickly. You may not know the entire history, but
XFS on Irix effectively ended up with a useless, broken parent
pointer implementation because it was implemented quickly based on
assumptions made in those brainstorming seesions meeting.  i.e.
still without fully understanding the problem space. The limitations
the Irix implementation exposed resulted in it being thrown away
completely and redesigned for Linux. IOWs, the EA based patchset is
what evolved from the hard lessons that Irix taught us....

> >Release early, release often?
> 
> No, trust but verify.

Verification can't be done if you don't tell us anything about what
you are doing.  Verification needs to be done at the design/POC
phase just as much as at the final patch review stage, because there
are many parties that have different requirements for the same
feature.

Open source development is supposed to be, well, open. Public. Not
behind closed doors. Design is open, implementation is open,
flamewars aout stuff are out there in public. Nothing is done behind
closed doors.

Saying "we're doing X" and then not telling anyone about it is the
antithesis of OSS. It -excludes- the community from the process of
developing the new features, and prevents them from verifying what
you are doing is suitable for their needs.

IOWs, there is no possibility of "trust, but verify" in OSS without
"release early, release often". You develop trust by being open and
allowing people to review and verify what you are doing at every
stage of development - from POC through to production code.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs