Re: [PATCHSET v10r1d2 00/17] xfs: encode parent pointer name in xattr key

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, Mar 25, 2023 at 8:01 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Sat, Mar 25, 2023 at 10:59:16AM +0300, Amir Goldstein wrote:
> > On Fri, Mar 24, 2023 at 8:19 PM Allison Henderson
> > <allison.henderson@xxxxxxxxxx> wrote:
> > >
> > > On Thu, 2023-03-16 at 12:17 -0700, Darrick J. Wong wrote:
> > > > Hi all,
> > > >
> > > > As I've mentioned in past comments on the parent pointers patchset,
> > > > the
> > > > proposed ondisk parent pointer format presents a major difficulty for
> > > > online directory repair.  This difficulty derives from encoding the
> > > > directory offset of the dirent that the parent pointer is mirroring.
> > > > Recall that parent pointers are stored in extended attributes:
> > > >
> > > >     (parent_ino, parent_gen, diroffset) -> (dirent_name)
> > > >
> > > > If the directory is rebuilt, the offsets of the new directory entries
> > > > must match the diroffset encoded in the parent pointer, or the
> > > > filesystem becomes inconsistent.  There are a few ways to solve this
> > > > problem.
> > > >
> > > > One approach would be to augment the directory addname function to
> > > > take
> > > > a diroffset and try to create the new entry at that offset.  This
> > > > will
> > > > not work if the original directory became corrupt and the parent
> > > > pointers were written out with impossible diroffsets (e.g.
> > > > overlapping).
> > > > Requiring matching diroffsets also prevents reorganization and
> > > > compaction of directories.
> > > >
> > > > This could be remedied by recording the parent pointer diroffset
> > > > updates
> > > > necessary to retain consistency, and using the logged parent pointer
> > > > replace function to rewrite parent pointers as necessary.  This is a
> > > > poor choice from a performance perspective because the logged xattr
> > > > updates must be committed in the same transaction that commits the
> > > > new
> > > > directory structure.  If there are a large number of diroffset
> > > > updates,
> > > > then the directory commit could take an even longer time.
> > > >
> > > > Worse yet, if the logged xattr updates fill up the transaction,
> > > > repair
> > > > will have no choice but to roll to a fresh transaction to continue
> > > > logging.  This breaks repair's policy that repairs should commit
> > > > atomically.  It may break the filesystem as well, since all files
> > > > involved are pinned until the delayed pptr xattr processing
> > > > completes.
> > > > This is a completely bad engineering choice.
> > > >
> > > > Note that the diroffset information is not used anywhere in the
> > > > directory lookup code.  Observe that the only information that we
> > > > require for a parent pointer is the inverse of an pre-ftype dirent,
> > > > since this is all we need to reconstruct a directory entry:
> > > >
> > > >     (parent_ino, dirent_name) -> NULL
> > > >
> > > > The xattr code supports xattrs with zero-length values, surprisingly.
> > > > The parent_gen field makes it easy to export parent handle
> > > > information,
> > > > so it can be retained:
> > > >
> > > >     (parent_ino, parent_gen, dirent_name) -> NULL
> > > >
> > > > Moving the ondisk format to this format is very advantageous for
> > > > repair
> > > > code.  Unfortunately, there is one hitch: xattr names cannot exceed
> > > > 255
> > > > bytes due to ondisk format limitations.  We don't want to constrain
> > > > the
> > > > length of dirent names, so instead we create a special VLOOKUP mode
> > > > for
> > > > extended attributes that allows parent pointers to require matching
> > > > on
> > > > both the name and the value.
> > > >
> > > > The ondisk format of a parent pointer can then become:
> > > >
> > > >     (parent_ino, parent_gen, dirent_name[0:242]) ->
> > > > (dirent_name[243:255])
> >
> > With VLOOKUP in place, why is this better than
> > (parent_ino, parent_gen) -> (dirent_name)
> >
> > Is it because the dabtree hash is calculated only on the key
> > and changing that would be way more intrusive?
>
> Yes.  The dabtree hash is calculated on the full attr name, so this is
> an attempt to reduce hash collisions during parent pointer lookups by
> stuffing as many bytes in the attr name as possible.
>
> It would be very easy to change the xfs_da_hashname calls inside
> xfs_parent.c to use either the full dirent name (i.e. pptr hash matches
> dirent hash) or use the full parent pointer and not just the attr key,
> but that would be a major break from the tradition that the da hash is
> computed against the attr name only.
>
> (I'm not opposed to doing that too, but one breaking change at a time.
> ;))
>
> > Does that mean that user can create up to 2^12
> > parent pointers with the same hash and we are fine with it?
>
> Yes.  The dabtree can handle an xattr structure where every name hashes
> to the same value, though performance will be slow as it scans every
> attr to find the one it wants.
>
> The number of parent pointers with the same hash can be much higher
> than 2^12 -- I wrote a dumb C program that creates an arbitrary number
> of attr names with identical hashes.  It's not fast when you get past
> 100,000. :P
>

Right.
So how about
(parent_ino, parent_gen, dirent_hash) -> (dirent_name)

This is not a breaking change and you won't need to do another
breaking change later.

This could also be internal to VLOOKUP: appended vhash to
attr name and limit VLOOKUP name size to  255 - vhashsize.

Thanks,
Amir.




[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux