Re: [PATCHSET v10r1d2 00/17] xfs: encode parent pointer name in xattr key

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Mar 28, 2023 at 4:29 AM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Sun, Mar 26, 2023 at 06:21:17AM +0300, Amir Goldstein wrote:
> > On Sat, Mar 25, 2023 at 8:01 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> > >
> > > On Sat, Mar 25, 2023 at 10:59:16AM +0300, Amir Goldstein wrote:
> > > > On Fri, Mar 24, 2023 at 8:19 PM Allison Henderson
> > > > <allison.henderson@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, 2023-03-16 at 12:17 -0700, Darrick J. Wong wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > As I've mentioned in past comments on the parent pointers patchset,
> > > > > > the
> > > > > > proposed ondisk parent pointer format presents a major difficulty for
> > > > > > online directory repair.  This difficulty derives from encoding the
> > > > > > directory offset of the dirent that the parent pointer is mirroring.
> > > > > > Recall that parent pointers are stored in extended attributes:
> > > > > >
> > > > > >     (parent_ino, parent_gen, diroffset) -> (dirent_name)
> > > > > >
> > > > > > If the directory is rebuilt, the offsets of the new directory entries
> > > > > > must match the diroffset encoded in the parent pointer, or the
> > > > > > filesystem becomes inconsistent.  There are a few ways to solve this
> > > > > > problem.
> > > > > >
> > > > > > One approach would be to augment the directory addname function to
> > > > > > take
> > > > > > a diroffset and try to create the new entry at that offset.  This
> > > > > > will
> > > > > > not work if the original directory became corrupt and the parent
> > > > > > pointers were written out with impossible diroffsets (e.g.
> > > > > > overlapping).
> > > > > > Requiring matching diroffsets also prevents reorganization and
> > > > > > compaction of directories.
> > > > > >
> > > > > > This could be remedied by recording the parent pointer diroffset
> > > > > > updates
> > > > > > necessary to retain consistency, and using the logged parent pointer
> > > > > > replace function to rewrite parent pointers as necessary.  This is a
> > > > > > poor choice from a performance perspective because the logged xattr
> > > > > > updates must be committed in the same transaction that commits the
> > > > > > new
> > > > > > directory structure.  If there are a large number of diroffset
> > > > > > updates,
> > > > > > then the directory commit could take an even longer time.
> > > > > >
> > > > > > Worse yet, if the logged xattr updates fill up the transaction,
> > > > > > repair
> > > > > > will have no choice but to roll to a fresh transaction to continue
> > > > > > logging.  This breaks repair's policy that repairs should commit
> > > > > > atomically.  It may break the filesystem as well, since all files
> > > > > > involved are pinned until the delayed pptr xattr processing
> > > > > > completes.
> > > > > > This is a completely bad engineering choice.
> > > > > >
> > > > > > Note that the diroffset information is not used anywhere in the
> > > > > > directory lookup code.  Observe that the only information that we
> > > > > > require for a parent pointer is the inverse of an pre-ftype dirent,
> > > > > > since this is all we need to reconstruct a directory entry:
> > > > > >
> > > > > >     (parent_ino, dirent_name) -> NULL
> > > > > >
> > > > > > The xattr code supports xattrs with zero-length values, surprisingly.
> > > > > > The parent_gen field makes it easy to export parent handle
> > > > > > information,
> > > > > > so it can be retained:
> > > > > >
> > > > > >     (parent_ino, parent_gen, dirent_name) -> NULL
> > > > > >
> > > > > > Moving the ondisk format to this format is very advantageous for
> > > > > > repair
> > > > > > code.  Unfortunately, there is one hitch: xattr names cannot exceed
> > > > > > 255
> > > > > > bytes due to ondisk format limitations.  We don't want to constrain
> > > > > > the
> > > > > > length of dirent names, so instead we create a special VLOOKUP mode
> > > > > > for
> > > > > > extended attributes that allows parent pointers to require matching
> > > > > > on
> > > > > > both the name and the value.
> > > > > >
> > > > > > The ondisk format of a parent pointer can then become:
> > > > > >
> > > > > >     (parent_ino, parent_gen, dirent_name[0:242]) ->
> > > > > > (dirent_name[243:255])
> > > >
> > > > With VLOOKUP in place, why is this better than
> > > > (parent_ino, parent_gen) -> (dirent_name)
> > > >
> > > > Is it because the dabtree hash is calculated only on the key
> > > > and changing that would be way more intrusive?
> > >
> > > Yes.  The dabtree hash is calculated on the full attr name, so this is
> > > an attempt to reduce hash collisions during parent pointer lookups by
> > > stuffing as many bytes in the attr name as possible.
> > >
> > > It would be very easy to change the xfs_da_hashname calls inside
> > > xfs_parent.c to use either the full dirent name (i.e. pptr hash matches
> > > dirent hash) or use the full parent pointer and not just the attr key,
> > > but that would be a major break from the tradition that the da hash is
> > > computed against the attr name only.
> > >
> > > (I'm not opposed to doing that too, but one breaking change at a time.
> > > ;))
> > >
> > > > Does that mean that user can create up to 2^12
> > > > parent pointers with the same hash and we are fine with it?
> > >
> > > Yes.  The dabtree can handle an xattr structure where every name hashes
> > > to the same value, though performance will be slow as it scans every
> > > attr to find the one it wants.
> > >
> > > The number of parent pointers with the same hash can be much higher
> > > than 2^12 -- I wrote a dumb C program that creates an arbitrary number
> > > of attr names with identical hashes.  It's not fast when you get past
> > > 100,000. :P
> > >
> >
> > Right.
> > So how about
> > (parent_ino, parent_gen, dirent_hash) -> (dirent_name)
> >
> > This is not a breaking change and you won't need to do another
> > breaking change later.
> >
> > This could also be internal to VLOOKUP: appended vhash to
> > attr name and limit VLOOKUP name size to  255 - vhashsize.
>
> The original "put the hash in the xattr name" patches did that, but I
> discarded that approach because it increases the size of each parent
> pointer by 4 bytes, and was really easy to make a verrrry slow
> filesystem:
>
> I wrote an xfs_io command to compute lists of names with the same
> dirhash value.  If I created a giant directory with the same file
> hardlinked millions of times where all those dirent names all hash to
> the same value, lookups in the directory gets really slow because the
> dabtree code has to walk (on average) half a million dirents to find the
> matching one.
>
> There were also millions of parent pointer xattrs, all with the same
> attr name and hence the same hash value here too.  Doing that made the
> performance totally awful.  Changing the hash to crc32c and then sha512
> made it much harder to induce collision slowdowns on both ends like
> that, though sha512 added a noticeable performance hit for what it was
> preventing.
>

OK, but this attack is applicable to the dabtree hash itself only with
the proposed dirent_name[0:242] key, is it not?

> Hopefully the fact that the attr name starts with 12 bytes (4 of which
> aren't known to unprivileged userspace) and omits the last 12 bytes of
> the dirent name will make it harder to generate double-collisions.
> Granted there's probably someone who's more math-smart than me who can
> figure this out.

My concern is that generating 7^12 collisions using all the possible
printable suffix dirent_name[243:255] is so simple that even a non-malicious
but rather unlucky script would end up hitting those collisions:
Report_$ORGANIZATION_$DEPARTEMT_$PRODUCT_$PROJECT_...
$GROUP_$OWNER_$DATE_$TIME_$ID.pdf
You get what I mean...

So my thinking was that the unlucky should not suffer the same penalty
as the malicious.

Thanks,
Amir.




[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux