Re: [PATCHSET v10r1d2 00/17] xfs: encode parent pointer name in xattr key

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Mar 28, 2023 at 05:46:33PM -0700, Darrick J. Wong wrote:
> On Wed, Mar 29, 2023 at 11:19:00AM +1100, Dave Chinner wrote:
> > On Tue, Mar 28, 2023 at 04:54:49PM -0700, Darrick J. Wong wrote:
> > > On Wed, Mar 29, 2023 at 09:29:59AM +1100, Dave Chinner wrote:
> > > > On Mon, Mar 27, 2023 at 06:29:32PM -0700, Darrick J. Wong wrote:
> > > > > On Sun, Mar 26, 2023 at 06:21:17AM +0300, Amir Goldstein wrote:
> > > > > > On Sat, Mar 25, 2023 at 8:01 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
> > > > > > > On Sat, Mar 25, 2023 at 10:59:16AM +0300, Amir Goldstein wrote:
> > > > > > > > On Fri, Mar 24, 2023 at 8:19 PM Allison Henderson
> > > > > > > > <allison.henderson@xxxxxxxxxx> wrote:
> > > > > > Right.
> > > > > > So how about
> > > > > > (parent_ino, parent_gen, dirent_hash) -> (dirent_name)
> > > > > > 
> > > > > > This is not a breaking change and you won't need to do another
> > > > > > breaking change later.
> > > > > > 
> > > > > > This could also be internal to VLOOKUP: appended vhash to
> > > > > > attr name and limit VLOOKUP name size to  255 - vhashsize.
> > > > > 
> > > > > The original "put the hash in the xattr name" patches did that, but I
> > > > > discarded that approach because it increases the size of each parent
> > > > > pointer by 4 bytes, and was really easy to make a verrrry slow
> > > > > filesystem:
> > > > 
> > > > Right, that's because hash collision resolution is very dumb. It's
> > > > just a linear walk from start to end comparing the names of each
> > > > coliding entry.
> > > > 
> > > > > I wrote an xfs_io command to compute lists of names with the same
> > > > > dirhash value.  If I created a giant directory with the same file
> > > > > hardlinked millions of times where all those dirent names all hash to
> > > > > the same value, lookups in the directory gets really slow because the
> > > > > dabtree code has to walk (on average) half a million dirents to find the
> > > > > matching one.
> > > > 
> > > > That's worst case behaviour, not real-world production behaviour.
> > > > It's also a dabtree optimisation problem, not a parent pointer
> > > > problem. i.e. if this is a real world issue, we need to fix the
> > > > dabtree indexing, not work around it in the parent pointer format.
> > > > 
> > > > IOWs, optimising the parent pointer structure for the *rare* corner
> > > > case of an intentionally crafted dahash btree collision farm is, to
> > > > me, the wrong way to design a structure.
> > > > 
> > > > Optimising it for the common fast path (single parent xattrs,
> > > > followed by small numbers of parents) is what we should be doing,
> > > > because that situation will occur an uncountable number of times
> > > > more frequently than this theorectical worst case behaviour. i.e.
> > > > the worst case will, in all likelihood, never happen in production
> > > > environments - behaviour in this circumstance is only a
> > > > consideration for mitigating malicious attacks.
> > > > 
> > > > And, quite frankly, nobody is going to bother attacking a filesystem
> > > > this way - there is literally nothing that can be gained from
> > > > slowing down access to a single directory in this manner. It's a DOS
> > > > at best, but there are *much* easier ways of doing that and it
> > > > doesn't require parent pointers or dahash collisions at all.
> > > > 
> > > > 
> > > > > 
> > > > > There were also millions of parent pointer xattrs, all with the same
> > > > > attr name and hence the same hash value here too.  Doing that made the
> > > > > performance totally awful.  Changing the hash to crc32c and then sha512
> > > > > made it much harder to induce collision slowdowns on both ends like
> > > > > that, though sha512 added a noticeable performance hit for what it was
> > > > > preventing.
> > > > 
> > > > So why not change the dahash for parent pointer enabled filesystems
> > > > to use crc32c everywhere?
> > > 
> > > That's not difficult to do, but it'll break name obscuring in metadump
> > > unless you know of a quick way to derive B from A and maintain crc32c(A)
> > > == crc32c(B).

Yeah.  So I've sort of figured out how to spoof crc32c in much the same
way that obfuscate_name() does for the dahash -- fill a buffer with
replacement characters, figure out which bits to flip to end up with the
same crc at the end.  I quickly found Mark Adler's spoof program which
seems to work more reliably.

But after a bit of performance comparison, I realized that constructing
namehash collisions with same-length strings is about as computationally
easy with crc32c as it is with the dahash.  This isn't any better than
what we have now.  I ran a dumb benchmark of the kernel crc32c
implementation vs. dahash and noticed that for the sizes we care about
(i.e. 255 bytes or less) the old dahash is only slightly slower than
hardware accelerated crc32c.

I then noticed that obfuscate_name() totally does the wrong thing if
ascii-ci is enabled, and indeed, trying to metadump such a filesystem
causes metadump to throw errors all over the place.

In short: doing this doesn't buy us anything, but now someone has to
implement a robust obfuscate_name() for crc32c and we have more software
to maintain.

> > Not easily, but....
> > 
> > > Granted I had mostly written off name obscuring in metadump on parent
> > > pointer filesystems anyway.
> > 
> > ... so had I. The name obfuscation needs a completely different
> > approach for PP enabled filesystems because of the name also being
> > embedded in non-user visible structures and potentially changing
> > attr tree dahash-based indexing.
> 
> <nod> I've wondered if we just have to record (dir_ino/name) ->
> (newname) and use that to update the parent pointer and any dabtree
> entries we find.  That's going to be a bit of a mess to iron out since
> we can't change the "mounted" filesystem.

For now I've fixed this by implementing a gigantic name remapping table
so that we remember (dir_ino, old_name, new_name) tuples.  When we
encounter the equivalent parent pointer, we can use the exact same
obfuscated name, which means that xfs_repair on a restored obscured
metadump works again.

While I was at it, I also implemented a 'parent' command for xfs_db that
dumps the parents of any inode, similar to the existing xfs_io command.

> > > > I'd much prefer we have a parent pointer index key that is fixed
> > > > length, has constant hash time and a dabtree hash that is far more
> > > > resilient to collision farming than to have to perform
> > > > encoding/decoding of varaible length values into the key to work
> > > > around a collision-rich hashing algorithm.
> > > 
> > > Wellllll I was 10 minutes away from sending v11 with all of my changes
> > > integrated, but I guess now I'll go rework the dabtree to use crc32c,
> > > encode the crc32c in the parent pointer attr name, and get rid of all
> > > the variable length value decoding.  Now that I just got rid of all the
> > > intermediate patches with hash-in-attr-name.
> > 
> > I apologise for not being able to comment on this earlier. What time
> > I've had available over the past coule of weeks has been focussed on
> > the parts of the repair patchset that are ready for merge and I've
> > had little time to follow the progress of the parts you are
> > currently developing.
> 
> <nod> Sorry, that was my kneejerk reaction to "let's go reopen this
> thing and go back to where you were a week ago".  I appreciate yuor
> help getting the online repair patchset ready for merge.
> 
> In the end it took about half an hour to put everything back, so I'll go
> test that overnight and let's see where we end up tomorrow.

Now that I've had a day to go sift through all of these changes, I've
realized that you're right -- putting the ondisk format back to:

   (parent_ino, parent_gen, namehash) -> (dirent name)

Simplifes the parent pointer ondisk conversion and runtime code quite a
bit.  It also makes it easy to implement obfuscated metadumps.  So I'll
go run v11 through fstestscloud with all of today's changes integrated,
and we'll see about getting something out in a few days.

Again, sorry for being a jerk.

--D

> --D
> 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@xxxxxxxxxxxxx



[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux