On Sat, Mar 25, 2023 at 8:01 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > On Sat, Mar 25, 2023 at 10:59:16AM +0300, Amir Goldstein wrote: > > On Fri, Mar 24, 2023 at 8:19 PM Allison Henderson > > <allison.henderson@xxxxxxxxxx> wrote: > > > > > > On Thu, 2023-03-16 at 12:17 -0700, Darrick J. Wong wrote: > > > > Hi all, > > > > > > > > As I've mentioned in past comments on the parent pointers patchset, > > > > the > > > > proposed ondisk parent pointer format presents a major difficulty for > > > > online directory repair. This difficulty derives from encoding the > > > > directory offset of the dirent that the parent pointer is mirroring. > > > > Recall that parent pointers are stored in extended attributes: > > > > > > > > (parent_ino, parent_gen, diroffset) -> (dirent_name) > > > > > > > > If the directory is rebuilt, the offsets of the new directory entries > > > > must match the diroffset encoded in the parent pointer, or the > > > > filesystem becomes inconsistent. There are a few ways to solve this > > > > problem. > > > > > > > > One approach would be to augment the directory addname function to > > > > take > > > > a diroffset and try to create the new entry at that offset. This > > > > will > > > > not work if the original directory became corrupt and the parent > > > > pointers were written out with impossible diroffsets (e.g. > > > > overlapping). > > > > Requiring matching diroffsets also prevents reorganization and > > > > compaction of directories. > > > > > > > > This could be remedied by recording the parent pointer diroffset > > > > updates > > > > necessary to retain consistency, and using the logged parent pointer > > > > replace function to rewrite parent pointers as necessary. This is a > > > > poor choice from a performance perspective because the logged xattr > > > > updates must be committed in the same transaction that commits the > > > > new > > > > directory structure. If there are a large number of diroffset > > > > updates, > > > > then the directory commit could take an even longer time. > > > > > > > > Worse yet, if the logged xattr updates fill up the transaction, > > > > repair > > > > will have no choice but to roll to a fresh transaction to continue > > > > logging. This breaks repair's policy that repairs should commit > > > > atomically. It may break the filesystem as well, since all files > > > > involved are pinned until the delayed pptr xattr processing > > > > completes. > > > > This is a completely bad engineering choice. > > > > > > > > Note that the diroffset information is not used anywhere in the > > > > directory lookup code. Observe that the only information that we > > > > require for a parent pointer is the inverse of an pre-ftype dirent, > > > > since this is all we need to reconstruct a directory entry: > > > > > > > > (parent_ino, dirent_name) -> NULL > > > > > > > > The xattr code supports xattrs with zero-length values, surprisingly. > > > > The parent_gen field makes it easy to export parent handle > > > > information, > > > > so it can be retained: > > > > > > > > (parent_ino, parent_gen, dirent_name) -> NULL > > > > > > > > Moving the ondisk format to this format is very advantageous for > > > > repair > > > > code. Unfortunately, there is one hitch: xattr names cannot exceed > > > > 255 > > > > bytes due to ondisk format limitations. We don't want to constrain > > > > the > > > > length of dirent names, so instead we create a special VLOOKUP mode > > > > for > > > > extended attributes that allows parent pointers to require matching > > > > on > > > > both the name and the value. > > > > > > > > The ondisk format of a parent pointer can then become: > > > > > > > > (parent_ino, parent_gen, dirent_name[0:242]) -> > > > > (dirent_name[243:255]) > > > > With VLOOKUP in place, why is this better than > > (parent_ino, parent_gen) -> (dirent_name) > > > > Is it because the dabtree hash is calculated only on the key > > and changing that would be way more intrusive? > > Yes. The dabtree hash is calculated on the full attr name, so this is > an attempt to reduce hash collisions during parent pointer lookups by > stuffing as many bytes in the attr name as possible. > > It would be very easy to change the xfs_da_hashname calls inside > xfs_parent.c to use either the full dirent name (i.e. pptr hash matches > dirent hash) or use the full parent pointer and not just the attr key, > but that would be a major break from the tradition that the da hash is > computed against the attr name only. > > (I'm not opposed to doing that too, but one breaking change at a time. > ;)) > > > Does that mean that user can create up to 2^12 > > parent pointers with the same hash and we are fine with it? > > Yes. The dabtree can handle an xattr structure where every name hashes > to the same value, though performance will be slow as it scans every > attr to find the one it wants. > > The number of parent pointers with the same hash can be much higher > than 2^12 -- I wrote a dumb C program that creates an arbitrary number > of attr names with identical hashes. It's not fast when you get past > 100,000. :P > Right. So how about (parent_ino, parent_gen, dirent_hash) -> (dirent_name) This is not a breaking change and you won't need to do another breaking change later. This could also be internal to VLOOKUP: appended vhash to attr name and limit VLOOKUP name size to 255 - vhashsize. Thanks, Amir.