On Fri, Mar 24, 2023 at 8:19 PM Allison Henderson <allison.henderson@xxxxxxxxxx> wrote: > > On Thu, 2023-03-16 at 12:17 -0700, Darrick J. Wong wrote: > > Hi all, > > > > As I've mentioned in past comments on the parent pointers patchset, > > the > > proposed ondisk parent pointer format presents a major difficulty for > > online directory repair. This difficulty derives from encoding the > > directory offset of the dirent that the parent pointer is mirroring. > > Recall that parent pointers are stored in extended attributes: > > > > (parent_ino, parent_gen, diroffset) -> (dirent_name) > > > > If the directory is rebuilt, the offsets of the new directory entries > > must match the diroffset encoded in the parent pointer, or the > > filesystem becomes inconsistent. There are a few ways to solve this > > problem. > > > > One approach would be to augment the directory addname function to > > take > > a diroffset and try to create the new entry at that offset. This > > will > > not work if the original directory became corrupt and the parent > > pointers were written out with impossible diroffsets (e.g. > > overlapping). > > Requiring matching diroffsets also prevents reorganization and > > compaction of directories. > > > > This could be remedied by recording the parent pointer diroffset > > updates > > necessary to retain consistency, and using the logged parent pointer > > replace function to rewrite parent pointers as necessary. This is a > > poor choice from a performance perspective because the logged xattr > > updates must be committed in the same transaction that commits the > > new > > directory structure. If there are a large number of diroffset > > updates, > > then the directory commit could take an even longer time. > > > > Worse yet, if the logged xattr updates fill up the transaction, > > repair > > will have no choice but to roll to a fresh transaction to continue > > logging. This breaks repair's policy that repairs should commit > > atomically. It may break the filesystem as well, since all files > > involved are pinned until the delayed pptr xattr processing > > completes. > > This is a completely bad engineering choice. > > > > Note that the diroffset information is not used anywhere in the > > directory lookup code. Observe that the only information that we > > require for a parent pointer is the inverse of an pre-ftype dirent, > > since this is all we need to reconstruct a directory entry: > > > > (parent_ino, dirent_name) -> NULL > > > > The xattr code supports xattrs with zero-length values, surprisingly. > > The parent_gen field makes it easy to export parent handle > > information, > > so it can be retained: > > > > (parent_ino, parent_gen, dirent_name) -> NULL > > > > Moving the ondisk format to this format is very advantageous for > > repair > > code. Unfortunately, there is one hitch: xattr names cannot exceed > > 255 > > bytes due to ondisk format limitations. We don't want to constrain > > the > > length of dirent names, so instead we create a special VLOOKUP mode > > for > > extended attributes that allows parent pointers to require matching > > on > > both the name and the value. > > > > The ondisk format of a parent pointer can then become: > > > > (parent_ino, parent_gen, dirent_name[0:242]) -> > > (dirent_name[243:255]) With VLOOKUP in place, why is this better than (parent_ino, parent_gen) -> (dirent_name) Is it because the dabtree hash is calculated only on the key and changing that would be way more intrusive? Does that mean that user can create up to 2^12 parent pointers with the same hash and we are fine with it? I don't think it is a problem, just wanted to understand and ask that the reason for this part in the format be explained. Thanks, Amir.