On Thu, Feb 16, 2023 at 10:33 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > Hi all, > > As I've mentioned in past comments on the parent pointers patchset, the > proposed ondisk parent pointer format presents a major difficulty for > online directory repair. This difficulty derives from encoding the > directory offset of the dirent that the parent pointer is mirroring. > Recall that parent pointers are stored in extended attributes: > > (parent_ino, parent_gen, diroffset) -> (dirent_name) > > If the directory is rebuilt, the offsets of the new directory entries > must match the diroffset encoded in the parent pointer, or the > filesystem becomes inconsistent. There are a few ways to solve this > problem. > > One approach would be to augment the directory addname function to take > a diroffset and try to create the new entry at that offset. This will > not work if the original directory became corrupt and the parent > pointers were written out with impossible diroffsets (e.g. overlapping). > Requiring matching diroffsets also prevents reorganization and > compaction of directories. > > This could be remedied by recording the parent pointer diroffset updates > necessary to retain consistency, and using the logged parent pointer > replace function to rewrite parent pointers as necessary. This is a > poor choice from a performance perspective because the logged xattr > updates must be committed in the same transaction that commits the new > directory structure. If there are a large number of diroffset updates, > then the directory commit could take an even longer time. > > Worse yet, if the logged xattr updates fill up the transaction, repair > will have no choice but to roll to a fresh transaction to continue > logging. This breaks repair's policy that repairs should commit > atomically. It may break the filesystem as well, since all files > involved are pinned until the delayed pptr xattr processing completes. > This is a completely bad engineering choice. > > Note that the diroffset information is not used anywhere in the > directory lookup code. Observe that the only information that we > require for a parent pointer is the inverse of an pre-ftype dirent, > since this is all we need to reconstruct a directory entry: > > (parent_ino, dirent_name) -> NULL > > The xattr code supports xattrs with zero-length values, surprisingly. > The parent_gen field makes it easy to export parent handle information, > so it can be retained: > > (parent_ino, parent_gen, dirent_name) -> NULL > > Moving the ondisk format to this format is very advantageous for repair > code. Unfortunately, there is one hitch: xattr names cannot exceed 255 > bytes due to ondisk format limitations. We don't want to constrain the > length of dirent names, so instead we could use collision resistant > hashes to handle dirents with very long names: > > (parent_ino, parent_gen, sha512(dirent_name)) -> (dirent_name) > > The first two patches implement this schema. However, this encoding is > not maximally efficient, since many directory names are shorter than the > length of a sha512 hash. The last three patches in the series bifurcate > the parent pointer ondisk format depending on context: > > For dirent names shorter than 243 bytes: > > (parent_ino, parent_gen, dirent_name) -> NULL > > For dirent names longer than 243 bytes: > > (parent_ino, parent_gen, dirent_name[0:178], > sha512(child_gen, dirent_name)) -> (dirent_name[179:255]) > > The child file's generation number is mixed into the sha512 computation > to make it a little more difficult for unprivileged userspace to attempt > collisions. > Naive question: Obviously, the spec of stradard xattrs does not allow duplicate keys, but dabtree does allow duplicate keys, does it not? So if you were to allow duplicate parent pointer records, e.g.: (parent_ino, parent_gen) -> dirent_name1 (parent_ino, parent_gen) -> dirent_name2 Or to optimize performance for the case of large number of sibling hardlinks of the same parent (if that case is worth optimizing): (parent_ino, parent_gen, dirent_name[0:178]) -> (dirent_name1[179:255]) (parent_ino, parent_gen, dirent_name[0:178]) -> (dirent_name2[179:255]) Then pptr code should have no problem walking all the matching parent pointer records to find the unique parent->child record that it needs to operate on? I am sure it would be more complicated than how I depicted it, but having to deal with even remote possibility of hash collisions sounds like a massive headache - having to maintain code that is really hard to test and rarely exercised is not a recipe for peace of mind... Thanks, Amir.