Re: [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go

Elijah Newren <newren@xxxxxxxxx> · Tue, 24 Nov 2020 12:19:35 -0800

On Thu, Nov 12, 2020 at 2:30 PM Elijah Newren <newren@xxxxxxxxx> wrote:
>
> On Thu, Nov 12, 2020 at 12:15 PM Jonathan Tan <jonathantanmy@xxxxxxxxxx> wrote:
> >
> > > +     /*
> > > +      * Now we've got an OID for last_directory in dir_info.  We need to
> > > +      * add it to info->versions for it to be part of the computation of
> > > +      * its parent directories' OID.  But first, we have to find out what
> > > +      * its' parent name was and whether that matches the previous
> > > +      * info->offsets or we need to set up a new one.
> > > +      */
> > > +     prev_dir = info->offsets.nr == 0 ? NULL :
> > > +                info->offsets.items[info->offsets.nr-1].string;
> > > +     if (new_directory_name != prev_dir) {
> > > +             uintptr_t c = info->versions.nr;
> > > +             string_list_append(&info->offsets,
> > > +                                new_directory_name)->util = (void*)c;
> > > +     }
> >
> > Because of the possible jump of 2 or more directories that I mentioned
> > earlier, there may be gaps in the offsets. So it makes sense that we
> > sometimes need to insert an intermediate one.
> >
> > I wonder if the code would be clearer if we had explicit "begin tree"
> > and "end tree" steps just like in list-objects-filter.c (LOFS_BEGIN_TREE
> > and LOFS_END_TREE). Here we have "end tree" (because of the way the
> > entries were sorted) but not "begin tree". If we had "begin tree", we
> > probably would be able to create the necessary offsets in a loop at that
> > stage, and the reasoning about the contents of the offsets would not be
> > so complicated.
> >
> > If we really only want one side (i.e. you don't want to introduce a
> > synthetic entry just to mark the end or the beginning), then my personal
> > experience is that having the "begin" side is easier to understand, as
> > the state is more natural and easier to reason about. (Unlike here,
> > where there could be gaps in the offsets and the reader has to
> > understand that the gaps will be filled just in time.) But that may just
> > be my own experience.
>
> Interesting, I'll take a look into it.
>

So, I've been going through making all the changes you and Derrick
suggested or highlighted...but I don't see how to tackle this one.
Perhaps I'm missing something.

Using your example of LOFS_BEGIN_TREE and LOFS_END_TREE from
list-objects-filter.c, I note that you handle it as part of
traverse_trees(), and thus you have a very natural "I'm going to
process this tree" point and "I'm done processing this tree" point.
There is no equivalent mapping to merge-ort that I can figure out.

merge-ort does use traverse_trees() in collect_merge_info(), and fills
opt->priv->paths with all full pathnames (both files and directories)
found in any of the three trees.  But I cannot process
files/directories at that time; rename detection needs
traverse_trees() to be finished to have all paths so far.  Further,
the list of pathnames from traverse_trees is not necessarily complete;
additional paths could be added by any of
  * Directory/file conflicts (need to move the file to a different location)
  * Directory/submodule conflicts (need to move something to a
different location)
  * Add/add conflicts of files of different types (e.g.
symlink/regular file; can't just content merge them with conflict
markers)
  * Directory rename detection (can move new files or even directories
on one side of history into a new directory on other side)

Thus, after traverse_trees() ends, my rename detection stuff can add
paths (including new directories), then process_entries() can add
paths -- and remove some when the resolution is to delete.  And the
code here in question runs as part of the process_entries() loop.

Now, we'd still be able to create synthetic BEGIN_TREE markers if we
operated in lexicographic ordering, but process_entries() *must*
operate in _reverse_ lexicographic ordering because:
  1) subtrees need to be written out before trees are; hashes of those
subtrees are used in the parent tree
  2) it's the only sane way to handle directory/file conflicts; I need
to know if all entries under the directory resolve to nothing; if not,
the directory is still in the way when it comes time to process the
file.

Granted, I could do some tricky calculations based on the reverse
lexicographic ordering of fullpaths (and their containing directories)
to figure out where trees begin and end -- but that takes us to
exactly what I *did* do.  It was precisely this step that you thought
should be made simpler, but I'm not seeing how to avoid it.

For now, I'll keep the code as-is, but add more comments to both the
data structure and the code.  If I've missed something about how I
could make use of your BEGIN_TREE idea, let me know and I'll look at
it again.