Re: [PATCH 08/12] merge-ort: provide a merge_get_conflicted_files() helper function

Elijah Newren <newren@xxxxxxxxx> · Fri, 28 Jan 2022 22:08:05 -0800

On Fri, Jan 28, 2022 at 8:55 AM Johannes Schindelin
<Johannes.Schindelin@xxxxxx> wrote:
>
> Hi Elijah,
>
> On Sat, 22 Jan 2022, Elijah Newren via GitGitGadget wrote:
>
> > From: Elijah Newren <newren@xxxxxxxxx>
> >
> > After a merge, this function allows the user to extract the same
> > information that would be printed by `ls-files -u` -- conflicted
> > files with their mode, oid, and stage.
>
> Hmm. Okay.
>
> I am not really a fan of that output where we use a variable
> number of lines per file. As in: in the regular case, where a file was
> modified in divergent ways, we get three lines. But if it was deleted in
> one branch, or if it was created in both branches, we have only two lines.
>
> Frankly, I'd much rather have 3 lines for each and every conflict.

Out of curiosity, does this also mean you feel like `git ls-files -u`
is mis-designed (even if we're stuck with it for backward
compatibility reasons)?

Anyway, if we make this change guaranteeing there are 3 lines for each
and every conflict, that probably implies we can remove the stage
field (just letting it be implicit), right?  And it also implies that
the "path" field of the lines is now duplicate information.  Should
the path be printed on each line regardless of the duplication?
Should the path be printed separately on its own line, followed by the
three lines of (mode, oid) pairs?  Or should the format be something
else entirely?

> Granted, we currently only have three stages...

I thought that was true too, but Junio informed me otherwise[1].
Granted, ort does not currently make use of that, but it could.  Not
sure I want to rule it out.

[1] https://lore.kernel.org/git/xmqqr1t89gi8.fsf@xxxxxxxxxxxxxxxxxxxxxx/

> ...and we can pretty much
> guarantee that at least two of these stages are non-empty (otherwise where
> would be the conflict?).

No, that's not true.  See the paragraph about "conflict types" from
the final patch in this series, where I mentioned three different
types of conflicts that can result in a path having a single higher
order stage.

> Meaning: Even if stage 3 is missing from the first conflict and stage 1 is
> missing from the second conflict, in the output we would see stages 1, 2,
> 2, 3, i.e. a duplicate stage 2, signifying that we're talking about two
> different conflicts.

I don't understand why you're fixating on the stage here.  Why would
you want to group all the stage 2s together, count them up, and then
determine there are N conflicting files because there are N stage 2's?
 In such a design, you'd have to do the extra post-processing to
recognize the all-zero modes and hashes and toss them.

To me, that seems like more work than just handling the `ls-files -u`
style of output, and doesn't have the advantage of showing things in a
format users may already be familiar with (see above about dropping
stages and maybe also pathname).  Since the `ls-files -u` output is
simply several lines of:
   <mode> <hash> <stage> <filename>
You can get the number of conflicted files by counting the number of
*unique* filenames.  If you want to see which lines are about the same
file, get all the lines with the same filename -- they are sorted by
(<filename>, <stage>) for convenience, much like `ls-files -u` is.

On a different angle, I'm also slightly worried there's an embedded
assumption here, one that I tried to address in the "Mistake to avoid"
section I added in my last patch of this series.  What if you have a
conflicting merge and 0 lines of output in the conflicting info
section?  Is there something about the reason you're asking for three
lines of output per conflicted file which is going to make you
overlook that particular possibility and not handle it?

> But still. It makes me uneasy to have that much variability in
> machine-consumable output.
>
> And who knows, maybe if we're ever implementing more sophisticated merge
> strategies for octopus merges, we might end up with more stages...

Um, if we need to handle more stages, wouldn't that undercut your
request for exactly 3 stages?  Wouldn't it suggest that we would
indeed want to have a flexible number of lines?

> In other words...
>
> > Signed-off-by: Elijah Newren <newren@xxxxxxxxx>
> > ---
> >  merge-ort.c | 31 +++++++++++++++++++++++++++++++
> >  merge-ort.h | 21 +++++++++++++++++++++
> >  2 files changed, 52 insertions(+)
> >
> > diff --git a/merge-ort.c b/merge-ort.c
> > index b78dde55ad9..5e7cea6cc8f 100644
> > --- a/merge-ort.c
> > +++ b/merge-ort.c
> > @@ -4275,6 +4275,37 @@ void merge_display_update_messages(struct merge_options *opt,
> >       trace2_region_leave("merge", "display messages", opt->repo);
> >  }
> >
> > +void merge_get_conflicted_files(struct merge_result *result,
> > +                             struct string_list *conflicted_files)
> > +{
> > +     struct hashmap_iter iter;
> > +     struct strmap_entry *e;
> > +     struct merge_options_internal *opti = result->priv;
> > +
> > +     strmap_for_each_entry(&opti->conflicted, &iter, e) {
> > +             const char *path = e->key;
> > +             struct conflict_info *ci = e->value;
> > +             int i;
> > +
> > +             VERIFY_CI(ci);
> > +
> > +             for (i = MERGE_BASE; i <= MERGE_SIDE2; i++) {
> > +                     struct stage_info *si;
> > +
> > +                     if (!(ci->filemask & (1ul << i)))
> > +                             continue;
> > +
> > +                     si = xmalloc(sizeof(*si));
> > +                     si->stage = i+1;
> > +                     si->mode = ci->stages[i].mode;
> > +                     oidcpy(&si->oid, &ci->stages[i].oid);
> > +                     string_list_append(conflicted_files, path)->util = si;
> > +             }
> > +     }
> > +     /* string_list_sort() uses a stable sort, so we're good */
> > +     string_list_sort(conflicted_files);
> > +}
> > +
> >  void merge_switch_to_result(struct merge_options *opt,
> >                           struct tree *head,
> >                           struct merge_result *result,
> > diff --git a/merge-ort.h b/merge-ort.h
> > index d643b47cb7c..e635a294ea8 100644
> > --- a/merge-ort.h
> > +++ b/merge-ort.h
> > @@ -2,6 +2,7 @@
> >  #define MERGE_ORT_H
> >
> >  #include "merge-recursive.h"
> > +#include "hash.h"
> >
> >  struct commit;
> >  struct tree;
> > @@ -89,6 +90,26 @@ void merge_display_update_messages(struct merge_options *opt,
> >                                  struct merge_result *result,
> >                                  FILE *stream);
> >
> > +struct stage_info {
> > +     struct object_id oid;
> > +     int mode;
> > +     int stage;
> > +};
>
> ... I'd rather not tack this onto a `string_list` but instead have
> something like:
>
>         struct conflict_info {

Nit: "conflict_info" is already taken for another structure; we'd need
a different name.  But it does illustrate the idea just fine.

>                 struct {
>                         struct object_id oid;
>                         int mode;
>                         const char *path;
>                 } stages[3]
>         };
>
> where `path` can be `NULL` to indicate that this stage is missing.

I'll get back to the data structure in a second, but the note about
path being `NULL` is interesting.  I'm presuming that oid and mode are
all-zeros as well, yes?  Now, your original request was that you
wanted a line printed for all three stages.  If we're printing oid,
mode, and path for each stage, what do we print for the path one these
lines?  Is it (1) "(null)", (2) "", or (3) the real pathname (via
carefully comparing to surrounding stages to find out what it is)?  If
we're changing the format to only print the name of the path once,
does the code need to loop over the list of conflicts in order to find
out the name (since the first or second stage might have a NULL path
field)?

In regards to the overall data structure and your comment about
string_list: I'm slightly confused.  You say you want to avoid a
string_list, but you've only accounted for there being one conflict in
this data structure.  I don't see how this suggestion avoids the need
for an array or a string_list or some kind of container to hold
multiple of these.

The inclusion of path within the stages array is also interesting --
should there be (up to) three copies of the pathname allocated per
conflict?  Seems a bit wasteful.  Wouldn't it make more sense to have
something like

    struct conflicts {
        struct {
            struct object_id oid;
            unsigned short mode;
        } stages[3]
    };

and then have a string_list storing the pathname -> conflicts mappings
to avoid the pathname duplication?  Or is there something you really
find problematic about the string_list and you really want an array or
other data structure?