Re: [PATCH v4 2/3] git diff: improve range handling

Chris Torek <chris.torek@xxxxxxxxx> · Fri, 12 Jun 2020 12:25:21 -0700

Ugh, forgot to tweak gmail reply to go to the mailing list.  Also
I typed in a wrong word ("commit" should be "comment").

Corrected reply:

On Fri, Jun 12, 2020 at 11:31 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> "Chris Torek via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
> > +struct symdiff {
> > +     struct bitmap *skip;
> > +     int warn;
> > +     const char *base, *left, *right;
> > +};
> > +
> > +/*
> > + * Check for symmetric-difference arguments, and if present, arrange
> > + * everything we need to know to handle them correctly.  As a bonus,
> > + * weed out all bogus range-based revision specifications, e.g.,
> > + * "git diff A..B C..D" or "git diff A..B C" get rejected.
> > + *
> > + * For an actual symmetric diff, *symdiff is set this way:
> > + *
> > + *  - its skip is non-NULL and marks *all* rev->pending.objects[i]
> > + *    indices that the caller should ignore (extra merge bases, of
> > + *    which there might be many, and A in A...B).  Note that the
> > + *    chosen merge base and right side are NOT marked.
> > + *  - warn is set if there are multiple merge bases.
> > + *  - base, left, and right point to the names to use in a
> > + *    warning about multiple merge bases.
> > + *
> > + * If there is no symmetric diff argument, sym->skip is NULL and
> > + * sym->warn is cleared.  The remaining fields are not set.
> > + */
>
> OK.
>
> > +static void symdiff_prepare(struct rev_info *rev, struct symdiff *sym)
> > +{
> > +     int i, is_symdiff = 0, basecount = 0, othercount = 0;
> > +     int lpos = -1, rpos = -1, basepos = -1;
> > +     struct bitmap *map = NULL;
> > +
> > +     /*
> > +      * Use the whence fields to find merge bases and left and
> > +      * right parts of symmetric difference, so that we do not
> > +      * depend on the order that revisions are parsed.  If there
> > +      * are any revs that aren't from these sources, we have a
> > +      * "git diff C A...B" or "git diff A...B C" case.  Or we
> > +      * could even get "git diff A...B C...E", for instance.
> > +      *
> > +      * If we don't have just one merge base, we pick one
> > +      * at random.
> > +      *
> > +      * NB: REV_CMD_LEFT, REV_CMD_RIGHT are also used for A..B,
> > +      * so we must check for SYMMETRIC_LEFT too.  The two arrays
> > +      * rev->pending.objects and rev->cmdline.rev are parallel.
> > +      */
> > +     for (i = 0; i < rev->cmdline.nr; i++) {
> > +             struct object *obj = rev->pending.objects[i].item;
> > +             switch (rev->cmdline.rev[i].whence) {
> > +             case REV_CMD_MERGE_BASE:
> > +                     if (basepos < 0)
> > +                             basepos = i;
> > +                     basecount++;
> > +                     break;          /* do mark all bases */
>
> We find and use the first found merge base (i.e. "pick at random" as
> promised in the comment before the function), but for warning, keep
> track of how many merge bases there are.
>
> > +             case REV_CMD_LEFT:
> > +                     if (lpos >= 0)
> > +                             usage(builtin_diff_usage);
>
> A range (either A..B or A...B) has already been seen, and we have
> another, which is now rejected.
>
> > +                     lpos = i;
> > +                     if (obj->flags & SYMMETRIC_LEFT) {
> > +                             is_symdiff = 1;
> > +                             break;  /* do mark A */
>
> Inside this switch statement, "continue" is a sign that the caller
> should use the rev, and "break" is a sign that the rev is to be
> ignored.  We obviously do not ignore "A" in ...
>
> > +                     }
> > +                     continue;
>
> ... "A..B" notation, so we "continue" here.
>
> > +             case REV_CMD_RIGHT:
> > +                     if (rpos >= 0)
> > +                             usage(builtin_diff_usage);
>
> Here is the same "we reject having two or more ranges".
>
> I actually suspect that this usage() would become dead code---we
> would already have died when we saw the matching left end of the
> second range (so this could become BUG(), even though usage() does
> not hurt).

Right - I considered it both ways and figured a second usage() was
simpler and straightforward, but I'm fine with either method.

> > +                     rpos = i;
> > +                     continue;       /* don't mark B */
>
> And of course, whether "A..B" or "A...B", B will be used as the
> "result" side of the diff, so won't be marked for skipping.
>
> > +             case REV_CMD_PARENTS_ONLY:
> > +             case REV_CMD_REF:
> > +             case REV_CMD_REV:
> > +                     othercount++;
> > +                     continue;
>
> I wonder if we want to use "default" instead of these three
> individual cases.  Pros and cons?
>
>  - If we forgot to list a whence REV_CMD_* here, it will be silently
>    marked to be skipped with this code.  With "default", it will be
>    counted to be diffed (which may trigger "giving too many revs to
>    be diffed" error from the diff machinery, which is good).
>
>  - With "default", when we add new type of whence to REV_CMD_* and
>    forget to adjust this code, it will be counted to be diffed.
>    With the current code, it will be skipped.
>
> We probably could get the best of the both words by keeping the
> above three for "counted in othercount and kept), and then add a
> default arm to the switch() that just says
>
>                 default:
>                         BUG("forgot to handle %d",
>                             rev->cmdline.rev[i].whence);

I'm fine with this as well.  I did it the way I did because at least some
compilers give a warning or error if you forgot an enum.  Using default
(or adding one) defeats this, but the BUG method is reasonable.

> That way, every time we add a new type of whence, we would be forced
> to think what should be done to them.
>
> > +             }
> > +             if (map == NULL)
> > +                     map = bitmap_new();
> > +             bitmap_set(map, i);
> > +     }
> > +
> > +     /*
> > +      * Forbid any additional revs for both A...B and A..B.
> > +      */
> > +     if (lpos >= 0 && othercount > 0)
> > +             usage(builtin_diff_usage);
>
> Meaning "git diff A..B C" is bad.  Reasonable.
>
> > +     if (!is_symdiff) {
> > +             bitmap_free(map);
>
> It is not wrong per-se to free it unconditionally, but wouldn't it
> be a bug if (map != NULL) at this point in the flow?

Yes, because the A in A...B will have been marked.  It didn't
seem worth a BUG() call though.

> The merge bases would only be stuffed in the revs when A...B is
> given, and we are not skipping anything involved in A..B or
> non-range revs.
>
> > +             sym->warn = 0;
> > +             sym->skip = NULL;
>
> Clearing these two fields are as promised to the callers in the
> comment above, which is good.
>
> > +             return;
> > +     }
> > +
> > +     sym->left = rev->pending.objects[lpos].name;
> > +     sym->right = rev->pending.objects[rpos].name;
> > +     sym->base = rev->pending.objects[basepos].name;
> > +     if (basecount == 0)
> > +             die(_("%s...%s: no merge base"), sym->left, sym->right);
>
> Good.
>
> > +     bitmap_unset(map, basepos);     /* unmark the base we want */
>
> Hmph.  You could
>
>         case REV_CMD_MERGE_BASE:
>                 basecount++;
>                 if (basepos < 0) {
>                         basepos = i;
>                         continue; /* keep this one */
>                 }
>                 break; /* skip all others */
>
> and lose this unset().  I do not think it makes too much of a
> difference, but it probably is easier to follow if we avoided this
> "do something and then come back to correct" pattern.

OK, I'll change it to do this in the re-roll.

> > +     sym->warn = basecount > 1;
> > +     sym->skip = map;
> > +}
> > +
> >  int cmd_diff(int argc, const char **argv, const char *prefix)
> >  {
> >       int i;
> > @@ -263,6 +366,7 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
> >       struct object_array_entry *blob[2];
> >       int nongit = 0, no_index = 0;
> >       int result = 0;
> > +     struct symdiff sdiff;
> >
> >       /*
> >        * We could get N tree-ish in the rev.pending_objects list.
> > @@ -382,6 +486,7 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
> >               }
> >       }
> >
> > +     symdiff_prepare(&rev, &sdiff);
> >       for (i = 0; i < rev.pending.nr; i++) {
> >               struct object_array_entry *entry = &rev.pending.objects[i];
> >               struct object *obj = entry->item;
> > @@ -396,6 +501,8 @@ int cmd_diff(int argc, const char **argv, const char *prefix)
> >                       obj = &get_commit_tree(((struct commit *)obj))->object;
> >
> >               if (obj->type == OBJ_TREE) {
> > +                     if (sdiff.skip && bitmap_get(sdiff.skip, i))
> > +                             continue;
>
> By the way, I cannot shake this feeling that, given that
> rev.pending/cmdline.nr will not be an unreasonably large number, if
> it is overkill to use the bitmap here.  If I were writing this code,
> I would have made symdiff_prepare() to fill a separate object array
> by copying the elements to be used in the final "diff" out of the
> rev.pending array and updated this loop to iterate over that array.
>
> I am not saying that such an approach is better than the use of
> bitmap code here.  It just was a bit unexpected to see the bitmap
> code used for set of objects that is typically less than a dozen.

I am sure it is overkill -- but at the same time, it's already coded
and cheap enough to use that rolling our own separate array felt
worse.  I can add a comment (NOT COMMIT) if you like.

Thanks,

Chris