Re: [PATCH 03/18] branch-diff: first rudimentary implementation

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Thu, 3 May 2018 23:01:33 +0200 (DST)

Hi Stefan,

On Thu, 3 May 2018, Stefan Beller wrote:

> On Thu, May 3, 2018 at 8:30 AM, Johannes Schindelin
> <johannes.schindelin@xxxxxx> wrote:
> 
> > Note: due to differences in the diff algorithm (`tbdiff` uses the
> > Pythong module `difflib`, Git uses its xdiff fork), the cost matrix
> > calculated by `branch-diff` is different (but very similar) to the one
> > calculated by `tbdiff`. Therefore, it is possible that they find
> > different matching commits in corner cases (e.g. when a patch was
> > split into two patches of roughly equal length).
> 
> Does that mean, we may want to tweak the underlying diff parameters for
> this special use case eventually?

I don't think that will be necessary. Generating diffs is an ambiguous
business, after all, and we just have to live with the consequence that it
might even be possible that the cost is non-symmetric, i.e. that the
length (i.e. line count) of the diff is different depending whether we
compare old patch to new patch vs new patch to old patch.

If the result changes due to these vagaries, it means that there is no
single one good answer to the question which old/new commits form a pair.
I would expect that only to happen if a commit with a lengthy diff is cut
into two commits whose diffs have roughly equal lengths (so that the
difference of the commit message won't matter that much).

> > -#define COLOR_DUAL_MODE 2
> > -
> 
> Leave this out in the first patch?

Yep, as Ramsay said.

> > @@ -19,6 +23,279 @@ static int parse_creation_weight(const struct option *opt, const char *arg,
> >         return 0;
> >  }
> >
> > +struct patch_util {
> > +       /* For the search for an exact match */
> > +       struct hashmap_entry e;
> > +       const char *diff, *patch;
> > +
> > +       int i;
> > +       int diffsize;
> > +       size_t diff_offset;
> > +       /* the index of the matching item in the other branch, or -1 */
> > +       int matching;
> > +       struct object_id oid;
> > +};
> > +
> > +/*
> > + * Reads the patches into a string list, with the `util` field being populated
> > + * as struct object_id (will need to be free()d).
> > + */
> > +static int read_patches(const char *range, struct string_list *list)
> > +{
> > +       struct child_process cp = CHILD_PROCESS_INIT;
> > +       FILE *in;
> > +       struct strbuf buf = STRBUF_INIT, line = STRBUF_INIT;
> > +       struct patch_util *util = NULL;
> > +       int in_header = 1;
> > +
> > +       argv_array_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges",
> > +                       "--reverse", "--date-order", "--decorate=no",
> > +                       "--no-abbrev-commit", range,
> > +                       NULL);
> > +       cp.out = -1;
> > +       cp.no_stdin = 1;
> > +       cp.git_cmd = 1;
> > +
> > +       if (start_command(&cp))
> > +               return error_errno(_("could not start `log`"));
> > +       in = fdopen(cp.out, "r");
> > +       if (!in) {
> > +               error_errno(_("could not read `log` output"));
> > +               finish_command(&cp);
> > +               return -1;
> > +       }
> 
> With the implementation of --color-moved, there is an option
> to buffer all diff output in memory e6e045f8031 (diff.c: buffer
> all output if asked to, 2017-06-29), so I posit that running this
> diff in-core may be "not too hard". Famous last words.

True. I *did* stumble over emitted_symbols and thought that there might be
a way to leverage it, but I did not find any existing knob, and did not
want to interfere with any other user case of that feature (which we may
want to allow combining with branch-diff one day, afer all).

To be honest, the main reason I spawn here is that I did not want to be
bothered with resetting the commit flags after traversing the first commit
range. But I guess I was just too cheap and should really do it?

OTOH spawning here is a lot easier than not spawning, so maybe it would be
premature optimization?

> In addition to that patch, we'd have to buffer commit messages
> and buffer multiple commits, as that only buffers a diff of a single
> commit.

... and make sure that the moved-code logic (which is currently the only
user of emitted_symbols, correct?) would never be called at the same time
as we generate the diff.

> The benefit would be no invocation of new processes, letting us
> do more in core. This would allow for tweaking revision walking
> internally, e.g. passing of options to this command such as rename
> detection factors, can be passed through easily without the need
> of translating it back to the command line.

On the other hand, we can simply copy those options to the command-line
for `log`. Which might even be better, as e.g. `--format` changes global
state :-(

> > +
> > +               if (starts_with(line.buf, "diff --git")) {
> 
> When using the internal buffers, you would not need to
> string compare, but could just check for the
> DIFF_SYMBOL_HEADER.

True. Not sure whether the current way is *that* terrible, though, as the
`diff` line is meant for parsing by other commands, anyway.

> > +               } else if (starts_with(line.buf, "@@ "))
> > +                       strbuf_addstr(&buf, "@@");
> 
> So we omit line information for hunks. Makes sense,
> though then we could also skip the "index ..." lines?

And we do, in the next line:

> > +               else if (line.buf[0] && !starts_with(line.buf, "index "))

Ciao,
Dscho