Re: [PATCH 0/3] batch blob diff generation

Jeff King <peff@xxxxxxxx> · Mon, 16 Dec 2024 06:30:42 -0500

On Sun, Dec 15, 2024 at 03:24:11PM -0800, Junio C Hamano wrote:

> > Funny.  The raw diff format indeed was designed as an interchange
> > format from various "compare two sets of things" front-ends (like
> > diff-files, diff-cache, and diff-tree) that emits the raw format, to
> > be read by "diff-helper" (initially called "diff-tree-helper") that
> > takes the raw format and
> >
> >  - matches removed and added paths with similar contents to detect
> >    renames and copies
> >
> >  - computes the output in various formats including "patch".
> >
> > So I guess we came a full circle, finally ;-).  Looking in the archive
> > for messages exchanged between junkio@ and torvalds@ mentioning diff
> > before 2005-05-30 finds some interesting gems.
> >
> > https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@xxxxxxxxxxxxxxxxxxxxxxxx/

:) That same thread was linked when I posted the original diff-pairs
many years ago.

> So, if we were to do what Justin tried to do honoring the overall
> design of our diff machinery, I think what we can do is as follows:
> 
>  * Use the "diff --raw" output format as the input, but with a bit
>    of twist.
> 
>    (1) a narrow special case that takes only a single diff_filepair
>        of <old> and <new> blobs, and immediately run diff_queue() on
>        that single diff_filepair, which is Justin's use case.  For
>        this mode of operation, "flush after reach record of input"
>        may be sufficient.

My understanding was that he does not actually care about this case
(just feeding two blobs), but is actually processing --raw output in the
first place. Or did you just mean that we'd still be feeding raw output,
but just getting the flush behavior?

>    (2) as a general "interchange format" to feed "comparison between
>        two sets of <object, path>" into our diff machinery, we are
>        better off if we can treat the input stream as multiple
>        records that describes comparison between two sets.  Imagine
>        "git log --oneline --first-parent -2 --raw HEAD", where one
>        set of "diff --raw" records show the changed blobs with their
>        paths between HEAD~1 and HEAD, and another set does so for
>        HEAD~2 and HEAD~1.  We need to be able to tell where the
>        first set ends and the second set starts, so that rename
>        detection and other things, if requested, can be done within
>        each set.

Seems reasonable. For the use of diff-pairs at GitHub, I always just did
full-tree things like rename detection in the initial diff-tree
invocation. Since my goal was splitting/filtering, doing it after would
yield wrong answers (since diff-pairs never sees the complete set).

But it's possible for somebody to want to filter the intermediate
results, then do full-tree stuff on the result (or even just delay the
cost of rename detection). And certainly it's possible to want to feed a
whole bunch of unrelated diff segments without having to spawn a process
for each.

So it's not something I wanted, but I agree it's good to plan for.

>    My recommendation is to use a single blank line as a separator,
>    e.g.
> 
>         :100644 100644 ce31f93061 9829984b0a M	Documentation/git-refs.txt
>         :100644 100644 8b3882cff1 4a74f7c7bd M	refs.c
>         :100755 100755 1bfff3a7af f59bc4860f M	t/t1460-refs-migrate.sh
> 
>         :100644 100644 c11213f520 8953d1c6d3 M	refs/files-backend.c
>         :100644 100644 b2e3ba877d bec5962deb M	refs/reftable-backend.c
> 
>    so an application that wants to compare only one diff_filepair
>    at a time would issue something like
> 
>         :100644 100644 ce31f93061 9829984b0a M	Documentation/git-refs.txt
> 
>         :100644 100644 8b3882cff1 4a74f7c7bd M	refs.c
> 
>         :100755 100755 1bfff3a7af f59bc4860f M	t/t1460-refs-migrate.sh
> 
>    so the parsing machinery does not have to worry about case (1) above.

Yeah, that seems good. And it is backwards-compatible with the existing
diff-pairs format (which just barfs on a blank line). That's not a
big concern for the project, but it is nice that it makes life a bit
simpler for folks who would eventually be tasked with switching from it
to this new tool. ;)

>  * Parse and append the input into diff_queue(), until you see an
>    blank line.
> 
>    - If at EOF you are done, but if you have something accumulated
>      in diff_queue(), show them (like below) first.  In any case, at
>      EOF, you are done.

Yep, makes sense.

>  * Run diffcore_std() followed by diff_flush() to have the contents
>    of the queue nicely formatted and emptied.  Go back to parsing
>    more input lines.

That makes sense. I don't think my diff-pairs runs diffcore_std() at
all. The plumbing defaults mean it would always be a noop unless you
explicitly passed in rename, etc, options, and I never wanted to do
that.

We might have to check the interaction of diffcore code on a set of
queued diffs that already have values for renames, etc. I.e., that:

  git diff-tree --raw -M |
  git diff-pairs -M

does not barf, since the input step in diff-pairs is going to set status
to 'R', etc, in the pairs.

-Peff