Re: [PATCH 0/3] batch blob diff generation

Junio C Hamano <gitster@xxxxxxxxx> · Sun, 15 Dec 2024 15:24:11 -0800

Junio C Hamano <gitster@xxxxxxxxx> writes:

> Jeff King <peff@xxxxxxxx> writes:
>
>> So ideally we'd have an input format that encapsulates that extra
>> context data and provides some mechanism for quoting. And it turns out
>> we do: the --raw diff format.
>
> Funny.  The raw diff format indeed was designed as an interchange
> format from various "compare two sets of things" front-ends (like
> diff-files, diff-cache, and diff-tree) that emits the raw format, to
> be read by "diff-helper" (initially called "diff-tree-helper") that
> takes the raw format and
>
>  - matches removed and added paths with similar contents to detect
>    renames and copies
>
>  - computes the output in various formats including "patch".
>
> So I guess we came a full circle, finally ;-).  Looking in the archive
> for messages exchanged between junkio@ and torvalds@ mentioning diff
> before 2005-05-30 finds some interesting gems.
>
> https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@xxxxxxxxxxxxxxxxxxxxxxxx/

So, if we were to do what Justin tried to do honoring the overall
design of our diff machinery, I think what we can do is as follows:

 * Use the "diff --raw" output format as the input, but with a bit
   of a twist.

   (1) a narrow special case that takes only a single diff_filepair
       of <old> and <new> blobs, and immediately runs diff_queue() on
       that single diff_filepair, which is Justin's use case.  For
       this mode of operation, "flush after each record of input"
       may be sufficient.

   (2) as a general "interchange format" to feed "comparison between
       two sets of <object, path>" into our diff machinery, we are
       better off if we can treat the input stream as multiple
       records that describe comparisons between two sets.  Imagine
       "git log --oneline --first-parent -2 --raw HEAD", where one
       set of "diff --raw" records shows the changed blobs with their
       paths between HEAD~1 and HEAD, and another set does so for
       HEAD~2 and HEAD~1.  We need to be able to tell where the
       first set ends and the second set starts, so that rename
       detection and other things, if requested, can be done within
       each set.

   My recommendation is to use a single blank line as a separator,
   e.g.

        :100644 100644 ce31f93061 9829984b0a M	Documentation/git-refs.txt
        :100644 100644 8b3882cff1 4a74f7c7bd M	refs.c
        :100755 100755 1bfff3a7af f59bc4860f M	t/t1460-refs-migrate.sh

        :100644 100644 c11213f520 8953d1c6d3 M	refs/files-backend.c
        :100644 100644 b2e3ba877d bec5962deb M	refs/reftable-backend.c

   so an application that wants to compare only one diff_filepair
   at a time would issue something like

        :100644 100644 ce31f93061 9829984b0a M	Documentation/git-refs.txt

        :100644 100644 8b3882cff1 4a74f7c7bd M	refs.c

        :100755 100755 1bfff3a7af f59bc4860f M	t/t1460-refs-migrate.sh

   so the parsing machinery does not have to worry about case (1) above.

 * Parse and append the input into diff_queue(), until you see a
   blank line.

   - At EOF, if you have something accumulated in diff_queue(),
     show them (like below) first.  In either case, you are done
     at EOF.

 * Run diffcore_std() followed by diff_flush() to have the contents
   of the queue nicely formatted and emptied.  Go back to parsing
   more input lines.
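
The loop described above can be sketched outside of git as follows.
This is only an illustration in Python, not code from git.git:
RawRecord, parse_raw_line() and process_raw_stream() are hypothetical
names, and the flush callback merely stands in for the
"diffcore_std() followed by diff_flush()" step.  The sketch assumes
unquoted, single-path records (no rename or copy entries with two
paths), and it assumes that consecutive blank lines do not trigger an
empty flush.

```python
from typing import Callable, Iterable, List, NamedTuple


class RawRecord(NamedTuple):
    """One "diff --raw" line (hypothetical stand-in for a diff_filepair)."""
    old_mode: str
    new_mode: str
    old_oid: str
    new_oid: str
    status: str
    path: str


def parse_raw_line(line: str) -> RawRecord:
    # Expected shape: ":<old mode> <new mode> <old oid> <new oid> <status>\t<path>"
    # Assumption: exactly one unquoted path follows the TAB.
    if not line.startswith(":"):
        raise ValueError(f"not a raw diff record: {line!r}")
    meta, tab, path = line[1:].partition("\t")
    if not tab or not path:
        raise ValueError(f"missing TAB-separated path: {line!r}")
    fields = meta.split(" ")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields before the path: {line!r}")
    return RawRecord(*fields, path)


def process_raw_stream(
    lines: Iterable[str],
    flush: Callable[[List[RawRecord]], None],
) -> None:
    """Queue records until a blank line, then hand the batch to flush."""
    queue: List[RawRecord] = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "":
            # A blank line separates two sets of records.
            if queue:
                flush(queue)
                queue = []  # start a fresh queue for the next set
            continue
        queue.append(parse_raw_line(line))
    # At EOF, show whatever is still queued before finishing.
    if queue:
        flush(queue)
```

A caller that wants one pair at a time simply puts a blank line after
every record, so each batch handed to flush has a single entry and the
loop needs no special mode for case (1).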