Re: [PATCH 0/3] batch blob diff generation

Justin Tobler <jltobler@xxxxxxxxx> · Fri, 13 Dec 2024 10:41:25 -0600

On 24/12/13 03:12AM, Jeff King wrote:

> In the first example, I think just using "git diff" would work (though
> it is not a plumbing command). But the stdin example is what's
> interesting here anyway, since it can handle arbitrary inputs. So let's
> focus on that.
> 
> Feeding just blob ids has a big drawback: we don't have any context! So
> you get bogus filenames in the patch, no mode data, and so on.
> 
> Feeding the paths along with their commits, as you do on the second
> line, gives you those things from the lookup context. But it also has
> some problems. One, it's needlessly expensive; we have to traverse
> HEAD~5000, and then dig into its tree to find the blobs (which
> presumably you already did, since how else would you end up with those
> oids). And two, there are parsing ambiguities, since arbitrary revision
> names can contain spaces. E.g., are we looking for the file "README.md
> HEAD:README.md" in HEAD~5000?
> 
> So ideally we'd have an input format that encapsulates that extra
> context data and provides some mechanism for quoting. And it turns out
> we do: the --raw diff format.

I had not considered using the raw diff format as the input source. As
you pointed out, using blob IDs alone loses some of the useful context.
By using path-scoped revisions, we can still get this context, but at an
added cost of having to traverse the tree to get the underlying
information.

As you also mentioned, this is potentially wasteful if, for example, the
blobs diffs you are trying to generate are a subset of git-diff-tree(1)
output and thus the context is already known ahead of time. Which is
exactly what we are hoping to accomplish.

> If the program takes that format, then you can manually feed it two
> arbitrary blob oids if you have them (and put whatever you like for the
> mode/path context), like:
> 
>   git diff-blob --stdin <<\EOF
>   :100644 100644 88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b M	README.md
>   EOF
> 
> Or you can get the real context yourself (though it seems to me that
> this is a gap in what "cat-file --batch" should be able to do in a
> single process):
> 
>   git ls-tree HEAD~5000 README.md >out
>   read mode_a blob oid_a path <out
>   git ls-tree HEAD README.md >out
>   read mode_b blob oid_b path <out
>   printf ":$mode_a $mode_b $oid_a $oid_b M\tREADME.md" |
>   git diff-blob --stdin
> 
> But it also means you can use --raw output directly. So:
> 
>   git diff-tree --raw -r HEAD~5000 HEAD -- README.md |
>   git diff-blob --stdin
> 
> Now that command by itself doesn't look all that useful; you could have
> just asked for patches from diff-tree. But by splitting the two, you can
> filter the set of paths in between (for example, to omit some entries,
> or to batch a large diff into more manageable chunks for pagination,
> etc).

Yup, this is exactly what I'm hoping to achieve! As a single commit may
contain an unbounded number changes, being able to control diff
generation at the blob level is quite useful. Using the raw diff format
as input looks like a rather elegant solution and I think it makes
sense to use it here for the "--stdin" option over just reading two
blobs.

> The patch might look something like this:
> 
>   https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@xxxxxxxxxxxxxxxxxxxxx/
> 
> :) That is what has been powering the diffs at github.com since 2016 or
> so. And continues to do so, as far as I know. I don't have access to
> their internal repository anymore, but I've continued to rebase the
> topic forward in my personal repo. You can fetch it from:
> 
>   https://github.com/peff/git jk/diff-pairs
> 
> in case that is helpful.

Thanks Peff! From looking at the mentioned thread and branch, it looks
like I'm basically trying to accomplish the same thing here. Just a bit
late to the conversation. :)

While the use-case is rather narrow, I think it would be nice to see
this functionality provided upstream. I see this as a means to faciliate
more fine-grained control of the blob diffs we actually want to compute
at a given time and it seems like it would be reasonable to expose as
part of the diff plumbing. I would certainly be interested in adapting
this series to instead use raw input from git-diff-tree(1) or trying to
revive the previous series if that is preferred. 

If there is interest in continuing, some lingering questions I have:

Being that the primary purpose of git-diff-blob(1) here is to handle
generating blob diffs as specified by stdin, is there any reason to have
a normal mode that accepts a blob pair as arguments? Or would it be best
to limit the input mechanism to stdin entirely? If the user wanted to
compute a single blob diff they could just use git-diff(1) already so
providing this as a part of git-diff-blob(1) is a bit redundant. Having
it as an option for the user does seem a bit more friendly though.

-Justin