Re: [PATCH 0/3] batch blob diff generation

On Thu, Dec 12, 2024 at 10:23:09PM -0600, Justin Tobler wrote:

> To enable support for batch diffs of multiple blob pairs, this
> series introduces a new diff plumbing command git-diff-blob(1). Similar
> to git-diff-tree(1), it provides a "--stdin" option that reads a pair of
> blobs on each line of input and generates the diffs. This is intended to
> be used for scripting purposes where more fine-grained control for diff
> generation is desired. Below is an example for each usage:
> 
>     $ git diff-blob HEAD~5000:README.md HEAD:README.md
> 
>     $ git diff-blob --stdin <<EOF
>     88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b
>     HEAD~5000:README.md HEAD:README.md
>     EOF

In the first example, I think just using "git diff" would work (though
it is not a plumbing command). But the stdin example is what's
interesting here anyway, since it can handle arbitrary inputs. So let's
focus on that.

Feeding just blob ids has a big drawback: we don't have any context! So
you get bogus filenames in the patch, no mode data, and so on.

Feeding the paths along with their commits, as you do on the second
line, gives you those things from the lookup context. But it also has
some problems. One, it's needlessly expensive; we have to traverse
HEAD~5000, and then dig into its tree to find the blobs (which
presumably you already did, since how else would you end up with those
oids). And two, there are parsing ambiguities, since arbitrary revision
names can contain spaces. E.g., are we looking for the file "README.md
HEAD:README.md" in HEAD~5000?

So ideally we'd have an input format that encapsulates that extra
context data and provides some mechanism for quoting. And it turns out
we do: the --raw diff format.

If the program takes that format, then you can manually feed it two
arbitrary blob oids if you have them (and put whatever you like for the
mode/path context), like:

  git diff-blob --stdin <<\EOF
  :100644 100644 88f126184c52bfe4859ec189d018872902e02a84 665ce5f5a83647619fba9157fa9b0141ae8b228b M	README.md
  EOF
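For reference, a --raw line is a colon, the two modes, the two object
ids, a status letter, then a tab and the path. Here is a minimal sketch
of pulling those fields apart in shell. Note that parse_raw_line is a
hypothetical helper (not part of git), and it assumes a single unquoted
path, so renames/copies (two paths) and quoted paths are not handled:

```shell
# Hypothetical helper: split one --raw line into its fields.
# Assumes the simple case: one unquoted path, no -z output.
parse_raw_line() {
	line=$1
	tab=$(printf '\t')
	meta=${line%%"$tab"*}   # everything before the first tab
	path=${line#*"$tab"}    # everything after the first tab
	meta=${meta#:}          # drop the leading colon
	# Word-split the metadata: mode_a mode_b oid_a oid_b status
	set -- $meta
	printf 'mode_a=%s mode_b=%s oid_a=%s oid_b=%s status=%s path=%s\n' \
		"$1" "$2" "$3" "$4" "$5" "$path"
}
```

Splitting on the first tab rather than on whitespace is what keeps a
path containing spaces in one piece.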

Or you can get the real context yourself (though it seems to me that
this is a gap in what "cat-file --batch" should be able to do in a
single process):

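  # ls-tree prints "<mode> <type> <oid>\t<path>"; read splits on both
  git ls-tree HEAD~5000 README.md >out
  read mode_a type_a oid_a path_a <out
  git ls-tree HEAD README.md >out
  read mode_b type_b oid_b path_b <out
  # build one --raw line, terminated by a newline
  printf ':%s %s %s %s M\t%s\n' "$mode_a" "$mode_b" "$oid_a" "$oid_b" README.md |
  git diff-blob --stdin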

But it also means you can use --raw output directly. So:

  git diff-tree --raw -r HEAD~5000 HEAD -- README.md |
  git diff-blob --stdin

Now that command by itself doesn't look all that useful; you could have
just asked for patches from diff-tree. But by splitting the two, you can
filter the set of paths in between (for example, to omit some entries,
or to batch a large diff into more manageable chunks for pagination,
etc).
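As a rough sketch of what that in-between filtering could look like:
drop_prefix and page_raw below are hypothetical helpers, not git
commands, and they assume newline-terminated --raw lines with unquoted
paths (no -z, no renames). The final pipeline in the comment also
assumes the proposed "git diff-blob --stdin" accepts --raw input, as
discussed above.

```shell
# Hypothetical helper: drop raw entries whose path starts with a prefix.
# The prefix is passed through the environment so awk does not
# interpret backslashes in it.
drop_prefix() {
	prefix="$1" awk -F '\t' 'index($2, ENVIRON["prefix"]) != 1'
}

# Hypothetical helper: emit page number $1 of the input, $2 lines per page.
page_raw() {
	first=$(( ($1 - 1) * $2 + 1 ))
	last=$(( $1 * $2 ))
	sed -n "${first},${last}p"
}

# Possible usage (assumes diff-blob takes --raw lines on stdin):
#   git diff-tree --raw -r HEAD~5000 HEAD |
#   drop_prefix vendor/ |
#   page_raw 1 100 |
#   git diff-blob --stdin
```

The point is only that the raw entries are plain lines, so ordinary
text tools can select or slice them before any patch is generated.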

The patch might look something like this:

  https://lore.kernel.org/git/20161201204042.6yslbyrg7l6ghhww@xxxxxxxxxxxxxxxxxxxxx/

:) That is what has been powering the diffs at github.com since 2016 or
so. And continues to do so, as far as I know. I don't have access to
their internal repository anymore, but I've continued to rebase the
topic forward in my personal repo. You can fetch it from:

  https://github.com/peff/git jk/diff-pairs

in case that is helpful.

-Peff



