Hi Justin
On 12/02/2025 04:18, Justin Tobler wrote:
Through git-diff(1), a single diff can be generated from a pair of blob
revisions directly. Unfortunately, there is not a mechanism to compute
batches of specific file pair diffs in a single process. Such a feature
is particularly useful on the server-side where diffing between a large
set of changes is not feasible all at once due to timeout concerns.
To facilitate this, introduce git-diff-pairs(1) which takes the
null-terminated raw diff format as input on stdin and produces diffs in
other formats. As the raw diff format already contains the necessary
metadata, it becomes possible to progressively generate batches of diffs
without having to recompute rename detection or retrieve object context.
Something like the following:
git diff-tree -r -z -M $old $new |
git diff-pairs -p
should generate the same output as `git diff-tree -p -M`. Furthermore,
each line of raw diff formatted input can also be individually fed to a
separate git-diff-pairs(1) process and still produce the same output.
I like the idea of this, I've left a few comments mainly around the UI.
> +Here's an incomplete list of things that `diff-pairs` could do, but
> +doesn't (mostly in the name of simplicity):
> +
> + - Only `-z` input is accepted, not normal `--raw` input.
I think only accepting NUL terminated input is fine, but if we want to
accept other formats we should have a plan for how to do that in a
backwards compatible way as we cannot use `-z` to distinguish between
input formats.
+ const char * const usage[] = {
+ N_("git diff-pairs [diff-options]"),
Normally the option summary printed by "git foo -h" is generated by the
option parser. In this case we don't define any options and use
setup_revisions() instead so we need to provide the option summary
ourselves. Looking at diff-files.c we can add
"\n"
COMMON_DIFF_OPTIONS_HELP;
to do that.
+ argc = setup_revisions(argc, argv, &revs, NULL);
I think we should check that there are no options left on the
commandline after setup_revisions() returns
+ /* Don't allow pathspecs at all. */
+ if (revs.prune_data.nr)
+ usage_with_options(usage, options);
It is not just pathspecs that we want to reject but all revision related
options. Looking at diff-files.c we can do
if (rev.pending.nr ||
rev.min_age != -1 || rev.max_age != -1 ||
rev.max_count != -1)
usage_with_option(usage, options);
To catch some of that but it still accepts things like "--first-parent",
"--merges" and "--ancestry-path". We may just have to live with that as
I don't think it is worth expanding a huge amount of effort to prevent them.
+ if (!revs.diffopt.output_format)
+ revs.diffopt.output_format = DIFF_FORMAT_RAW;
This matches the other diff plumbing commands but I'm not sure it is the
most helpful default for a command that is supposed to transform raw
diffs into another format. Maybe we should default to DIFF_FORMAT_PATCH?
+test_expect_success 'split input across multiple diff-pairs' '
This needs a PERL prerequisite I think. I'm a bit unsure what this test
adds compared to the others.
Best Wishes
Phillip
+ write_script split-raw-diff "$PERL_PATH" <<-\EOF &&
+ $/ = "\0";
+ while (<>) {
+ my $meta = $_;
+ my $path = <>;
+ # renames have an extra path
+ my $path2 = <> if $meta =~ /[RC]\d+/;
+
+ open(my $fh, ">", sprintf "diff%03d", $.);
+ print $fh $meta, $path, $path2;
+ }
+ EOF
+
+ git diff-tree -p -M -C -C base new >expect &&
+
+ git diff-tree -r -z -M -C -C base new |
+ ./split-raw-diff &&
+ for i in diff*; do
+ git diff-pairs -p <$i || return 1
+ done >actual &&
+ test_cmp expect actual
+'
+
+test_done