On Tue, Feb 25, 2025 at 05:39:24PM -0600, Justin Tobler wrote: > Through git-diff(1), a single diff can be generated from a pair of blob > revisions directly. Unfortunately, there is not a mechanism to compute > batches of specific file pair diffs in a single process. Such a feature > is particularly useful on the server-side where diffing between a large > set of changes is not feasible all at once due to timeout concerns. > > To facilitate this, introduce git-diff-pairs(1) which acts as a backend > passing its NUL-terminated raw diff format input from stdin through diff > machinery to produce various forms of output such as patch or raw. > > The raw format was originally designed as an interchange format and > represents the contents of the diff_queue_diff list making it possible > to break the diff pipeline into separate stages. For example, > git-diff-tree(1) can be used as a frontend to compute file pairs to > queue and feed its raw output to git-diff-pairs(1) to compute patches. > With this, batches of diffs can be progessively generated without having > to recompute rename detection or retrieve object context. Something like s/rename detection/renames/ > diff --git a/builtin/diff-pairs.c b/builtin/diff-pairs.c > new file mode 100644 > index 0000000000..9472b10461 > --- /dev/null > +++ b/builtin/diff-pairs.c > @@ -0,0 +1,193 @@ > +#include "builtin.h" > +#include "commit.h" > +#include "config.h" > +#include "diff.h" > +#include "diffcore.h" > +#include "gettext.h" > +#include "hex.h" > +#include "object.h" > +#include "parse-options.h" > +#include "revision.h" > +#include "strbuf.h" > + > +static unsigned parse_mode_or_die(const char *mode, const char **endp) > +{ > + uint16_t ret; > + > + *endp = parse_mode(mode, &ret); > + if (!*endp) > + die(_("unable to parse mode: %s"), mode); > + return ret; > +} > + > +static void parse_oid_or_die(const char *p, struct object_id *oid, > + const char **endp, const struct git_hash_algo *algop) > +{ > + if (parse_oid_hex_algop(p, oid, endp, algop) || *(*endp)++ != ' ') > + die(_("unable to parse object id: %s"), p); > +} > + > +static void flush_diff_queue(struct diff_options *options) > +{ > + /* > + * If rename detection is not requested, use rename information from the > + * raw diff formatted input. Setting found_follow ensures diffcore_std() > + * does not mess with rename information already present in queued > + * filepairs. > + */ > + if (!options->detect_rename) > + options->found_follow = 1; It's a bit weird that we set this over here. Shouldn't we have set it up in the main function already? > + diffcore_std(options); > + diff_flush(options); > +} > + > +int cmd_diff_pairs(int argc, const char **argv, const char *prefix, > + struct repository *repo) > +{ > + struct strbuf path_dst = STRBUF_INIT; > + struct strbuf path = STRBUF_INIT; > + struct strbuf meta = STRBUF_INIT; > + struct rev_info revs; > + int ret; > + > + const char * const usage[] = { > + N_("git diff-pairs -z [<diff-options>]"), > + NULL > + }; > + struct option options[] = { > + OPT_END() > + }; > + struct option *parseopts = add_diff_options(options, &revs.diffopt); > + > + show_usage_with_options_if_asked(argc, argv, usage, parseopts); Don't we also have to call `parse_options()` even though we don't have our own options yet? Or is this all handled by `setup_revisions()`? > + repo_init_revisions(repo, &revs, prefix); > + repo_config(repo, git_diff_basic_config, NULL); > + revs.disable_stdin = 1; > + revs.abbrev = 0; > + revs.diff = 1; > + > + if (setup_revisions(argc, argv, &revs, NULL) > 1) > + usage_with_options(usage, parseopts); I think it's discouraged nowadays to use `usage_with_options()` as it generates a ton of noise while hiding the actual error message. It is instead recommended to directly call `usage()` with an error message. In this case here we would say e.g. `usage(_("unrecognized argument: %s"), argv[0])`, in the cases below we'd use the error messages you already have. > + > + /* > + * With the -z option, both command input and raw output are > + * NUL-delimited (this mode does not effect patch output). At present > + * only NUL-delimited raw diff formatted input is supported. > + */ > + if (revs.diffopt.line_termination) { > + error(_("working without -z is not supported")); > + usage_with_options(usage, parseopts); > + } > + > + if (revs.prune_data.nr) { > + error(_("pathspec arguments not supported")); > + usage_with_options(usage, parseopts); > + } > + > + if (revs.pending.nr || revs.max_count != -1 || > + revs.min_age != (timestamp_t)-1 || > + revs.max_age != (timestamp_t)-1) { > + error(_("revision arguments not allowed")); > + usage_with_options(usage, parseopts); > + } Okay. We restrict a bunch of usages, which makes your job simpler right now, but by dying it keeps the door open to iterate on those in the future. > + if (!revs.diffopt.output_format) > + revs.diffopt.output_format = DIFF_FORMAT_PATCH; Instead of setting this conditionally, can we already set it up as a default before calling `setup_revisions()`? > + while (1) { > + struct object_id oid_a, oid_b; > + struct diff_filepair *pair; > + unsigned mode_a, mode_b; > + const char *p; > + char status; > + > + if (strbuf_getline_nul(&meta, stdin) == EOF) > + break; > + > + p = meta.buf; > + if (*p != ':') > + die(_("invalid raw diff input")); > + p++; > + > + mode_a = parse_mode_or_die(p, &p); > + mode_b = parse_mode_or_die(p, &p); > + > + if (S_ISDIR(mode_a) || S_ISDIR(mode_b)) > + die(_("tree objects not supported")); I assume submodules aren't supported either, are they? If so, do we also have to check for `S_ISGITLINK()`? It would be nice to have a test for them. Patrick