On Sun, Dec 15, 2024 at 03:24:11PM -0800, Junio C Hamano wrote: > > Funny. The raw diff format indeed was designed as an interchange > > format from various "compare two sets of things" front-ends (like > > diff-files, diff-cache, and diff-tree) that emits the raw format, to > > be read by "diff-helper" (initially called "diff-tree-helper") that > > takes the raw format and > > > > - matches removed and added paths with similar contents to detect > > renames and copies > > > > - computes the output in various formats including "patch". > > > > So I guess we came a full circle, finally ;-). Looking in the archive > > for messages exchanged between junkio@ and torvalds@ mentioning diff > > before 2005-05-30 finds some interesting gems. > > > > https://lore.kernel.org/git/7v1x8zsamn.fsf_-_@xxxxxxxxxxxxxxxxxxxxxxxx/ :) That same thread was linked when I posted the original diff-pairs many years ago. > So, if we were to do what Justin tried to do honoring the overall > design of our diff machinery, I think what we can do is as follows: > > * Use the "diff --raw" output format as the input, but with a bit > of twist. > > (1) a narrow special case that takes only a single diff_filepair > of <old> and <new> blobs, and immediately run diff_queue() on > that single diff_filepair, which is Justin's use case. For > this mode of operation, "flush after reach record of input" > may be sufficient. My understanding was that he does not actually care about this case (just feeding two blobs), but is actually processing --raw output in the first place. Or did you just mean that we'd still be feeding raw output, but just getting the flush behavior? > (2) as a general "interchange format" to feed "comparison between > two sets of <object, path>" into our diff machinery, we are > better off if we can treat the input stream as multiple > records that describes comparison between two sets. Imagine > "git log --oneline --first-parent -2 --raw HEAD", where one > set of "diff --raw" records show the changed blobs with their > paths between HEAD~1 and HEAD, and another set does so for > HEAD~2 and HEAD~1. We need to be able to tell where the > first set ends and the second set starts, so that rename > detection and other things, if requested, can be done within > each set. Seems reasonable. For the use of diff-pairs at GitHub, I always just did full-tree things like rename detection in the initial diff-tree invocation. Since my goal was splitting/filtering, doing it after would yield wrong answers (since diff-pairs never sees the complete set). But it's possible for somebody to want to filter the intermediate results, then do full-tree stuff on the result (or even just delay the cost of rename detection). And certainly it's possible to want to feed a whole bunch of unrelated diff segments without having to spawn a process for each. So it's not something I wanted, but I agree it's good to plan for. > My recommendation is to use a single blank line as a separator, > e.g. > > :100644 100644 ce31f93061 9829984b0a M Documentation/git-refs.txt > :100644 100644 8b3882cff1 4a74f7c7bd M refs.c > :100755 100755 1bfff3a7af f59bc4860f M t/t1460-refs-migrate.sh > > :100644 100644 c11213f520 8953d1c6d3 M refs/files-backend.c > :100644 100644 b2e3ba877d bec5962deb M refs/reftable-backend.c > > so an application that wants to compare only one diff_filepair > at a time would issue something like > > :100644 100644 ce31f93061 9829984b0a M Documentation/git-refs.txt > > :100644 100644 8b3882cff1 4a74f7c7bd M refs.c > > :100755 100755 1bfff3a7af f59bc4860f M t/t1460-refs-migrate.sh > > so the parsing machinery does not have to worry about case (1) above. Yeah, that seems good. And it is backwards-compatible with the existing diff-pairs format (which just barfs on a blank line). That's not a big concern for the project, but it is nice that it makes life a bit simpler for folks who would eventually be tasked with switching from it to this new tool. ;) > * Parse and append the input into diff_queue(), until you see an > blank line. > > - If at EOF you are done, but if you have something accumulated > in diff_queue(), show them (like below) first. In any case, at > EOF, you are done. Yep, makes sense. > * Run diffcore_std() followed by diff_flush() to have the contents > of the queue nicely formatted and emptied. Go back to parsing > more input lines. That makes sense. I don't think my diff-pairs runs diffcore_std() at all. The plumbing defaults mean it would always be a noop unless you explicitly passed in rename, etc, options, and I never wanted to do that. We might have to check the interaction of diffcore code on a set of queued diffs that already have values for renames, etc. I.e., that: git diff-tree --raw -M | git diff-pairs -M does not barf, since the input step in diff-pairs is going to set status to 'R', etc, in the pairs. -Peff