Re: [PATCH 1/2] diffcore-break: don't divide by zero

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 02 Apr 2013 15:41:10 -0700

John Keeping <john@xxxxxxxxxxxxx> writes:

> The message for commit 6dd4b66 (Fix diffcore-break total breakage)
> indicates that "don't bother to break small files" is wrong in some
> cases, but it I wonder if "don't bother to break empty files" is okay.

This has a rather subtle ramifications, and we would need to think
carefully.  "break" does two very different things, and the criteria
you would want to use to decide to "break" a filepair depends on
which one you are interested in.

The very original motivation of "break" was to show a patch that
rewrites an existing file completely in a way different from the
usual "diff -u" output, which will try to minimize the output by
finding (rare) lines that happen to be the same between the preimage
and postimage, intersparsed in many lines of "-" (deletion) and "+"
(addition).  Such a change is often easier to understand when shown
as a single block of "-" (deletion) of the entire original contents,
followed by a single block of "+" (addition) of the entire new
contents.

A totally separate motivation of "break" is the one Linus talks
about in the log message of the said commit.  A path filename.h was
moved to filename_32.h, and a new (and much smaller) filename.h was
introduced, that "#include"s filename_32.h.  "diff -M" that pairs
deleted files with created files to compute renames in such a case
would not consider the original filename.h as a possible source of
filename_32.h that was created.  You want to break modification of
filename.h into (virtual) deletion and addition of filename.h.

For the purpose of the former, you would want not to break a file
too aggressively.  If you started from a file with 1000 lines in it,
and deleted 910 lines and added 10 lines to result in a file with
100 lines, you still have 90 lines of shared contents between the
preimage and the postimage, and you do not want to show it as
"delete 1000 lines and add 100 lines".  You would want to base your
decision on how much common material exists between the two.

For the purpose of the latter, however, it is better to err on the
side of breaking than not breaking.  After breaking a suspicious
modification into addition and deletion, if rename comparison does
not find a suitable destination for the virtual deletion, the broken
halves will be merged back, so breaking too little can hurt by
missing potential renames, but breaking too much will not.  You do
want to break the "900 deleted out of 1000 and then added 10" case,
as that 900 lines may have gone to another file that was created in
the same commit.  "But we have 90 lines of common material" does not
matter for the decision to break.

In today's code, the return value of should_break() is about the
latter, while the score it stores in *merge_score is about the
former.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html