Re: BUG? in --dirstat when rearranging lines in a file

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Apr 7, 2011 at 6:49 AM, Johan Herland <johan@xxxxxxxxxxx> wrote:
>
> Consider the following sequence of commands:
> [...]
> $ git diff --stat
>  dir/file |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> $ git diff --dirstat
> $ # WTF!?

So the "--dirstat" thing really is different - it has never done a
full patch, it really only does a line hash count and then estimates
the amount of deleted/new code from that.

> Is this a bug or a feature? :)

It's a "bueature" or a "featug". The dirstat code counts "damage" as
you noticed, and it does that because it's easy and often relevant. It
can be computed without actually generating the whole diff, the same
way we do rename-detection without actually generating the diff, by
just looking at "hash and count each line" information. In fact, it
uses the same "diffcore_count_changes()" function for it.

So the reason I wouldn't call it a bug is that it very much is on
purpose. Generating a real diff is much more expensive, and instead
using the line hashes gives us a quick and efficient O(n) way to
gather up differences. But because it does the difference by basically
just comparing hashes of the lines without taking _ordering_ into
account, it gives you a "how many lines do these files have in common"
rather than a real diff.

So git internally has *three* different "difference" engines:

 - the "delta" algorithm that we use for packing (and binary diffs)
 - the traditional line-based diff for normal diffs
 - the "rename/copy/dirstat damage detectior" that doesn't take line
ordering into account, only some unordered "heap of data contents"
comparison.

You could think of the damage detection as a "rename detection within
a file". It's actually quite nice for "git diff -M --dirstat", where
it means that pure code movement - whether inside a file or by
renaming a file - doesn't count as damage.

(Of course, moving a piece of code _from_ one file to another still
counts as damage, so it's not really ignoring pure code movement).

NOTE! Speed isn't the only reason we do that "unordered heap of data
contents" comparison. For rename detection, we really don't want
moving a big function around in a file to be counted as "rewriting the
file". So there are actually other reasons to like these semantics.
That said, honestly, for dirstat, the big issue was that it made it
really really simple. Look at how small the dirstat patch was (commit
7df7c019c2a4), and realize that it's because it just re-used the
existing damage counting code.

                            Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]