Re: [RFC/PATCH 5/3] Alternative --dirstat implementation, based on diffstat analysis

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tuesday 12 April 2011, Linus Torvalds wrote:
> On Tue, Apr 12, 2011 at 7:46 AM, Linus Torvalds
> 
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > So I don't hate the idea, but I do hate the "use a config option"
> > part. Or rather, I hate the fact that it's the _only_ way to do it
> > (and the particular config name you chose).
> 
> Oh, and one thing strikes me: I think the fast dirstat gave reasonable
> values when you had mixed text and binary (in the kernel tree, look
> for the Documentation/logo.gif file, for example: it changed to the
> Tasmanian devil in one release).
> 
> Have you checked what happens to that when you use the diffstat one?
> Because binary files are done very differently (byte-based counts).
> 
> So check out
> 
>    git show --dirstat 3d4f16348b77efbf81b7fa186a18a0eb815b6b84
> 
> with and without your change. The old dirstat gives
> 
>   44.0% Documentation/
>   55.9% drivers/video/logo/
> 
> which is at least not completely insane.

My change obviously makes a difference:

  68.7% Documentation/
  31.2% drivers/video/logo/

To make some more sense of the number, here they are with some extra
output from a debug printf:

$ ../git/git show --dirstat 3d4f163
  [...]

        Documentation/logo.gif: +16335 -0 => damage = 16335
        Documentation/logo.svg: +0 -310450 => damage = 310450
        Documentation/logo.txt: +562 -200 => damage = 762
        drivers/video/logo/logo_linux_clut224.ppm: +76628 -136093 => damage = 212721
        drivers/video/logo/logo_linux_vga16.ppm: +76837 -126084 => damage = 202921
  44.0% Documentation/
  55.9% drivers/video/logo/


$ ../git/git -c diff.dirstatbasedondiffstat=true show --dirstat 3d4f163
  [...]

        Documentation/logo.gif: +16335 -0 => damage = 16335
        Documentation/logo.svg: +0 -2911 => damage = 2911
        Documentation/logo.txt: +12 -3 => damage = 15
        drivers/video/logo/logo_linux_clut224.ppm: +1602 -2826 => damage = 4428
        drivers/video/logo/logo_linux_vga16.ppm: +1602 -2737 => damage = 4339
  68.7% Documentation/
  31.2% drivers/video/logo/

In the original dirstat numbers (computed by diffcore_count_changes())
all the numbers (both from text and binary files) are on a byte scale.
(making the binary logo.gif changes proportional in scale to the rest).

In the diffstat analysis, however, binary changes are reported in bytes,
while text changes are reported in lines. This obviously makes binary
changes count disproportionately more than textual changes.

> The reason I bring this up is because I think this was an issue at one
> point, and one of the statistics things (--stat or --numstat or
> --dirstat) gave absolutely horrid values (basically comparing "bytes
> changed" for binaries with "lines changed" for text files). Resulting
> in totally skewed statistics.

Indeed, that's exactly what's going on here. Looking at the other
--*stat options:

--stat has a special output mode for binary files:

        Documentation/logo.gif      | Bin 0 -> 16335 bytes

--numstat refuses to show any meaningful output for binary files:

        -       -       Documentation/logo.gif

--shortstat skips binary (and unmerged) files altogether.


So, how should we count binary files in the diffstat version of
--dirstat? Looking at the available data in struct diffstat_file,
there's not a lot of "source material" available. If I had easy
access to the file pre/post size, and the total number of lines,
I could calculate the average number of bytes per line, and then
multiply that with the diffstat numbers to get an approximate
byte count. A crude fallback would be to use 64 bytes per line...

A better solution might be to add a flag to struct diffstat_t
indicating that we want byte counts (as opposed to line counts) for
text files, and then use that flag from within diffstat_consume() to
add "len" instead of "1" to x->added/deleted.


...Johan

-- 
Johan Herland, <johan@xxxxxxxxxxx>
www.herland.net
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]