On Tuesday 12 April 2011, Linus Torvalds wrote: > On Tue, Apr 12, 2011 at 7:46 AM, Linus Torvalds > > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > So I don't hate the idea, but I do hate the "use a config option" > > part. Or rather, I hate the fact that it's the _only_ way to do it > > (and the particular config name you chose). > > Oh, and one thing strikes me: I think the fast dirstat gave reasonable > values when you had mixed text and binary (in the kernel tree, look > for the Documentation/logo.gif file, for example: it changed to the > Tasmanian devil in one release). > > Have you checked what happens to that when you use the diffstat one? > Because binary files are done very differently (byte-based counts). > > So check out > > git show --dirstat 3d4f16348b77efbf81b7fa186a18a0eb815b6b84 > > with and without your change. The old dirstat gives > > 44.0% Documentation/ > 55.9% drivers/video/logo/ > > which is at least not completely insane. My change obviously makes a difference: 68.7% Documentation/ 31.2% drivers/video/logo/ To make some more sense of the number, here they are with some extra output from a debug printf: $ ../git/git show --dirstat 3d4f163 [...] Documentation/logo.gif: +16335 -0 => damage = 16335 Documentation/logo.svg: +0 -310450 => damage = 310450 Documentation/logo.txt: +562 -200 => damage = 762 drivers/video/logo/logo_linux_clut224.ppm: +76628 -136093 => damage = 212721 drivers/video/logo/logo_linux_vga16.ppm: +76837 -126084 => damage = 202921 44.0% Documentation/ 55.9% drivers/video/logo/ $ ../git/git -c diff.dirstatbasedondiffstat=true show --dirstat 3d4f163 [...] Documentation/logo.gif: +16335 -0 => damage = 16335 Documentation/logo.svg: +0 -2911 => damage = 2911 Documentation/logo.txt: +12 -3 => damage = 15 drivers/video/logo/logo_linux_clut224.ppm: +1602 -2826 => damage = 4428 drivers/video/logo/logo_linux_vga16.ppm: +1602 -2737 => damage = 4339 68.7% Documentation/ 31.2% drivers/video/logo/ In the original dirstat numbers (computed by diffcore_count_changes()) all the numbers (both from text and binary files) are on a byte scale. (making the binary logo.gif changes proportional in scale to the rest). In the diffstat analysis, however, binary changes are reported in bytes, while text changes are reported in lines. This obviously makes binary changes count disproportionately more than textual changes. > The reason I bring this up is because I think this was an issue at one > point, and one of the statistics things (--stat or --numstat or > --dirstat) gave absolutely horrid values (basically comparing "bytes > changed" for binaries with "lines changed" for text files). Resulting > in totally skewed statistics. Indeed, that's exactly what's going on here. Looking at the other --*stat options: --stat has a special output mode for binary files: Documentation/logo.gif | Bin 0 -> 16335 bytes --numstat refuses to show any meaningful output for binary files: - - Documentation/logo.gif --shortstat skips binary (and unmerged) files altogether. So, how should we count binary files in the diffstat version of --dirstat? Looking at the available data in struct diffstat_file, there's not a lot of "source material" available. If I had easy access to the file pre/post size, and the total number of lines, I could calculate the average number of bytes per line, and then multiply that with the diffstat numbers to get an approximate byte count. A crude fallback would be to use 64 bytes per line... A better solution might be to add a flag to struct diffstat_t indicating that we want byte counts (as opposed to line counts) for text files, and then use that flag from within diffstat_consume() to add "len" instead of "1" to x->added/deleted. ...Johan -- Johan Herland, <johan@xxxxxxxxxxx> www.herland.net -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html