Re: [PATCH 6/6] New --dirstat=lines mode, doing dirstat analysis based on diffstat

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Johan Herland <johan@xxxxxxxxxxx> writes:

> This patch adds an alternative implementation of show_dirstat(), called
> show_dirstat_by_line(), which uses the more expensive diffstat analysis
> (as opposed to show_dirstat()'s own (relatively inexpensive) analysis)
> to derive the numbers from which the --dirstat output is computed.
>
> The alternative implementation is controlled by the new "lines" argument
> to the --dirstat option (or the diff.dirstat config variable).
>
> In linux-2.6.git, running the three different --dirstat modes:
>
>   time git diff v2.6.20..v2.6.30 --dirstat=changes > /dev/null
> vs.
>   time git diff v2.6.20..v2.6.30 --dirstat=lines > /dev/null
> vs.
>   time git diff v2.6.20..v2.6.30 --dirstat=files > /dev/null
>
> yields the following average runtimes on my machine:
>
> - "changes" (default): ~6.0 s
> - "lines":             ~9.6 s
> - "files":             ~0.1 s
>
> So, as expected, there's a considerable performance hit (~60%) by going
> through the full diffstat analysis as compared to the default "changes"
> analysis (obviously, "files" is much faster than both). As such, the
> "lines" mode is probably only useful if you really need the --dirstat
> numbers to be consistent with the numbers returned from the other
> --*stat options.
>
> The patch also includes documentation and tests for the new dirstat mode.

It needs to document and also mention in the proposed commit log message
how binary files are accounted for.

> @@ -1677,6 +1684,48 @@ found_damage:
>  	gather_dirstat(options, &dir, changed, "", 0);
>  }
>  
> +static void show_dirstat_by_line(struct diffstat_t *data, struct diff_options *options)
> +{
> +	int i;
> +	unsigned long changed;
> +	struct dirstat_dir dir;
> +
> +	if (data->nr == 0)
> +		return;
> +
> +	dir.files = NULL;
> +	dir.alloc = 0;
> +	dir.nr = 0;
> +	dir.percent = options->dirstat_percent;
> +	dir.cumulative = DIFF_OPT_TST(options, DIRSTAT_CUMULATIVE);
> +
> +	changed = 0;
> +	for (i = 0; i < data->nr; i++) {
> +		struct diffstat_file *file = data->files[i];
> +		unsigned long damage = file->added + file->deleted;
> +		if (damage && file->is_binary)
> +			/*
> +			 * binary files counts bytes, not lines. Must find some
> +			 * way to normalize binary bytes vs. textual lines.
> +			 * The following heuristic is cheap, but beyond ugly...
> +			 */
> +			damage = damage < 52 ? 1 : damage / 52;

If 52 is just as good as any number around 50-70 range, I would prefer to
see 64, just because I am superstitious and dividing by a power of two
feels nicer.

> +cat <<EOF >expect_diff_dirstat_CC
> +  16.7% changed/
> +  16.7% dst/copy/changed/
> +  16.7% dst/copy/rearranged/
> +  16.7% dst/move/changed/
> +  16.7% dst/move/rearranged/
> +  16.7% rearranged/
> +EOF

I really wish you can come up with a way to express expected results in
much less strict way in the test vector (not limited to the test vectors
for this patch but for the entire series).  The underlying count-damages
(for the purpose of rename detection) implementation may improve over time
and the textual diff generation may too.  Here what we want to preserve is
that these six entries show more-or-less the same amount of contribution,
not precisely 16.666666% each.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]