Re: [PATCH 6/6] New --dirstat=lines mode, doing dirstat analysis based on diffstat

Johan Herland <johan@xxxxxxxxxxx> · Wed, 27 Apr 2011 04:02:47 +0200

On Tuesday 26 April 2011, Junio C Hamano wrote:
> Johan Herland <johan@xxxxxxxxxxx> writes:
>
> [...]
> It needs to document and also mention in the proposed commit log message
> how binary files are accounted for.

Will do.

> [...]
> > +		if (damage && file->is_binary)
> > +			/*
> > +			 * binary files counts bytes, not lines. Must find some
> > +			 * way to normalize binary bytes vs. textual lines.
> > +			 * The following heuristic is cheap, but beyond ugly...
> > +			 */
> > +			damage = damage < 52 ? 1 : damage / 52;
> 
> If 52 is just as good as any number around 50-70 range, I would prefer to
> see 64, just because I am superstitious and dividing by a power of two
> feels nicer.

Will do.

> > +cat <<EOF >expect_diff_dirstat_CC
> > +  16.7% changed/
> > +  16.7% dst/copy/changed/
> > +  16.7% dst/copy/rearranged/
> > +  16.7% dst/move/changed/
> > +  16.7% dst/move/rearranged/
> > +  16.7% rearranged/
> > +EOF
> 
> I really wish you can come up with a way to express expected results in
> much less strict way in the test vector (not limited to the test vectors
> for this patch but for the entire series).  The underlying count-damages
> (for the purpose of rename detection) implementation may improve over
> time and the textual diff generation may too.  Here what we want to
> preserve is that these six entries show more-or-less the same amount of
> contribution, not precisely 16.666666% each.

Yeah, that does make sense, although I haven't yet thought of a good way to 
do this without losing important details. I thought of assigning letters to 
each percentage value, so that instead of

	cat <<EOF >expect_diff_dirstat_M
	   5.3% changed/
	  26.3% dst/copy/changed/
	  26.3% dst/copy/rearranged/
	  26.3% dst/copy/unchanged/
	   5.3% dst/move/changed/
	   5.3% dst/move/rearranged/
	   5.3% rearranged/
	EOF

	cat <<EOF >expect_diff_dirstat_CC
	  16.7% changed/
	  16.7% dst/copy/changed/
	  16.7% dst/copy/rearranged/
	  16.7% dst/move/changed/
	  16.7% dst/move/rearranged/
	  16.7% rearranged/
	EOF

you'd have

	cat <<EOF >expect_diff_dirstat_M
	     A% changed/
	     B% dst/copy/changed/
	     B% dst/copy/rearranged/
	     B% dst/copy/unchanged/
	     A% dst/move/changed/
	     A% dst/move/rearranged/
	     A% rearranged/
	EOF

	cat <<EOF >expect_diff_dirstat_CC
	     A% changed/
	     A% dst/copy/changed/
	     A% dst/copy/rearranged/
	     A% dst/move/changed/
	     A% dst/move/rearranged/
	     A% rearranged/
	EOF

but that would lose too much detail in many cases...

Also, I partly like the fact that we test the output strictly. That way, 
when someone _does_ change the underlying details, they get an immediate 
test failure telling them exactly _how_ the numbers changed. They can then 
use that to verify that their changes produce the expected results, before 
finally updating the test in accordance with their new numbers.

...Johan

-- 
Johan Herland, <johan@xxxxxxxxxxx>
www.herland.net
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html