Re: [PATCH v6 10/13] convert: generate large test files only once

Jeff King <peff@xxxxxxxx> · Tue, 30 Aug 2016 12:37:49 -0400

On Tue, Aug 30, 2016 at 01:41:59PM +0200, Lars Schneider wrote:

> >> +	git checkout -- test test.t test.i &&
> >> +
> >> +	mkdir generated-test-data &&
> >> +	for i in $(test_seq 1 $T0021_LARGE_FILE_SIZE)
> >> +	do
> >> +		RANDOM_STRING="$(test-genrandom end $i | tr -dc "A-Za-z0-9" )"
> >> +		ROT_RANDOM_STRING="$(echo $RANDOM_STRING | ./rot13.sh )"
> > 
> > In earlier iteration of loop with lower $i, what guarantees that
> > some bytes survive "tr -dc"?
> 
> Nothing really, good catch! The seed "end" produces as first character always a 
> "S" which would survive "tr -dc". However, that is clunky. I will always set "1"
> as first character in $RANDOM_STRING to mitigate the problem.

It seems odd that you would generate a larger set of random bytes and
then throw most of them away (about 1 in 5, on average). So you don't
actually know how long your inputs are, and you're wasting time
generating bytes which are discarded.

The goal looks like it is just to clean up the string to only-ASCII
characters. Perhaps converting to to base64 or hex would be conceptually
simpler? Like:

  test-genrandom ... |
  perl -pe 's/./hex(ord($&))/sge'

> > I do not quite get the point of this complexity.  You are using
> > exactly the same seed "end" every time, so in the first round you
> > have 1M of SP, letter '1', letter 'S' (from the genrandom), then
> > in the second round you have 1M of SP, letter '1', letter 'S' and
> > letter 'p' (the last two from the genrandom), and go on.  Is it
> > significant for the purpose of your test that the cruft inserted
> > between the repetition of 1M of SP gets longer by one byte but they
> > all share the same prefix (e.g. "1S", "1Sp", "1SpZ", "1SpZT",
> > ... are what you insert between a large run of spaces)?
> 
> The pktline packets have a constant size. If the cruft between 1M of SP 
> has a constant size as well then the generated packets for the test data
> would repeat themselves. That's why I increased the length after every 1M
> of SP.
> 
> However, I realized that this test complexity is not necessary. I'll
> simplify it in the next round.

I was also confused by this, and wondered if other patterns (perhaps
using a single longer genrandom) might be applicable. Simplification (or
explanation in comments about what properties the content _needs_ to
have) would be welcome. :)

-Peff