Re: [PATCH v3 2/3] reftable/stack: use geometric table compaction

On Fri, Mar 29, 2024 at 04:16:48AM +0000, Justin Tobler via GitGitGadget wrote:
> From: Justin Tobler <jltobler@xxxxxxxxx>
> 
> To reduce the number of on-disk reftables, compaction is performed.
> Contiguous tables whose sizes have the same binary logarithm are
> grouped into segments. The segment with the lowest binary log value
> that contains more than one table is used as the starting point when
> identifying the compaction segment.
> 
> Since segments containing only a single table are not considered for
> compaction, a newly appended table whose log value does not match
> that of the previous table triggers no compaction. The table list can
> therefore grow without bound. This can be demonstrated by repeating
> the following sequence:
> 
> git branch -f foo
> git branch -d foo
> 
> Each operation writes a new table, and no compaction occurs until a
> separate operation produces a table whose log value matches that of
> the previous table.
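
Not part of your patch, but in case it helps other readers: the growth
is easy to observe by wrapping the above sequence in a loop. A minimal
sketch, assuming a Git with reftable support (e.g. a repository
created via `git init --ref-format=reftable`):

    git init --ref-format=reftable repo &&
    cd repo &&
    git commit --allow-empty -m base &&
    for i in $(seq 50)
    do
        # Each branch update and deletion writes a new table.
        git branch -f foo &&
        git branch -d foo
    done &&
    # One line per table; this keeps growing before the fix.
    wc -l .git/reftable/tables.list
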
> 
> Instead, to avoid unbounded growth of the table list, the compaction
> strategy is updated to ensure that the tables form a geometric
> sequence after each operation by evaluating each table individually
> in reverse index order. This results in a much simpler and more
> robust algorithm than the previous one while also maintaining a
> minimal, ordered set of tables on disk.
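
This also gives readers a handy invariant to check: with the new
strategy, each table in "tables.list" (oldest first) should be at
least twice the size of the table that follows it. A rough sketch for
eyeballing the sizes in stack order, assuming a reftable repository:

    while read -r table
    do
        # Print each table's size in bytes, oldest table first.
        wc -c <".git/reftable/$table"
    done <.git/reftable/tables.list
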
> 
> When creating 10,000 references, the new strategy shows no
> performance impact:
> 
> Benchmark 1: update-ref: create refs sequentially (revision = HEAD~)
>   Time (mean ± σ):     26.516 s ±  0.047 s    [User: 17.864 s, System: 8.491 s]
>   Range (min … max):   26.447 s … 26.569 s    10 runs
> 
> Benchmark 2: update-ref: create refs sequentially (revision = HEAD)
>   Time (mean ± σ):     26.417 s ±  0.028 s    [User: 17.738 s, System: 8.500 s]
>   Range (min … max):   26.366 s … 26.444 s    10 runs
> 
> Summary
>   update-ref: create refs sequentially (revision = HEAD) ran
>     1.00 ± 0.00 times faster than update-ref: create refs sequentially (revision = HEAD~)
> 
> Some tests in `t0610-reftable-basics.sh` assert the on-disk state of
> tables and are therefore updated to specify the correct new table
> count. Since compaction is now more aggressive about keeping the
> tables in a geometric sequence, the expected table counts decrease in
> these tests. In `reftable/stack_test.c`, tests related to
> `sizes_to_segments()` are removed because the function is no longer
> needed. Also, the `test_suggest_compaction_segment()` test is updated
> to better reflect the new geometric compaction behavior.
> 
> Signed-off-by: Justin Tobler <jltobler@xxxxxxxxx>
> ---
>  reftable/stack.c           | 120 ++++++++++++++++++-------------------
>  reftable/stack.h           |   3 -
>  reftable/stack_test.c      |  67 +++++----------------
>  t/t0610-reftable-basics.sh |  45 +++++++++-----
>  4 files changed, 103 insertions(+), 132 deletions(-)
> 
> diff --git a/reftable/stack.c b/reftable/stack.c
> index 07262beaaf7..e7b9a1de5a4 100644
> --- a/reftable/stack.c
> +++ b/reftable/stack.c
> @@ -1202,75 +1202,73 @@ static int segment_size(struct segment *s)
>  	return s->end - s->start;
>  }
>  
> -int fastlog2(uint64_t sz)
> -{
> -	int l = 0;
> -	if (sz == 0)
> -		return 0;
> -	for (; sz; sz /= 2) {
> -		l++;
> -	}
> -	return l - 1;
> -}
> -
> -struct segment *sizes_to_segments(size_t *seglen, uint64_t *sizes, size_t n)
> -{
> -	struct segment *segs = reftable_calloc(n, sizeof(*segs));
> -	struct segment cur = { 0 };
> -	size_t next = 0, i;
> -
> -	if (n == 0) {
> -		*seglen = 0;
> -		return segs;
> -	}
> -	for (i = 0; i < n; i++) {
> -		int log = fastlog2(sizes[i]);
> -		if (cur.log != log && cur.bytes > 0) {
> -			struct segment fresh = {
> -				.start = i,
> -			};
> -
> -			segs[next++] = cur;
> -			cur = fresh;
> -		}
> -
> -		cur.log = log;
> -		cur.end = i + 1;
> -		cur.bytes += sizes[i];
> -	}
> -	segs[next++] = cur;
> -	*seglen = next;
> -	return segs;
> -}
> -
>  struct segment suggest_compaction_segment(uint64_t *sizes, size_t n)
>  {
> -	struct segment min_seg = {
> -		.log = 64,
> -	};
> -	struct segment *segs;
> -	size_t seglen = 0, i;
> -
> -	segs = sizes_to_segments(&seglen, sizes, n);
> -	for (i = 0; i < seglen; i++) {
> -		if (segment_size(&segs[i]) == 1)
> -			continue;
> +	struct segment seg = { 0 };
> +	uint64_t bytes;
> +	size_t i;
>  
> -		if (segs[i].log < min_seg.log)
> -			min_seg = segs[i];
> -	}
> +	/*
> +	 * If there are no tables or only a single one then we don't have to
> +	 * compact anything. The sequence is geometric by definition already.
> +	 */
> +	if (n <= 1)
> +		return seg;
>  
> -	while (min_seg.start > 0) {
> -		size_t prev = min_seg.start - 1;
> -		if (fastlog2(min_seg.bytes) < fastlog2(sizes[prev]))
> +	/*
> +	 * Find the ending table of the compaction segment needed to restore the
> +	 * geometric sequence.
> +	 *
> +	 * To do so, we iterate backwards starting from the most recent table
> +	 * until a valid segment end is found. If the preceding table is smaller
> +	 * than the current table multiplied by the geometric factor (2), the
> +	 * current table is set as the compaction segment end.
> +	 *
> +	 * Tables after the ending point are not added to the byte count because
> +	 * they are already valid members of the geometric sequence. Due to the
> +	 * properties of a geometric sequence, it is not possible for the sum of
> +	 * these tables to exceed the value of the ending point table.
> +	 *
> +	 * Example table size sequence requiring no compaction:
> +	 * 	64, 32, 16, 8, 4, 2, 1
> +	 *
> +	 * Example compaction segment end set to table with size 3:
> +	 * 	64, 32, 16, 8, 4, 3, 1
> +	 */
> +	for (i = n - 1; i > 0; i--) {
> +		if (sizes[i - 1] < sizes[i] * 2) {
> +			seg.end = i + 1;
> +			bytes = sizes[i];
>  			break;
> +		}
> +	}
>  
> -		min_seg.start = prev;
> -		min_seg.bytes += sizes[prev];
> +	/*
> +	 * Find the starting table of the compaction segment by iterating
> +	 * through the remaining tables and keeping track of the accumulated
> +	 * size of all tables seen from the segment end table. The previous
> +	 * table is compared to the accumulated size because the tables from the
> +	 * segment end are merged backwards recursively.
> +	 *
> +	 * Note that we keep iterating even after we have found the first
> +	 * starting point. This is because there may be tables in the stack
> +	 * preceding that first starting point which violate the geometric
> +	 * sequence.
> +	 *
> +	 * Example compaction segment start set to table with size 32:
> +	 * 	128, 32, 16, 8, 4, 3, 1
> +	 */
> +	for (; i > 0; i--) {
> +		uint64_t curr = bytes;
> +		bytes += sizes[i - 1];
> +
> +		if (sizes[i - 1] < curr * 2) {
> +			seg.start = i - 1;
> +			seg.bytes = bytes;
> +		}
>  	}
>  
> -	reftable_free(segs);
> -	return min_seg;
> +	return seg;
>  }
>  
>  static uint64_t *stack_table_sizes_for_compaction(struct reftable_stack *st)
> diff --git a/reftable/stack.h b/reftable/stack.h
> index d919455669e..656f896cc28 100644
> --- a/reftable/stack.h
> +++ b/reftable/stack.h
> @@ -33,12 +33,9 @@ int read_lines(const char *filename, char ***lines);
>  
>  struct segment {
>  	size_t start, end;
> -	int log;
>  	uint64_t bytes;
>  };
>  
> -int fastlog2(uint64_t sz);
> -struct segment *sizes_to_segments(size_t *seglen, uint64_t *sizes, size_t n);
>  struct segment suggest_compaction_segment(uint64_t *sizes, size_t n);
>  
>  #endif
> diff --git a/reftable/stack_test.c b/reftable/stack_test.c
> index 7336757cf53..21541742fe5 100644
> --- a/reftable/stack_test.c
> +++ b/reftable/stack_test.c
> @@ -717,59 +717,13 @@ static void test_reftable_stack_hash_id(void)
>  	clear_dir(dir);
>  }
>  
> -static void test_log2(void)
> -{
> -	EXPECT(1 == fastlog2(3));
> -	EXPECT(2 == fastlog2(4));
> -	EXPECT(2 == fastlog2(5));
> -}
> -
> -static void test_sizes_to_segments(void)
> -{
> -	uint64_t sizes[] = { 2, 3, 4, 5, 7, 9 };
> -	/* .................0  1  2  3  4  5 */
> -
> -	size_t seglen = 0;
> -	struct segment *segs =
> -		sizes_to_segments(&seglen, sizes, ARRAY_SIZE(sizes));
> -	EXPECT(segs[2].log == 3);
> -	EXPECT(segs[2].start == 5);
> -	EXPECT(segs[2].end == 6);
> -
> -	EXPECT(segs[1].log == 2);
> -	EXPECT(segs[1].start == 2);
> -	EXPECT(segs[1].end == 5);
> -	reftable_free(segs);
> -}
> -
> -static void test_sizes_to_segments_empty(void)
> -{
> -	size_t seglen = 0;
> -	struct segment *segs = sizes_to_segments(&seglen, NULL, 0);
> -	EXPECT(seglen == 0);
> -	reftable_free(segs);
> -}
> -
> -static void test_sizes_to_segments_all_equal(void)
> -{
> -	uint64_t sizes[] = { 5, 5 };
> -	size_t seglen = 0;
> -	struct segment *segs =
> -		sizes_to_segments(&seglen, sizes, ARRAY_SIZE(sizes));
> -	EXPECT(seglen == 1);
> -	EXPECT(segs[0].start == 0);
> -	EXPECT(segs[0].end == 2);
> -	reftable_free(segs);
> -}
> -
>  static void test_suggest_compaction_segment(void)
>  {
> -	uint64_t sizes[] = { 128, 64, 17, 16, 9, 9, 9, 16, 16 };
> -	/* .................0    1    2  3   4  5  6 */
> +	uint64_t sizes[] = { 512, 64, 17, 16, 9, 9, 9, 16, 2, 16 };
>  	struct segment min =
>  		suggest_compaction_segment(sizes, ARRAY_SIZE(sizes));
> -	EXPECT(min.start == 2);
> -	EXPECT(min.end == 7);
> +	EXPECT(min.start == 1);
> +	EXPECT(min.end == 10);
>  }
>  
>  static void test_suggest_compaction_segment_nothing(void)
> @@ -880,6 +834,17 @@ static void test_empty_add(void)
>  	reftable_stack_destroy(st2);
>  }
>  
> +static int fastlog2(uint64_t sz)
> +{
> +	int l = 0;
> +	if (sz == 0)
> +		return 0;
> +	for (; sz; sz /= 2) {
> +		l++;
> +	}

Nit: we could drop the curly braces while at it.

> +	return l - 1;
> +}
> +
>  static void test_reftable_stack_auto_compaction(void)
>  {
>  	struct reftable_write_options cfg = { 0 };
> @@ -1068,7 +1033,6 @@ static void test_reftable_stack_compaction_concurrent_clean(void)
>  int stack_test_main(int argc, const char *argv[])
>  {
>  	RUN_TEST(test_empty_add);
> -	RUN_TEST(test_log2);
>  	RUN_TEST(test_names_equal);
>  	RUN_TEST(test_parse_names);
>  	RUN_TEST(test_read_file);
> @@ -1088,9 +1052,6 @@ int stack_test_main(int argc, const char *argv[])
>  	RUN_TEST(test_reftable_stack_update_index_check);
>  	RUN_TEST(test_reftable_stack_uptodate);
>  	RUN_TEST(test_reftable_stack_validate_refname);
> -	RUN_TEST(test_sizes_to_segments);
> -	RUN_TEST(test_sizes_to_segments_all_equal);
> -	RUN_TEST(test_sizes_to_segments_empty);
>  	RUN_TEST(test_suggest_compaction_segment);
>  	RUN_TEST(test_suggest_compaction_segment_nothing);
>  	return 0;
> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
> index 434044078ed..b95626549e7 100755
> --- a/t/t0610-reftable-basics.sh
> +++ b/t/t0610-reftable-basics.sh
> @@ -293,7 +293,7 @@ test_expect_success 'ref transaction: writes cause auto-compaction' '
>  	test_line_count = 1 repo/.git/reftable/tables.list &&
>  
>  	test_commit -C repo --no-tag A &&
> -	test_line_count = 2 repo/.git/reftable/tables.list &&
> +	test_line_count = 1 repo/.git/reftable/tables.list &&
>  
>  	test_commit -C repo --no-tag B &&
>  	test_line_count = 1 repo/.git/reftable/tables.list
> @@ -308,12 +308,24 @@ test_expect_success 'ref transaction: environment variable disables auto-compact
>  	do
>  		GIT_TEST_REFTABLE_NO_AUTOCOMPACTION=true git -C repo update-ref branch-$i HEAD || return 1
>  	done &&
> -	test_line_count = 23 repo/.git/reftable/tables.list &&
> +	test_line_count = 22 repo/.git/reftable/tables.list &&
>  
>  	git -C repo update-ref foo HEAD &&
>  	test_line_count = 1 repo/.git/reftable/tables.list
>  '
>  
> +test_expect_success 'ref transaction: alternating table sizes are compacted' '
> +	test_when_finished "rm -rf repo" &&
> +	git init repo &&
> +	test_commit -C repo A &&
> +	for i in $(test_seq 20)

Nit: we could reduce the number of iterations here so that we don't have
to spawn 40 Git commands. Using something like 10 or even 5 iterations
should be sufficient to demonstrate the problem, no?
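
I.e. something like the following should still demonstrate the problem
while spawning far fewer processes (untested, and keeping the rest of
the loop body as in your patch):

    for i in $(test_seq 5)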

Patrick
