Re: [PATCH] reftable/stack: use geometric table compaction

On Tue, Mar 05, 2024 at 08:03:45PM +0000, Justin Tobler via GitGitGadget wrote:
> From: Justin Tobler <jltobler@xxxxxxxxx>
> 
> To reduce the number of on-disk reftables, compaction is performed.
> Contiguous tables with the same binary log value of size are grouped
> into segments. The segment that has both the lowest binary log value and
> contains more than one table is set as the starting point when
> identifying the compaction segment.
> 
> Since segments containing a single table are not initially considered
> for compaction, if the table appended to the list does not match the
> previous table log value, no compaction occurs for the new table. It is
> therefore possible for unbounded growth of the table list. This can be
> demonstrated by repeating the following sequence:
> 
> git branch -f foo
> git branch -d foo
> 
> Each operation results in a new table being written with no compaction
> occurring until a separate operation produces a table matching the
> previous table log value.
> 
> To avoid unbounded growth of the table list, walk through each table and
> evaluate if it needs to be included in the compaction segment to restore
> a geometric sequence.

I think the description of what exactly changes could use some more
explanation, along with some arguments for why the new behaviour is okay.
It's quite a large rewrite of the compaction logic, so pinpointing exactly
how the old and new approaches differ would go a long way.

> Some tests in `t0610-reftable-basics.sh` assert the on-disk state of
> tables and are therefore updated to specify the correct new table count.
> Since compaction is more aggressive in ensuring tables maintain a
> geometric sequence, the expected table count is reduced in these tests.
> In `reftable/stack_test.c` tests related to `sizes_to_segments()` are
> removed because the function is no longer needed. Also, the
> `test_suggest_compaction_segment()` test is updated to better showcase
> and reflect the new geometric compaction behavior.
> 
> Signed-off-by: Justin Tobler <jltobler@xxxxxxxxx>
> ---
>     reftable/stack: use geometric table compaction
> 
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1683%2Fjltobler%2Fjt%2Freftable-geometric-compaction-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1683/jltobler/jt/reftable-geometric-compaction-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/1683
> 
>  reftable/stack.c           | 106 +++++++++++++++----------------------
>  reftable/stack.h           |   3 --
>  reftable/stack_test.c      |  66 +++++------------------
>  t/t0610-reftable-basics.sh |  24 ++++-----
>  4 files changed, 70 insertions(+), 129 deletions(-)
> 
> diff --git a/reftable/stack.c b/reftable/stack.c
> index b64e55648aa..e4ea8753977 100644
> --- a/reftable/stack.c
> +++ b/reftable/stack.c
> @@ -1214,75 +1214,57 @@ static int segment_size(struct segment *s)
>  	return s->end - s->start;
>  }
>  
> -int fastlog2(uint64_t sz)
> -{
> -	int l = 0;
> -	if (sz == 0)
> -		return 0;
> -	for (; sz; sz /= 2) {
> -		l++;
> -	}
> -	return l - 1;
> -}
> -
> -struct segment *sizes_to_segments(size_t *seglen, uint64_t *sizes, size_t n)
> -{
> -	struct segment *segs = reftable_calloc(n, sizeof(*segs));
> -	struct segment cur = { 0 };
> -	size_t next = 0, i;
> -
> -	if (n == 0) {
> -		*seglen = 0;
> -		return segs;
> -	}
> -	for (i = 0; i < n; i++) {
> -		int log = fastlog2(sizes[i]);
> -		if (cur.log != log && cur.bytes > 0) {
> -			struct segment fresh = {
> -				.start = i,
> -			};
> -
> -			segs[next++] = cur;
> -			cur = fresh;
> -		}
> -
> -		cur.log = log;
> -		cur.end = i + 1;
> -		cur.bytes += sizes[i];
> -	}
> -	segs[next++] = cur;
> -	*seglen = next;
> -	return segs;
> -}
> -
>  struct segment suggest_compaction_segment(uint64_t *sizes, size_t n)
>  {
> -	struct segment min_seg = {
> -		.log = 64,
> -	};
> -	struct segment *segs;
> -	size_t seglen = 0, i;
> -
> -	segs = sizes_to_segments(&seglen, sizes, n);
> -	for (i = 0; i < seglen; i++) {
> -		if (segment_size(&segs[i]) == 1)
> -			continue;
> +	struct segment seg = { 0 };
> +	uint64_t bytes;
> +	size_t i;
>  
> -		if (segs[i].log < min_seg.log)
> -			min_seg = segs[i];
> -	}
> +	/*
> +	 * If there are no tables or only a single one then we don't have to
> +	 * compact anything. The sequence is geometric by definition already.
> +	 */
> +	if (n <= 1)
> +		return seg;
>  
> -	while (min_seg.start > 0) {
> -		size_t prev = min_seg.start - 1;
> -		if (fastlog2(min_seg.bytes) < fastlog2(sizes[prev]))
> +	/*
> +	 * Find the ending table of the compaction segment needed to restore the
> +	 * geometric sequence.
> +	 *
> +	 * To do so, we iterate backwards starting from the most recent table
> +	 * until a valid segment end is found. If the preceding table is smaller
> +	 * than the current table multiplied by the geometric factor (2), the
> +	 * current table is set as the compaction segment end.
> +	 */
> +	for (i = n - 1; i > 0; i--) {
> +		if (sizes[i - 1] < sizes[i] * 2) {
> +			seg.end = i;
> +			bytes = sizes[i];
>  			break;
> +		}
> +	}

I was briefly wondering whether we have to compare the _sum_ of all
preceding table sizes to the next size here. Otherwise it may happen
that compaction leads to a new table that immediately violates the
geometric sequence again.

But I think due to properties of the geometric sequence, the sum of all
entries preceding the current value cannot be greater than the value
itself. So this should be fine.

This might be worth a comment.

> +
> +	/*
> +	 * Find the starting table of the compaction segment by iterating
> +	 * through the remaing tables and keeping track of the accumulated size

s/remaing/remaining/

> +	 * of all tables seen from the segment end table.
> +	 *
> +	 * Note that we keep iterating even after we have found the first
> +	 * first starting point. This is because there may be tables in the

Nit: s/first//, duplicate word.

> +	 * stack preceding that first starting point which violate the geometric
> +	 * sequence.
> +	 */
> +	for (; i > 0; i--) {
> +		uint64_t curr = bytes;
> +		bytes += sizes[i - 1];
>  
> -		min_seg.start = prev;
> -		min_seg.bytes += sizes[prev];
> +		if (sizes[i - 1] < curr * 2) {
> +			seg.start = i - 1;
> +			seg.bytes = bytes;
> +		}
>  	}

Overall I really like the rewritten algorithm, it's a ton easier to
understand compared to the preceding code.

One thing I'd suggest doing though is to provide a benchmark of how the
new compaction strategy compares to the old one. A comparatively easy
way to do this is to write N refs sequentially -- with a big enough N
(e.g. 1 million), compaction time will eventually become an important
factor. So something like the following (untested):

hyperfine \
  --prepare "rm -rf repo && git init --ref-format=reftable repo && git -C repo commit --allow-empty --message msg" \
  'for ((i = 0; i < 1000000; i++ )); do git -C repo update-ref refs/heads/branch-$i HEAD; done'

>  
> -	reftable_free(segs);
> -	return min_seg;
> +	return seg;
>  }
>  
>  static uint64_t *stack_table_sizes_for_compaction(struct reftable_stack *st)
[snip]
> @@ -737,10 +737,10 @@ test_expect_success 'worktree: pack-refs in main repo packs main refs' '
>  	test_commit -C repo A &&
>  	git -C repo worktree add ../worktree &&
>  
> -	test_line_count = 3 repo/.git/worktrees/worktree/reftable/tables.list &&
> -	test_line_count = 4 repo/.git/reftable/tables.list &&
> +	test_line_count = 1 repo/.git/worktrees/worktree/reftable/tables.list &&
> +	test_line_count = 1 repo/.git/reftable/tables.list &&
>  	git -C repo pack-refs &&
> -	test_line_count = 3 repo/.git/worktrees/worktree/reftable/tables.list &&
> +	test_line_count = 1 repo/.git/worktrees/worktree/reftable/tables.list &&
>  	test_line_count = 1 repo/.git/reftable/tables.list
>  '

This test needs updating as git-pack-refs(1) has become a no-op here.

> @@ -750,11 +750,11 @@ test_expect_success 'worktree: pack-refs in worktree packs worktree refs' '
>  	test_commit -C repo A &&
>  	git -C repo worktree add ../worktree &&
>  
> -	test_line_count = 3 repo/.git/worktrees/worktree/reftable/tables.list &&
> -	test_line_count = 4 repo/.git/reftable/tables.list &&
> +	test_line_count = 1 repo/.git/worktrees/worktree/reftable/tables.list &&
> +	test_line_count = 1 repo/.git/reftable/tables.list &&
>  	git -C worktree pack-refs &&
>  	test_line_count = 1 repo/.git/worktrees/worktree/reftable/tables.list &&
> -	test_line_count = 4 repo/.git/reftable/tables.list
> +	test_line_count = 1 repo/.git/reftable/tables.list
>  '

Same.

>  test_expect_success 'worktree: creating shared ref updates main stack' '
> @@ -770,7 +770,7 @@ test_expect_success 'worktree: creating shared ref updates main stack' '
>  
>  	git -C worktree update-ref refs/heads/shared HEAD &&
>  	test_line_count = 1 repo/.git/worktrees/worktree/reftable/tables.list &&
> -	test_line_count = 2 repo/.git/reftable/tables.list
> +	test_line_count = 1 repo/.git/reftable/tables.list
>  '

Same.

One thing missing is a test that demonstrates the previously-broken
behaviour.

Patrick

>  test_expect_success 'worktree: creating per-worktree ref updates worktree stack' '
> 
> base-commit: b387623c12f3f4a376e4d35a610fd3e55d7ea907
> -- 
> gitgitgadget
> 


