Re: [PATCH v2 9/9] reftable/stack: handle locked tables during auto-compaction

Justin Tobler <jltobler@xxxxxxxxx> · Tue, 6 Aug 2024 13:46:35 -0500

On 24/08/05 03:08PM, Patrick Steinhardt wrote:
> When compacting tables, it may happen that we want to compact a set of
> tables which are already locked by a concurrent process that compacts
> them. In the case where we wanted to perform a full compaction of all
> tables it is sensible to bail out in this case, as we cannot fulfill the
> requested action.
> 
> But when performing auto-compaction it isn't necessarily in our best
> interest of us to abort the whole operation. For example, due to the
> geometric compacting schema that we use, it may be that process A takes
> a lot of time to compact the bulk of all tables whereas process B
> appends a bunch of new tables to the stack. B would in this case also
> notice that it has to compact the tables that process A is compacting
> already and thus also try to compact the same range, probably including
> the new tables it has appended. But because those tables are locked
> already, it will fail and thus abort the complete auto-compaction. The
> consequence is that the stack will grow longer and longer while A isn't
> yet done with compaction, which will lead to a growing performance
> impact.

With this change, a concurrent compaction may create tables that violate
the geometric sequence post-compaction, but is already possible because
compaction doesn't take into account newly appended tables. It makes
sense to compact available tables to minimize performance impact.

> Instead of aborting auto-compaction altogether, let's gracefully handle
> this situation by instead compacting tables which aren't locked. To do
> so, instead of locking from the beginning of the slice-to-be-compacted,
> we start locking tables from the end of the slice. Once we hit the first
> table that is locked already, we abort. If we succeded to lock two or
> more tables, then we simply reduce the slice of tables that we're about
> to compact to those which we managed to lock.
> 
> This ensures that we can at least make some progress for compaction in
> said scenario. It also helps in other scenarios, like for example when a
> process died and left a stale lockfile behind. In such a case we can at
> least ensure some compaction on a best-effort basis.

At first I wondered best-effort compaction could further mask a stale
lockfile issue, but auto-compaction already attempted compaction gently
and did not report issues to begin with. So this is just a win. :)

> 
> Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
> ---
>  reftable/stack.c           | 59 +++++++++++++++++++++++++++++++-------
>  reftable/stack_test.c      | 12 ++++----
>  t/t0610-reftable-basics.sh | 21 +++++++++-----
>  3 files changed, 70 insertions(+), 22 deletions(-)
> 
> diff --git a/reftable/stack.c b/reftable/stack.c
> index 51eb4470c7..b0721640e4 100644
> --- a/reftable/stack.c
> +++ b/reftable/stack.c
> @@ -999,6 +999,15 @@ static int stack_write_compact(struct reftable_stack *st,
>  	return err;
>  }
>  
> +enum stack_compact_range_flags {
> +	/*
> +	 * Perform a best-effort compaction. That is, even if we cannot lock
> +	 * all tables in the specified range, we will try to compact the
> +	 * remaining slice.
> +	 */
> +	STACK_COMPACT_RANGE_BEST_EFFORT = (1 << 0),
> +};
> +
>  /*
>   * Compact all tables in the range `[first, last)` into a single new table.
>   *
> @@ -1010,7 +1019,8 @@ static int stack_write_compact(struct reftable_stack *st,
>   */
>  static int stack_compact_range(struct reftable_stack *st,
>  			       size_t first, size_t last,
> -			       struct reftable_log_expiry_config *expiry)
> +			       struct reftable_log_expiry_config *expiry,
> +			       unsigned int flags)
>  {
>  	struct strbuf tables_list_buf = STRBUF_INIT;
>  	struct strbuf new_table_name = STRBUF_INIT;
> @@ -1052,19 +1062,47 @@ static int stack_compact_range(struct reftable_stack *st,
>  	/*
>  	 * Lock all tables in the user-provided range. This is the slice of our
>  	 * stack which we'll compact.
> +	 *
> +	 * Note that we lock tables in reverse order from last to first. The
> +	 * intent behind this is to allow a newer process to perform best
> +	 * effort compaction of tables that it has added in the case where an
> +	 * older process is still busy compacting tables which are preexisting
> +	 * from the point of view of the newer process.
>  	 */

Now that concurrent compaction of discrete table segments is possible,
locking the tables in reverse order allows a compaction process to
immediately mark the ending table of the segment and thus avoids
unnecessary competition between other concurrent compaction processes.

>  	REFTABLE_CALLOC_ARRAY(table_locks, last - first + 1);
> -	for (i = first; i <= last; i++) {
> -		stack_filename(&table_name, st, reader_name(st->readers[i]));
> +	for (i = last + 1; i > first; i--) {
> +		stack_filename(&table_name, st, reader_name(st->readers[i - 1]));

I might be missing something, but why not set `i = last` and `i >=
first`? It looks like everywhere we reference `i` we subtract one
anyways. Since `last` is already at the starting index, it seems it
would be a bit more straightforward.

>  
>  		err = hold_lock_file_for_update(&table_locks[nlocks],
>  						table_name.buf, LOCK_NO_DEREF);
>  		if (err < 0) {
> -			if (errno == EEXIST)
> +			/*
> +			 * When the table is locked already we may do a
> +			 * best-effort compaction and compact only the tables
> +			 * that we have managed to lock so far. This of course
> +			 * requires that we have been able to lock at least two
> +			 * tables, otherwise there would be nothing to compact.
> +			 * In that case, we return a lock error to our caller.
> +			 */
> +			if (errno == EEXIST && last - (i - 1) >= 2 &&
> +			    flags & STACK_COMPACT_RANGE_BEST_EFFORT) {
> +				err = 0;
> +				/*
> +				 * The subtraction is to offset the index, the
> +				 * addition is to only compact up to the table
> +				 * of the preceding iteration. They obviously
> +				 * cancel each other out, but that may be
> +				 * non-obvious when it was omitted.
> +				 */
> +				first = (i - 1) + 1;

I think we could solve potential confusion here by changing how `i` is
initially set also.

> +				break;
> +			} else if (errno == EEXIST) {
>  				err = REFTABLE_LOCK_ERROR;
> -			else
> +				goto done;
> +			} else {
>  				err = REFTABLE_IO_ERROR;
> -			goto done;
> +				goto done;
> +			}
>  		}
>  
>  		/*
> @@ -1308,9 +1346,10 @@ static int stack_compact_range(struct reftable_stack *st,
>  
>  static int stack_compact_range_stats(struct reftable_stack *st,
>  				     size_t first, size_t last,
> -				     struct reftable_log_expiry_config *config)
> +				     struct reftable_log_expiry_config *config,
> +				     unsigned int flags)
>  {
> -	int err = stack_compact_range(st, first, last, config);
> +	int err = stack_compact_range(st, first, last, config, flags);
>  	if (err == REFTABLE_LOCK_ERROR)
>  		st->stats.failures++;
>  	return err;
> @@ -1320,7 +1359,7 @@ int reftable_stack_compact_all(struct reftable_stack *st,
>  			       struct reftable_log_expiry_config *config)
>  {
>  	size_t last = st->merged->stack_len ? st->merged->stack_len - 1 : 0;
> -	return stack_compact_range_stats(st, 0, last, config);
> +	return stack_compact_range_stats(st, 0, last, config, 0);
>  }
>  
>  static int segment_size(struct segment *s)
> @@ -1427,7 +1466,7 @@ int reftable_stack_auto_compact(struct reftable_stack *st)
>  	reftable_free(sizes);
>  	if (segment_size(&seg) > 0)
>  		return stack_compact_range_stats(st, seg.start, seg.end - 1,
> -						 NULL);
> +						 NULL, STACK_COMPACT_RANGE_BEST_EFFORT);
>  
>  	return 0;
>  }
> diff --git a/reftable/stack_test.c b/reftable/stack_test.c
> index 2b1eb83934..733cf6113e 100644
> --- a/reftable/stack_test.c
> +++ b/reftable/stack_test.c
> @@ -913,13 +913,15 @@ static void test_reftable_stack_auto_compaction_with_locked_tables(void)
>  	write_file_buf(buf.buf, "", 0);
>  
>  	/*
> -	 * Ideally, we'd handle the situation where any of the tables is locked
> -	 * gracefully. We don't (yet) do this though and thus fail.
> +	 * When parts of the stack are locked, then auto-compaction does a best
> +	 * effort compaction of those tables which aren't locked. So while this
> +	 * would in theory compact all tables, due to the preexisting lock we
> +	 * only compact the newest two tables.
>  	 */
>  	err = reftable_stack_auto_compact(st);
> -	EXPECT(err == REFTABLE_LOCK_ERROR);
> -	EXPECT(st->stats.failures == 1);
> -	EXPECT(st->merged->stack_len == 5);
> +	EXPECT_ERR(err);
> +	EXPECT(st->stats.failures == 0);
> +	EXPECT(st->merged->stack_len == 4);
>  
>  	reftable_stack_destroy(st);
>  	strbuf_release(&buf);
> diff --git a/t/t0610-reftable-basics.sh b/t/t0610-reftable-basics.sh
> index b06c46999d..37510c2b2a 100755
> --- a/t/t0610-reftable-basics.sh
> +++ b/t/t0610-reftable-basics.sh
> @@ -478,19 +478,26 @@ test_expect_success "$command: auto compaction" '
>  
>  		test_oid blob17_2 | git hash-object -w --stdin &&
>  
> -		# Lock all tables write some refs. Auto-compaction will be
> -		# unable to compact tables and thus fails gracefully, leaving
> -		# the stack in a sub-optimal state.
> -		ls .git/reftable/*.ref |
> +		# Lock all tables, write some refs. Auto-compaction will be
> +		# unable to compact tables and thus fails gracefully,
> +		# compacting only those tables which are not locked.
> +		ls .git/reftable/*.ref | sort |
>  		while read table
>  		do
> -			touch "$table.lock" || exit 1
> +			touch "$table.lock" &&
> +			basename "$table" >>tables.expect || exit 1
>  		done &&
> +		test_line_count = 2 .git/reftable/tables.list &&
>  		git branch B &&
>  		git branch C &&
> -		rm .git/reftable/*.lock &&
> -		test_line_count = 4 .git/reftable/tables.list &&
>  
> +		# The new tables are auto-compacted, but the locked tables are
> +		# left intact.
> +		test_line_count = 3 .git/reftable/tables.list &&
> +		head -n 2 .git/reftable/tables.list >tables.head &&
> +		test_cmp tables.expect tables.head &&
> +
> +		rm .git/reftable/*.lock &&
>  		git $command --auto &&
>  		test_line_count = 1 .git/reftable/tables.list
>  	)
> -- 
> 2.46.0.dirty
>