Re: [PATCH 03/12] reftable/merged: advance subiter on subsequent iteration

Justin Tobler <jltobler@xxxxxxxxx> · Tue, 20 Feb 2024 12:25:10 -0600

On 24/02/14 08:45AM, Patrick Steinhardt wrote:
> When advancing the merged iterator, we pop the top-most entry from its

s/top-most/topmost

> priority queue and then advance the sub-iterator that the entry belongs
> to, adding the result as a new entry. This is quite sensible in the case
> where the merged iterator is used to actual iterate through records. But

s/actual/actually

> the merged iterator is also used when we look up a single record, only,
> so advancing the sub-iterator is wasted effort because we would never
> even look at the result.
> 
> Instead of immediately advancing the sub-iterator, we can also defer
> this to the next iteration of the merged iterator by storing the
> intent-to-advance. This results in a small speedup when reading many
> records. The following benchmark creates 10000 refs, which will also end
> up with many ref lookups:
> 
>     Benchmark 1: update-ref: create many refs (revision = HEAD~)
>       Time (mean ± σ):     337.2 ms ±   7.3 ms    [User: 200.1 ms, System: 136.9 ms]
>       Range (min … max):   329.3 ms … 373.2 ms    100 runs
> 
>     Benchmark 2: update-ref: create many refs (revision = HEAD)
>       Time (mean ± σ):     332.5 ms ±   5.9 ms    [User: 197.2 ms, System: 135.1 ms]
>       Range (min … max):   327.6 ms … 359.8 ms    100 runs
> 
>     Summary
>       update-ref: create many refs (revision = HEAD) ran
>         1.01 ± 0.03 times faster than update-ref: create many refs (revision = HEAD~)
> 
> While this speedup alone isn't really worth it, this refactoring will
> also allow two additional optimizations in subsequent patches. First, it
> will allow us to special-case when there is only a single sub-iter left
> to circumvent the priority queue altogether. And second, it makes it
> easier to avoid copying records to the caller.
> 
> Signed-off-by: Patrick Steinhardt <ps@xxxxxx>
> ---
>  reftable/merged.c | 26 ++++++++++++--------------
>  1 file changed, 12 insertions(+), 14 deletions(-)
> 
> diff --git a/reftable/merged.c b/reftable/merged.c
> index 12ebd732e8..9b1ccfff00 100644
> --- a/reftable/merged.c
> +++ b/reftable/merged.c
> @@ -19,11 +19,12 @@ license that can be found in the LICENSE file or at
>  
>  struct merged_iter {
>  	struct reftable_iterator *stack;
> +	struct merged_iter_pqueue pq;
>  	uint32_t hash_id;
>  	size_t stack_len;
>  	uint8_t typ;
>  	int suppress_deletions;
> -	struct merged_iter_pqueue pq;
> +	ssize_t advance_index;
>  };
>  
>  static int merged_iter_init(struct merged_iter *mi)
> @@ -96,13 +97,17 @@ static int merged_iter_next_entry(struct merged_iter *mi,
>  	struct pq_entry entry = { 0 };
>  	int err = 0;
>  
> +	if (mi->advance_index >= 0) {
> +		err = merged_iter_advance_subiter(mi, mi->advance_index);
> +		if (err < 0)
> +			return err;
> +		mi->advance_index = -1;
> +	}
> +

Without additional context, it isn't immediately clear to me why the
sub-iterator is condionally advanced at the beginning. Maybe a comment
could be added to explain as done in the commit message to help with
clarity?

>  	if (merged_iter_pqueue_is_empty(mi->pq))
>  		return 1;
>  
>  	entry = merged_iter_pqueue_remove(&mi->pq);
> -	err = merged_iter_advance_subiter(mi, entry.index);
> -	if (err < 0)
> -		return err;
>  
>  	/*
>  	  One can also use reftable as datacenter-local storage, where the ref
> @@ -116,14 +121,6 @@ static int merged_iter_next_entry(struct merged_iter *mi,
>  		struct pq_entry top = merged_iter_pqueue_top(mi->pq);
>  		int cmp;
>  
> -		/*
> -		 * When the next entry comes from the same queue as the current
> -		 * entry then it must by definition be larger. This avoids a
> -		 * comparison in the most common case.
> -		 */
> -		if (top.index == entry.index)
> -			break;
> -

I'm not quite sure I follow by the above check is removed as part of
this change. Would you mind clarifying?

-Justin