Re: [PATCH 1/8] midx: expose 'write_midx_file_only()' publicly

Taylor Blau <me@xxxxxxxxxxxx> · Sat, 11 Sep 2021 12:17:30 -0400

On Fri, Sep 10, 2021 at 10:00:26PM -0700, Junio C Hamano wrote:
> Taylor Blau <me@xxxxxxxxxxxx> writes:
>
> >  	if (ends_with(file_name, ".idx")) {
> >  		display_progress(ctx->progress, ++ctx->pack_paths_checked);
> > -		if (ctx->m && midx_contains_pack(ctx->m, file_name))
> > -			return;
> > +		if (ctx->m) {
> > +			if (midx_contains_pack(ctx->m, file_name))
> > +				return;
> > +		} else if (ctx->to_include) {
> > +			if (!string_list_has_string(ctx->to_include, file_name))
> > +				return;
>
> What's the expected number of elements on the to_include list?  I am
> wondering about the performance implications of using linear search
> over the string-list, of course.  Is it about the same order of the
> number of packfiles in a repository (up to several dozens, or 1000
> at most unless you are insane, or something like that)?

You're definitely in the right ballpark. It depends on the repack
settings and size of repository, of course, but I imagine that roughly
1,000 entries would be the most anybody could ever pass (e.g., to be
left with that many packs after a `--geometric=2` repack, the biggest
pack would have to contain about 2^1000 times as many objects as the
smallest one).

Of course, you could just constantly be adding packs and doing
incremental `git repack -d --write-midx`. Seems unlikely to me, but if
it does become a problem we could easily read the values into a hashmap
and constant-ize the lookup.

But note that the lookup is already logarithmic, not linear:
string_list_has_string() does a binary search, which is why the string
list has to be sorted.
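
To illustrate the point about sorted lookups, here is a minimal sketch of
a binary search over a sorted array of pack names. It is not the actual
string-list code; `sorted_names_has()` and the plain `const char **`
array are made up for the example, and the only assumption is that the
names are sorted in strcmp() order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a sorted string list: return 1 if `needle`
 * is present in `names`, 0 otherwise. `names` must be sorted in
 * strcmp() order, which is what makes the O(log n) search valid.
 */
static int sorted_names_has(const char **names, size_t nr, const char *needle)
{
	size_t lo = 0, hi = nr;

	assert(names != NULL || nr == 0);

	while (lo < hi) {
		/* Midpoint of the half-open range [lo, hi). */
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(needle, names[mid]);

		if (!cmp)
			return 1;
		if (cmp < 0)
			hi = mid;	/* continue in the lower half */
		else
			lo = mid + 1;	/* continue in the upper half */
	}
	return 0;
}
```

With ~1,000 names this takes about ten string comparisons per lookup,
so a hashmap would only start to matter at much larger counts.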

Thanks,
Taylor