Re: [PATCH v2 10/18] maintenance: add incremental-repack task

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 23 Jul 2020 15:00:11 -0700

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>    one did not exist, and otherwise will update the multi-pack-index
>    with any new pack-files that appeared since the last write. This
>    is particularly relevant with the background fetch job.
>
>    When the multi-pack-index sees two copies of the same object, it
>    stores the offset data into the newer pack-file. This means that
>    some old pack-files could become "unreferenced" which I will use
>    to mean "a pack-file that is in the pack-file list of the
>    multi-pack-index but none of the objects in the multi-pack-index
>    reference a location inside that pack-file."

An obvious alternative is to favor the copy in the older pack,
right?  Is the expectation that over time, most of the objects that
are relevant would reappear in newer packs, so that eventually by
favoring the copies in the newer packs, we can retire and remove the
old pack, keeping only the newer ones?

But would that assumption hold?  The old packs hold objects that are
necessary for the older parts of the history, so unless you are
cauterizing away the old history, these objects in the older packs
are likely to stay with us longer than those used by the newer parts
of the history, some of which may not even have been pushed out yet
and can be rebased away?

> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>    and updaes the multi-pack-index to drop those pack-files from the
>    list. This is safe to do as concurrent Git processes will see the
>    multi-pack-index and not open those packs when looking for object
>    contents. (Similar to the 'loose-objects' job, there are some Git
>    commands that open pack-files regardless of the multi-pack-index,
>    but they are rarely used. Further, a user that self-selects to
>    use background operations would likely refrain from using those
>    commands.)

OK.

> 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
>    of pack-files that are listed in the multi-pack-index and creates
>    a new pack-file containing the objects whose offsets are listed
>    by the multi-pack-index to be in those objects. The set of pack-
>    files is selected greedily by sorting the pack-files by modified
>    time and adding a pack-file to the set if its "expected size" is
>    smaller than the batch size until the total expected size of the
>    selected pack-files is at least the batch size. The "expected
>    size" is calculated by taking the size of the pack-file divided
>    by the number of objects in the pack-file and multiplied by the
>    number of objects from the multi-pack-index with offset in that
>    pack-file. The expected size approximats how much data from that

approximates.

>    pack-file will contribute to the resulting pack-file size. The
>    intention is that the resulting pack-file will be close in size
>    to the provided batch size.

> +static int maintenance_task_incremental_repack(void)
> +{
> +	if (multi_pack_index_write()) {
> +		error(_("failed to write multi-pack-index"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after initial write"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_expire()) {
> +		error(_("multi-pack-index expire failed"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after expire"));
> +		return rewrite_multi_pack_index();
> +	}
> +	if (multi_pack_index_repack()) {
> +		error(_("multi-pack-index repack failed"));
> +		return 1;
> +	}

Hmph, I wonder if these warning should come from each helper
functions that are static to this function anyway.

It also makes it easier to reason about this function by eliminating
the need for having a different pattern only for the verify helper.
Instead, verify could call rewrite internally when it notices a
breakage.  I.e.

	if (multi_pack_index_write())
		return 1;
	if (multi_pack_index_verify("after initial write"))
		return 1;
	if (multi_pack_index_exire())
		return 1;
	...

Also, it feels odd, compared to our internal API convention, that
positive non-zero is used as an error here.

> +	return 0;
> +}
> +
>  typedef int maintenance_task_fn(void);
>  
>  struct maintenance_task {
> @@ -1037,6 +1152,10 @@ static void initialize_tasks(void)
>  	tasks[num_tasks]->fn = maintenance_task_loose_objects;
>  	num_tasks++;
>  
> +	tasks[num_tasks]->name = "incremental-repack";
> +	tasks[num_tasks]->fn = maintenance_task_incremental_repack;
> +	num_tasks++;
> +
>  	tasks[num_tasks]->name = "gc";
>  	tasks[num_tasks]->fn = maintenance_task_gc;
>  	tasks[num_tasks]->enabled = 1;

Exactly the same comment as 08/18 about natural/inherent ordering
applies here as well.

Thanks.