Re: [PATCH v2 10/18] maintenance: add incremental-repack task

On 7/23/2020 6:00 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
>> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>>    one did not exist, and otherwise will update the multi-pack-index
>>    with any new pack-files that appeared since the last write. This
>>    is particularly relevant with the background fetch job.
>>
>>    When the multi-pack-index sees two copies of the same object, it
>>    stores the offset data into the newer pack-file. This means that
>>    some old pack-files could become "unreferenced" which I will use
>>    to mean "a pack-file that is in the pack-file list of the
>>    multi-pack-index but none of the objects in the multi-pack-index
>>    reference a location inside that pack-file."
> 
> An obvious alternative is to favor the copy in the older pack,
> right?  Is the expectation that over time, most of the objects that
> are relevant would reappear in newer packs, so that eventually by
> favoring the copies in the newer packs, we can retire and remove the
> old pack, keeping only the newer ones?
> 
> But would that assumption hold?  The old packs hold objects that are
> necessary for the older parts of the history, so unless you are
> cauterizing away the old history, these objects in the older packs
> are likely to stay with us longer than those used by the newer parts
> of the history, some of which may not even have been pushed out yet
> and can be rebased away?

If we created a new pack-file containing an already-packed object,
then shouldn't we assume that the new pack-file does a _better_ job
of compressing that object? Or at least, doesn't make it worse?

For example, 'git multi-pack-index repack --batch-size=0' creates
a new pack-file containing every previously-packed object. This new
pack-file should have better delta compression than the previous
arrangement across multiple pack-files. We want this new pack-file
to be used, not the old ones.
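
As an illustration, a repack-everything helper in the style of this
patch's other helpers might look like this (sketch only: the helper
name is hypothetical and not part of the patch, and strvec is the
current name for the arg-list API):

    #include "run-command.h"

    /*
     * Hypothetical helper: spawn
     * 'git multi-pack-index repack --batch-size=0' and report
     * failure as non-zero, matching the patch's helper style.
     */
    static int multi_pack_index_repack_all(void)
    {
            struct child_process child = CHILD_PROCESS_INIT;

            child.git_cmd = 1;
            strvec_pushl(&child.args, "multi-pack-index",
                         "repack", "--batch-size=0", NULL);

            /* a batch size of zero repacks every referenced object */
            return !!run_command(&child);
    }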

This "pick the newest pack" strategy is also what allows us to safely
use the 'expire' option to drop old pack-files. If we always keep the
old copies, then when we try to switch the new pack-files, we cannot
delete the old packs safely because a concurrent Git process could
be trying to reference it. One race is as follows:

 * Process A opens the multi-pack-index referring to old pack P. It
   doesn't open the pack-file as it hasn't tried to parse objects yet.

 * Process B is 'git multi-pack-index expire'. It sees that pack P can
   be dropped because all objects appear in newer pack-files. It deletes
   P.

 * Process A tries to read from P, and this fails. A must reprepare its
   representation of the pack-files.

This is the less disruptive race since A can recover with a small cost
to its performance. The worse race (on Windows) is this:

 * Process A loads the multi-pack-index and tries to parse an object by
   loading "old" pack P.

 * Process B tries to delete P. However, on Windows the handle to P by
   A prevents the deletion.

At this point, there are two possible resolutions. The first is to
have the 'expire' fail because we can't delete P. This means we might
never delete P in a busy repository. The second is that the 'expire'
command continues and drops P from the list in the multi-pack-index.
However, now all Git processes add P to the packed_git list because
it isn't referenced by the multi-pack-index.

In summary: the decision to pick the newer copies is a fundamental
part of how the write->expire->repack loop was designed.

>> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>>    and updates the multi-pack-index to drop those pack-files from the
>>    list. This is safe to do as concurrent Git processes will see the
>>    multi-pack-index and not open those packs when looking for object
>>    contents. (Similar to the 'loose-objects' job, there are some Git
>>    commands that open pack-files regardless of the multi-pack-index,
>>    but they are rarely used. Further, a user that self-selects to
>>    use background operations would likely refrain from using those
>>    commands.)
> 
> OK.
> 
>> 3. 'git multi-pack-index repack --batch-size=<size>' collects a set
>>    of pack-files that are listed in the multi-pack-index and creates
>>    a new pack-file containing the objects whose offsets are listed
>>    by the multi-pack-index to be in those pack-files. The set of pack-
>>    files is selected greedily by sorting the pack-files by modified
>>    time and adding a pack-file to the set if its "expected size" is
>>    smaller than the batch size until the total expected size of the
>>    selected pack-files is at least the batch size. The "expected
>>    size" is calculated by taking the size of the pack-file divided
>>    by the number of objects in the pack-file and multiplied by the
>>    number of objects from the multi-pack-index with offset in that
>>    pack-file. The expected size approximats how much data from that
> 
> approximates.
> 
>>    pack-file will contribute to the resulting pack-file size. The
>>    intention is that the resulting pack-file will be close in size
>>    to the provided batch size.
> 
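To make that heuristic concrete, the computation amounts to the
following (the names are illustrative, not the actual midx.c
identifiers):

    #include <stdint.h>

    /*
     * Expected contribution of one pack to the new pack-file.
     * Multiply before dividing so the integer average does not
     * truncate away small contributions.
     */
    static uint64_t expected_size(uint64_t pack_size,
                                  uint32_t objects_in_pack,
                                  uint32_t referenced_objects)
    {
            return pack_size * referenced_objects / objects_in_pack;
    }

So a 100 MB pack holding 200k objects, of which the multi-pack-index
still references 50k, has an expected size of 100 MB * 50k / 200k
= 25 MB.
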
>> +static int maintenance_task_incremental_repack(void)
>> +{
>> +	if (multi_pack_index_write()) {
>> +		error(_("failed to write multi-pack-index"));
>> +		return 1;
>> +	}
>> +
>> +	if (multi_pack_index_verify()) {
>> +		warning(_("multi-pack-index verify failed after initial write"));
>> +		return rewrite_multi_pack_index();
>> +	}
>> +
>> +	if (multi_pack_index_expire()) {
>> +		error(_("multi-pack-index expire failed"));
>> +		return 1;
>> +	}
>> +
>> +	if (multi_pack_index_verify()) {
>> +		warning(_("multi-pack-index verify failed after expire"));
>> +		return rewrite_multi_pack_index();
>> +	}
>> +	if (multi_pack_index_repack()) {
>> +		error(_("multi-pack-index repack failed"));
>> +		return 1;
>> +	}
> 
> Hmph, I wonder if these warnings should come from each helper
> function that is static to this function anyway.
> 
> It also makes it easier to reason about this function by eliminating
> the need for having a different pattern only for the verify helper.
> Instead, verify could call rewrite internally when it notices a
> breakage.  I.e.
> 
> 	if (multi_pack_index_write())
> 		return 1;
> 	if (multi_pack_index_verify("after initial write"))
> 		return 1;
> 	if (multi_pack_index_expire())
> 		return 1;
> 	...

This is a cleaner model. I'll work on that.
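
Something like this, perhaps (sketch only; run_git() stands in for
the run-command boilerplate shared by these helpers, and the
error()/warning() calls move into the helpers themselves):

    /* run_git() spawns 'git <cmd> <subcmd>', returns its exit code. */
    static int run_git(const char *cmd, const char *subcmd);

    static int multi_pack_index_verify(const char *when)
    {
            if (!run_git("multi-pack-index", "verify"))
                    return 0;
            warning(_("multi-pack-index verify failed %s"), when);
            return rewrite_multi_pack_index();
    }

    static int maintenance_task_incremental_repack(void)
    {
            if (multi_pack_index_write())
                    return 1;
            if (multi_pack_index_verify("after initial write"))
                    return 1;
            if (multi_pack_index_expire())
                    return 1;
            if (multi_pack_index_verify("after expire"))
                    return 1;
            return !!multi_pack_index_repack();
    }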

> Also, it feels odd, compared to our internal API convention, that
> positive non-zero is used as an error here.
...
> Exactly the same comment as 08/18 about natural/inherent ordering
> applies here as well.

I'll leave these to be resolved in the earlier messages.

Thanks,
-Stolee



