Re: [PATCH 05/24] midx: implement `DISP` chunk

Junio C Hamano <gitster@xxxxxxxxx> · Sun, 03 Dec 2023 22:15:11 +0900

Taylor Blau <me@xxxxxxxxxxxx> writes:

> diff --git a/Documentation/git-multi-pack-index.txt b/Documentation/git-multi-pack-index.txt
> index 3696506eb3..d130e65b28 100644
> --- a/Documentation/git-multi-pack-index.txt
> +++ b/Documentation/git-multi-pack-index.txt
> @@ -49,6 +49,10 @@ write::
>  	--stdin-packs::
>  		Write a multi-pack index containing only the set of
>  		line-delimited pack index basenames provided over stdin.
> +		Lines beginning with a '+' character (followed by the
> +		pack index basename as before) have their pack marked as
> +		"disjoint". See the "`DISP` chunk and disjoint packs"
> +		section in linkgit:gitformat-pack[5] for more.

Makes one wonder who computes the set of packfiles, decides to
prefix '+' to which ones, and how it does so, none of which appear
in this step (which is understandable).  As the flow of information
is from the producer of individual "disjoint" packs (not in this
step) to this new logic in "--stdin-packs" to the new "DISP" chunk
writer (the primary focus of this step) to the final consumer of
"DISP" chunk (not in this step), we are digging from the middle
(hopefully to both directions in other steps).  It is probably the
easiest way to explain the idea to start from the primary data
structures and "DISP" seems to be a good place to start.

> +	    Two packs are "disjoint" with respect to one another when they have
> +	    disjoint sets of objects.
> + In other words, any object found in a pack
> +	    contained in the set of disjoint packfiles is guaranteed to be
> +	    uniquely located among those packs.

I often advise people to rethink what they wrote _before_ "In other
words", because the use of that phrase is a sign that the author
considers the statement is hard to grok and needs rephrasing, in
which case, the rephrased version may be a better way to explain the
concept being presented without the harder-to-grok version.

But I do not think this one is a good example to apply the advice.
It is because "In other words," is somewhat misused in the sentence.
Two "disjoint" packs do not store any common object (which is how
you defined the adjective "disjoint" in the first sentence).  "As a
consequence"/"Hence", an object found in one pack among many
"disjoint" packs will not appear in others.

By the way, how strict does this disjointness have to be?

Let's examine an extreme case.  When you have two packs that are
"mostly" disjoint, but have one single object in common, how would
that object interfere with the bulk streaming of existing packdata
out of these two packs?  Would we be able to, say, safely pretend
that the problematic single object lives only in one but not in the
other (in other words, can we safely "ignore" the presence of the
copy in the other pack)?  I think it would break down if that
ignored copy is used as a delta base of another object in the same
pack, and the base object for the delta is recorded as OFS_DELTA
(which most likely every delta is these days), because we no longer
can stream out such deltified object without re-pointing its base to
the other copy, which will in turn screw up the relative offset of
other objects in the same stream.

OK, so it seems they really need to be strictly disjoint in order to
participate in the reuse of the existing packdata.  

> +When a chunk of bytes are reused from an existing pack, any objects
> +contained therein do not need to be added to the packing list, saving
> +memory and CPU time. But a chunk from an existing packfile can only be
> +reused when the following conditions are met:
> +
> +  - The chunk contains only objects which were requested by the caller
> +    (i.e. does not contain any objects which the caller didn't ask for
> +    explicitly or implicitly).

OK.

> +  - All objects stored as offset- or reference-deltas also include their
> +    base object in the resulting pack.

Are thin packs obsolete?

> diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
> index c4c6060cee..fd24e0c952 100755
> --- a/t/t5319-multi-pack-index.sh
> +++ b/t/t5319-multi-pack-index.sh
> @@ -1157,4 +1157,62 @@ test_expect_success 'reader notices too-small revindex chunk' '
>  	test_cmp expect.err err
>  '
>  
> +test_expect_success 'disjoint packs are stored via the DISP chunk' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		for i in 1 2 3 4 5
> +		do
> +			test_commit "$i" &&
> +			git repack -d || return 1
> +		done &&
> +
> +		find $objdir/pack -type f -name "*.idx" | xargs -n 1 basename | sort >packs &&

That is an overly-long line.

> +test_expect_success 'non-disjoint packs are detected' '
> +	test_when_finished "rm -fr repo" &&
> +	git init repo &&
> +	(
> +		cd repo &&
> +
> +		test_commit base &&
> +		git repack -d &&
> +		test_commit other &&
> +		git repack -a &&
> +
> +		ls -la .git/objects/pack/ &&

Is this line a leftover debugging aid?

> +		find $objdir/pack -type f -name "*.idx" |
> +			sed -e "s/.*\/\(.*\)$/+\1/g" >in &&

Lose "g"; it adds unnecessary cognitive burden to the readers if the
patterh is expected to match multiple times, and you know that is
not possible (your pattern is right anchored at the end).  This may
apply equally to other uses of "sed" in this patch.

> +		test_must_fail git multi-pack-index write --stdin-packs \
> +			--bitmap <in 2>err &&
> +		grep "duplicate object.* among disjoint packs" err
> +	)
> +'
> +
>  test_done