Re: [PATCH] t1006: add tests for %(objectsize:disk)

René Scharfe <l.s.r@xxxxxx> · Sun, 24 Dec 2023 10:30:17 +0100

Am 23.12.23 um 11:18 schrieb Jeff King:
> On Fri, Dec 22, 2023 at 12:13:10AM +0100, René Scharfe wrote:
>
>>>> Should we deduplicate here, like cat-file does (i.e. use "sort -u")?
>>>> Having the same object in multiple places for whatever reason would not
>>>> be a cause for reporting an error in this test, I would think.
>>>
>>> No, for the reasons I said in the commit message: if an object exists in
>>> multiple places the test is already potentially invalid, as Git does not
>>> promise which version it will use. So it might work racily, or it might
>>> work for now but be fragile. By not de-duplicating, we make sure the
>>> test's assumption holds.
>>
>> Oh, skipped that paragraph.  Still I don't see how a duplicate object
>> would necessarily invalidate t1006.  The comment for the test "cat-file
>> --batch-all-objects shows all objects" a few lines above indicates that
>> it's picky about the provenance of objects, but it uses a separate
>> repository.  I can't infer the same requirement for the root repo, but
>> we already established that I can't read.
>
> The cat-file documentation explicitly calls this situation out:
>
>   Note also that multiple copies of an object may be present in the
>   object database; in this case, it is undefined which copy’s size or
>   delta base will be reported.
>
> So if t1006 were to grow such a duplicate object, what will happen? If
> we de-dup in the new test, then we might end up mentioning the same copy
> (and the test passes), or we might not (and the test fails). But much
> worse, the results might be racy (depending on how cat-file happens to
> decide which one to use). By no de-duping, then the test will reliably
> fail and the author can decide how to handle it then.
>
> IOW it is about failing immediately and predictably rather than letting
> a future change to sneak a race or other accident-waiting-to-happen into
> t1006.
>
>> Anyway, if someone finds a use for git repack without -d or
>> git unpack-objects or whatever else causes duplicates in the root
>> repository of t1006 then they can try to reverse your ban with concrete
>> arguments.
>
> In the real world, the most common way to get a duplicate is to fetch or
> push into a repository, such that:
>
>   1. There are enough objects to retain the pack (100 by default)
>
>   2. There's a thin delta in the on-the-wire pack (i.e., a delta against
>      a base that the sender knows the receiver has, but which isn't
>      itself sent).
>
> Then "index-pack --fix-thin" will complete the on-disk pack by storing a
> copy of the base object in it. And now we have it in two packs (and if
> it's a delta or loose in the original, the size will be different).

I think I get it now.  The size possibly being different is crucial.
cat-file deduplicates based on object ID alone.  sort -u in t1006 would
deduplicate based on object ID and size, meaning that it would only
remove duplicates of the same size.  Emulating the deduplication of
cat-file is also possible, but would introduce the race you mentioned.

However, even removing only same-size duplicates is unreliable because
there is no guarantee that the same object has the same size in
different packs.  Adding a new object that is a better delta base would
change the size.

So, deduplicating based on object ID and size is sound for any
particular run, but sizes are not stable and thus we need to know if
the tests do something that adds duplicates of any size.

René