Re: [PATCH v3 01/17] Documentation/technical: add cruft-packs.txt

Jonathan Nieder <jrnieder@xxxxxxxxx> · Mon, 7 Mar 2022 10:03:35 -0800

Hi,

Taylor Blau wrote:

> Create a technical document to explain cruft packs. It contains a brief
> overview of the problem, some background, details on the implementation,
> and a couple of alternative approaches not considered here.

Sorry for the very slow review!  I've mentioned a few times that this
overlaps in interesting ways with the gc mechanism described in
hash-function-transition.txt, so I'd like to compare and see how they
interact.

[...]
> --- /dev/null
> +++ b/Documentation/technical/cruft-packs.txt
> @@ -0,0 +1,97 @@
[...]
> +Unreachable objects aren't removed immediately, since doing so could race with
> +an incoming push which may reference an object which is about to be deleted.
> +Instead, those unreachable objects are stored as loose object and stay that way
> +until they are older than the expiration window, at which point they are removed
> +by linkgit:git-prune[1].
> +
> +Git must store these unreachable objects loose in order to keep track of their
> +per-object mtimes.

It's worth noting that this behavior is already racy.  That is because
when an unreachable object becomes newly reachable, we do not update
its mtime and the mtimes of every object reachable from it, so if it
then becomes transiently unreachable again then it can be wrongly
collected.

[...]
>                                these repositories often take up a large amount of
> +disk space, since we can only zlib compress them, but not store them in delta
> +chains.

Yes!  I'm happy we're making progress on this.

> +
> +== Cruft packs
> +
> +A cruft pack eliminates the need for storing unreachable objects in a loose
> +state by including the per-object mtimes in a separate file alongside a single
> +pack containing all loose objects.

Can this doc say a little about how "git prune" handles these files?
In particular, does a non cruft pack aware copy of Git (or JGit,
libgit2, etc) do the right thing or does it fight with this mechanism?
If the latter, do we have a repository extension (extensions.*) to
prevent that?

[...]
> +  3. Write the pack out, along with a `.mtimes` file that records the per-object
> +     timestamps.

As a point of comparison, the design in hash-function-transition uses
a single timestamp for the whole pack.  During read operations, objects
in a cruft pack are considered present; during writes, they are
considered _not present_ so that if we want to make a cruft object
newly present then we put a copy of it in a new pack.

Advantage of the mtimes file approach:
- less duplication of storage: a revived object is only stored once,
  in a cruft pack, and then the next gc can "graduate" it out of the
  cruft pack and shrink the cruft pack
- less affect on non-gc Git code: writes don't need to know that any
  cruft objects referenced need to be copied into a new pack

Advantages of the mtime per cruft pack approach:
- easy expiration: once a cruft pack has reached its expiration date,
  it can be deleted as a whole
- less I/O churn: a cruft pack stays as-is until combined into another
  cruft pack or deleted.  There is no frequently-modified mtimes file
  associated to it
- informs the storage layer about what is likely to be accessed: cruft
  packs can get filesystem attributes to put them in less-optimized
  storage since they are likely to be less frequently read

[...]
> +Notable alternatives to this design include:
> +
> +  - The location of the per-object mtime data, and
> +  - Storing unreachable objects in multiple cruft packs.
> +
> +On the location of mtime data, a new auxiliary file tied to the pack was chosen
> +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
> +support for optional chunks of data, it may make sense to consolidate the
> +`.mtimes` format into the `.idx` itself.
> +
> +Storing unreachable objects among multiple cruft packs (e.g., creating a new
> +cruft pack during each repacking operation including only unreachable objects
> +which aren't already stored in an earlier cruft pack) is significantly more
> +complicated to construct, and so aren't pursued here. The obvious drawback to
> +the current implementation is that the entire cruft pack must be re-written from
> +scratch.

This doesn't mention the approach described in
hash-function-transition.txt (and that's already implemented and has
been in use for many years in JGit's DfsRepository).  Does that mean
you aren't aware of it?

Thanks,
Jonathan