Re: [PATCH 01/17] Documentation/technical: add cruft-packs.txt

On Mon, Nov 29, 2021 at 7:29 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
>
> Create a technical document to explain cruft packs. It contains a brief
> overview of the problem, some background, details on the implementation,
> and a couple of alternative approaches not considered here.
>
> Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx>
> ---
>  Documentation/Makefile                  |  1 +
>  Documentation/technical/cruft-packs.txt | 95 +++++++++++++++++++++++++
>  2 files changed, 96 insertions(+)
>  create mode 100644 Documentation/technical/cruft-packs.txt
>
> diff --git a/Documentation/Makefile b/Documentation/Makefile
> index ed656db2ae..0b01c9408e 100644
> --- a/Documentation/Makefile
> +++ b/Documentation/Makefile
> @@ -91,6 +91,7 @@ TECH_DOCS += MyFirstContribution
>  TECH_DOCS += MyFirstObjectWalk
>  TECH_DOCS += SubmittingPatches
>  TECH_DOCS += technical/bundle-format
> +TECH_DOCS += technical/cruft-packs
>  TECH_DOCS += technical/hash-function-transition
>  TECH_DOCS += technical/http-protocol
>  TECH_DOCS += technical/index-format
> diff --git a/Documentation/technical/cruft-packs.txt b/Documentation/technical/cruft-packs.txt
> new file mode 100644
> index 0000000000..bb54cce1b1
> --- /dev/null
> +++ b/Documentation/technical/cruft-packs.txt
> @@ -0,0 +1,95 @@
> += Cruft packs
> +
> +Cruft packs offer an alternative to Git's traditional mechanism of removing
> +unreachable objects. This document provides an overview of Git's pruning
> +mechanism, and how cruft packs can be used instead to accomplish the same.
> +
> +== Background
> +
> +To remove unreachable objects from your repository, Git offers `git repack -Ad`
> +(see linkgit:git-repack[1]). Quoting from the documentation:
> +
> +[quote]
> +[...] unreachable objects in a previous pack become loose, unpacked objects,
> +instead of being left in the old pack. [...] loose unreachable objects will be
> +pruned according to normal expiry rules with the next 'git gc' invocation.
> +
> +Unreachable objects aren't removed immediately, since doing so could race with
> +an incoming push which may reference an object which is about to be deleted.
> +Instead, those unreachable objects are stored as loose objects and stay that way
> +until they are older than the expiration window, at which point they are removed
> +by linkgit:git-prune[1].
> +
> +Git must store these unreachable objects loose in order to keep track of their
> +per-object mtimes. If these unreachable objects were written into one big pack,
> +then either freshening that pack (because an object contained within it was
> +re-written) or creating a new pack of unreachable objects would cause the pack's
> +mtime to get updated, and the objects within it would never leave the expiration
> +window. Instead, objects are stored loose in order to keep track of the
> +individual object mtimes and avoid a situation where all cruft objects are
> +freshened at once.
> +
> +This can lead to undesirable situations when a repository contains many
> +unreachable objects which have not yet left the grace period. Having large
> +directories in the shards of `.git/objects` can lead to decreased performance in
> +the repository. Given enough unreachable objects, this can even lead to inode
> +starvation and degrade the performance of the whole system. Because we
> +can never pack those objects, these repositories often take up a large amount of
> +disk space: we can only zlib compress them, but not store them in delta
> +chains.
> +
> +== Cruft packs
> +
> +Cruft packs are designed to eliminate the need for storing unreachable objects
> +in a loose state by including the per-object mtimes in a separate file alongside
> +a single pack containing all loose objects.

I had the same question as Stolee here: why not use the cruft pack's
mtime for all the objects in it?  Much later below, you make it clear
that a repository will generally only have one cruft pack, which kind
of answers the question, but the repeated mention of "cruft packs"
throughout the document subtly led me to the opposite assumption.
It might be nice to address the almost-always-only-one-cruft-pack
point earlier on, which may also help answer the question about why
you need to store individual mtimes in an additional file.
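
To make the question concrete, here is a toy Python sketch of why one
pack-level mtime is not enough once everything lives in a single cruft
pack. This is not Git's code: the function names, the day-based
timestamps, and the 14-day grace period are all made up for
illustration.

```python
GRACE_PERIOD = 14  # days; an arbitrary value chosen for this illustration


def expired_with_pack_mtime(pack_mtime, object_ids, now):
    """Expired objects if the pack's single mtime is the only timestamp."""
    if now - pack_mtime > GRACE_PERIOD:
        return set(object_ids)
    return set()


def expired_with_object_mtimes(object_mtimes, now):
    """Expired objects when each object carries its own mtime."""
    return {oid for oid, mtime in object_mtimes.items()
            if now - mtime > GRACE_PERIOD}


# Two unreachable objects: "old" was last touched on day 0, "new" on day 20.
object_mtimes = {"old": 0, "new": 20}
# Rewriting the single cruft pack on day 20 bumps the pack's mtime to day 20.
pack_mtime = 20
now = 25

by_pack = expired_with_pack_mtime(pack_mtime, object_mtimes, now)
by_object = expired_with_object_mtimes(object_mtimes, now)
print(by_pack)    # empty: the whole pack looks only 5 days old
print(by_object)  # only "old" is past the grace period
```

With only the pack's mtime, every rewrite of the pack resets the clock
for all of its objects, so "old" would never expire. Per-object mtimes
avoid that.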

> +A cruft pack is written by `git repack --cruft` when generating a new pack, using
> +linkgit:git-pack-objects[1]'s `--cruft` option. Note that `git repack --cruft`
> +is a classic all-into-one repack, meaning that everything in the resulting pack is
> +reachable, and everything else is unreachable. Once written, the `--cruft`
> +option instructs `git repack` to generate another pack containing only objects
> +not packed in the previous step (which equates to packing all unreachable
> +objects together). This progresses as follows:
> +
> +  1. Enumerate every object, marking any object which is (a) not contained in a
> +     kept-pack, and (b) whose mtime is within the grace period as a traversal
> +     tip.
> +
> +  2. Perform a reachability traversal based on the tips gathered in the previous
> +     step, adding every object along the way to the pack.
> +
> +  3. Write the pack out, along with a `.mtimes` file that records the per-object
> +     timestamps.
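
For readers following along, the three steps above can be sketched
roughly as below. This is only my reading of the description, not the
actual implementation: the function name, the dict-based object graph,
and the choice to stop the traversal at objects in kept packs are my
assumptions, and timestamps are plain integers.

```python
def cruft_pack_contents(objects, kept, edges, mtimes, now, grace_period):
    """Return {object id: mtime} for the objects a cruft pack would hold.

    objects: every object id in the repository
    kept:    ids of objects stored in kept packs
    edges:   object id -> ids of the objects it references
    mtimes:  object id -> last-modified time
    """
    # Step 1: tips are objects outside kept packs that are still
    # within the grace period.
    tips = [oid for oid in objects
            if oid not in kept and now - mtimes[oid] <= grace_period]

    # Step 2: walk from the tips, collecting everything reached.
    # Skipping kept-pack objects here is an assumption of this sketch.
    packed = set()
    stack = list(tips)
    while stack:
        oid = stack.pop()
        if oid in packed or oid in kept:
            continue
        packed.add(oid)
        stack.extend(edges.get(oid, ()))

    # Step 3: the pack contents plus the data for the `.mtimes` file.
    return {oid: mtimes[oid] for oid in sorted(packed)}


# "a" is recent and references "b" (old) and "r" (in a kept pack);
# "c" is old and not referenced by any recent object.
result = cruft_pack_contents(
    objects=["a", "b", "c", "r"],
    kept={"r"},
    edges={"a": ["b", "r"]},
    mtimes={"a": 95, "b": 10, "c": 10, "r": 99},
    now=100,
    grace_period=14,
)
print(result)
```

In this toy example "b" is kept because a recent object still refers to
it, while "c" is left out of the new cruft pack.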
> +
> +This mode is invoked internally by linkgit:git-repack[1] when instructed to
> +write a cruft pack. Crucially, the set of in-core kept packs is exactly the set
> +of packs which will not be deleted by the repack; in other words, they contain
> +all of the repository's reachable objects.
> +
> +When a repository already has a cruft pack, `git repack --cruft` typically only
> +adds objects to it. An exception to this is when `git repack` is given the
> +`--cruft-expiration` option, which allows the generated cruft pack to omit
> +expired objects instead of waiting for linkgit:git-gc[1] to expire those objects
> +later on.
> +
> +It is linkgit:git-gc[1] that is typically responsible for removing expired
> +unreachable objects.
> +
> +== Alternatives
> +
> +Notable alternatives to this design include:
> +
> +  - The location of the per-object mtime data, and
> +  - Whether cruft packs should be incremental or not.
> +
> +On the location of mtime data, a new auxiliary file tied to the pack was chosen
> +to avoid complicating the `.idx` format. If the `.idx` format were ever to gain
> +support for optional chunks of data, it may make sense to consolidate the
> +`.mtimes` format into the `.idx` itself.
> +
> +Incremental cruft packs (i.e., where each time a repository is repacked a new
> +cruft pack is generated containing only the unreachable objects introduced since
> +the last time a cruft pack was written) are significantly more complicated to
> +construct, and so aren't pursued here. The obvious drawback to the current
> +implementation is that the entire cruft pack must be re-written from scratch.
> --
> 2.34.1.25.gb3157a20e6
>


