Re: Repacking a repository uses up all available disk space

On Sun, Jun 12, 2016 at 05:54:36PM -0400, Konstantin Ryabitsev wrote:

> >   git gc --prune=now
> 
> You are correct, this solves the problem, however I'm curious. The usual
> maintenance for these repositories is a regular run of:
> 
> - git fsck --full
> - git repack -Adl -b --pack-kept-objects
> - git pack-refs --all
> - git prune
> 
> The reason it's split into repack + prune instead of just gc is because
> we use alternates to save on disk space and try not to prune repos that
> are used as alternates by other repos in order to avoid potential
> corruption.
> 
> Am I not doing something that needs to be done in order to avoid the
> same problem?

Your approach makes sense; we do the same thing at GitHub for the same
reasons[1]. The main thing you are missing is that gc knows the
prune-time it is going to feed to git-prune[2], and passes that along
to repack. That's what enables the "don't bother ejecting these,
because I'm about to delete them" optimization.

That option is not documented, because it was always assumed to be an
internal thing to git-gc, but it is:

  git repack ... --unpack-unreachable=5.minutes.ago

or whatever.
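
Put together with the rest of the maintenance steps, that could look
roughly like the sketch below. It is only an illustration, not what
git-gc runs internally: the function names (run, repo_maintain), the
DRY_RUN switch and the default grace period are made up for the
example. The fsck step is left out (see [1]), and the prune uses the
same cutoff that is handed to repack (see [2]).

```shell
#!/bin/sh
# Sketch only: run, repo_maintain and DRY_RUN are hypothetical helpers,
# not git features. Set DRY_RUN=1 to print the commands instead of
# executing them.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# repo_maintain <repo-path> [<expiry>]
# <expiry> is any date git understands, e.g. 5.minutes.ago (an example
# value, not a recommendation).
repo_maintain() {
    repo=$1
    expire=${2:-5.minutes.ago}

    # 1. Repack, telling repack which unreachable objects are about to
    #    be pruned so it does not eject them as loose objects.
    # 2. Pack the refs.
    # 3. Prune with the same cutoff, leaving a grace period.
    run git -C "$repo" repack -A -d -l -b --pack-kept-objects \
        --unpack-unreachable="$expire" &&
    run git -C "$repo" pack-refs --all &&
    run git -C "$repo" prune --expire "$expire"
}
```

Running "DRY_RUN=1 repo_maintain /path/to/repo.git 1.hour.ago" prints
the three commands without touching the repository. For repositories
that serve as alternates for others, you would still want to skip the
prune step, as you do now.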

-Peff

[1] We don't run the fsck at the front, though, because it's really
    expensive.  I'm not sure it buys you much, either. The repack
    will do a full walk of the graph, so it gets you a connectivity
    check, as well as a full content check of the commits and trees. The
    blobs are copied as-is from the old pack, but there is a checksum on
    the pack data (to catch any bit flips by the disk storage). So the
    only thing the fsck is getting you is that it fully reconstructs the
    deltas for each blob and checks their sha1. That's more robust than
    a checksum, but it's a lot more expensive.

[2] It's unclear to me if you're passing any options to git-prune, but
    you may want to pass "--expire" with a short grace period. Without
    any options it prunes every unreachable thing, which can lead to
    races if the repository is actively being used.

    At GitHub we actually have a patch to `repack` that keeps all
    objects, reachable or not, in the pack, and we use it for all of our
    automated maintenance. Since we don't drop objects at all, we can't
    ever have such a race. Aside from some pathological cases, it wastes
    much less space than you'd expect. We turn the flag off for special
    cases (e.g., somebody has rewound history and wants to expunge a
    sensitive object).

    I'm happy to share the "keep everything" patch if you're interested.