Re: [PATCH 00/17] cruft packs


On Fri, Dec 03, 2021 at 11:51:51AM -0800, Junio C Hamano wrote:
> Taylor Blau <me@xxxxxxxxxxxx> writes:
>
> > This series implements "cruft packs", a pack which stores accumulated
> > unreachable objects, along with a new ".mtimes" file which tracks each
> > object's last known modification time.
>
> Let me rephrase the above to test my understanding, since I need to
> write a summary for the "What's cooking" report.
>
>  Instead of leaving unreachable objects in loose form when packing,
>  or ejecting them into loose form when repacking, gather them in a
>  packfile with an auxiliary file that records the last-use time of
>  these objects.

Exactly. Thanks for such a concise and accurate description of the
topic.

> That way, we do not have to waste so many inodes for loose objects
> that are not likely to be used, which feels like a win.

Yes. This had historically been a problem for GitHub. We don't
automatically prune unreachable objects during repacking, but sometimes
customers will ask us to do it on their behalf (if, for example, they
accidentally pushed sensitive information to us, and then force-pushed
over it).

But occasionally we'd get bitten by exploding many years' worth of
unreachable objects into loose form (because we used to freshen
packfiles too aggressively when moving them around).

We've been running this series in production for the past few months,
and it's been a huge relief for the folks who typically run these pruning
GCs.

> >   - The final patch handles object freshening for objects stored in a
> >     cruft pack.
>
> I am not going to read it today, but I think this is the most
> interesting part of the series.  Instead of using mtime of an
> individual loose object file, we'd need to record the time of
> last use for each object in a pack.
>
> Stepping back a bit, I do not see how we can get away without doing
> the same .mtimes file for non-cruft packs.  An object that is in a
> non-cruft pack may be referenced immediately after the repack that
> created the pack, but the ref that was referencing the object may
> have gone away and now the pack is a month old.  If we were to
> repack the object, we do not know when was the last time the object
> was reachable from any of the refs and index entries (collectively
> known as anchor points).

In that situation, we would use the mtime of the pack which contains
that object itself as a proxy (or the mtime of a loose copy of the
object, if it is more recent).

That isn't perfect, as you note, since if the pack isn't otherwise
freshened, we'd consider that object to be a month old, even if the
reference pointing at it was deleted a mere second ago.
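
To make that failure mode concrete, here is a small sketch. The helper
name, the timestamps, and the two-week grace period are all assumptions
chosen for illustration; none of them come from the series itself.

```python
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


def apparent_age(now: int, pack_mtime: int,
                 loose_mtime: Optional[int] = None) -> int:
    """Age of an object in a non-cruft pack, using the pack's mtime
    (or a more recent loose copy's mtime) as a proxy."""
    newest = pack_mtime if loose_mtime is None else max(pack_mtime, loose_mtime)
    return now - newest


now = 1_700_000_000                       # hypothetical "current time"
pack_mtime = now - 30 * SECONDS_PER_DAY   # pack written a month ago
ref_deleted_at = now - 1                  # ref removed a second ago; never consulted
grace = 14 * SECONDS_PER_DAY              # hypothetical pruning grace period

# The object looks a month old, so it appears to be past the grace
# period even though it only just became unreachable.
looks_expired = apparent_age(now, pack_mtime) > grace
```

Note that `ref_deleted_at` plays no part in the calculation, which is
exactly the imprecision described above.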

I can't recall if Peff and I talked about this off-list, but I have a
vague sense we probably did (and I forgot the details).

> Of course, recording all mtimes for all
> packed objects all the time would involve quite a lot of overhead.
> I am guessing (I will not spend time today to figure it out myself)
> that .mtimes update at runtime will happen in-place (i.e. via
> seek(2)+write(2), or pwrite()), and I wonder what the safety concern
> would be (which is the primary reason why we tend not to do in-place
> updates but recreate-and-rename updates).

Yeah, this series avoids doing an in-place update, and similarly avoids
recreating the entire .mtimes file before moving it into place. Instead,
freshening an object stored in a cruft pack takes place by rewriting a
copy of the object loose, since we consider an object's mtime to be the
most recent of (a) what's in the .mtimes file, (b) the mtime of the
containing pack, and (c) the mtime of a loose copy (if one exists).
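
As a rough sketch of that rule (the function and parameter names are
made up for illustration and are not taken from the patches), the
effective mtime is simply the maximum of whichever timestamps exist:

```python
from typing import Optional


def effective_mtime(cruft_mtime: Optional[int],
                    pack_mtime: int,
                    loose_mtime: Optional[int]) -> int:
    """Return the most recent available timestamp (seconds since epoch).

    cruft_mtime: (a) the entry in the .mtimes file, if there is one
    pack_mtime:  (b) the mtime of the containing pack
    loose_mtime: (c) the mtime of a loose copy, if one exists
    """
    candidates = [pack_mtime]
    if cruft_mtime is not None:
        candidates.append(cruft_mtime)
    if loose_mtime is not None:
        candidates.append(loose_mtime)
    return max(candidates)
```

Under this model, freshening only ever has to supply a newer value for
(c): writing a loose copy raises the maximum without touching the pack
or its .mtimes file.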

It can be wasteful, but in practice "resurrecting" an object in a cruft
pack is pretty rare, so on balance it ends up costing less work to do.

> Thanks for working on such an interesting topic.

I'm glad to have piqued your interest.

Taylor


