Re: [PATCH v4 0/8] repack: support repacking into a geometric sequence

Jeff King <peff@xxxxxxxx> · Tue, 23 Feb 2021 16:53:21 -0500

On Tue, Feb 23, 2021 at 02:41:09PM -0700, Martin Fick wrote:

> > > Would it make sense to somehow detect all new packs since the last rollup
> > > and always include them in the rollup no matter what their size? That is
> > > one thing that my git-exproll script did. One of the main reasons to do
> > > this was because newer packs tended to look big (I was using bytes to
> > > determine size), and newer packs were often bigger on disk compared to
> > > other packs with similar objects in them (I think you suggested this was
> > > due to the thickening of packs on receipt). Maybe roll up all packs with
> > > a timestamp "new enough", no matter how big they are?
> > 
> > That works against the "geometric" part of the strategy, which is trying
> > to roll up in a sequence that is amortized-linear. I.e., we are not
> > always rolling up everything outside of the base pack, but trying to
> > roll up little into medium, and then eventually medium into large. If
> > you roll up things that are "too big", then you end up rewriting the
> > bytes more often, and your amount of work becomes super-linear.
> 
> I'm not sure I follow, it would seem to me that it would stay linear, and be 
> at most rewriting each new packfile once more than previously? Are you 
> envisioning more work than that?

Maybe I don't understand what you're proposing.

The idea of the geometric repack is that by sorting by size and then
finding a "cutoff" within the size array, we can make sure that we roll
up a sufficiently small number of bytes in each roll-up that it ends up
linear in the size of the repo in the long run. But if we roll up
without regard to size, then our worst case is that the biggest pack is
the newest (imagine a repo with 10 small pushes and then one gigantic
one). So we roll that up with some small packs, doing effectively
O(size_of_repo) work. And then in the next roll up we do it again, and
so on. So we end up with O(size_of_repo * nr_rollups) total work. Which
is no better than having just done a full repack at each rollup.

Now I don't think we'd see that worst case in practice that much. And
depending on your definition of "new enough", you might keep nr_rollups
pretty small.

-Peff