On Tue, Feb 23, 2021 at 12:54:56PM -0700, Martin Fick wrote:

> > Yeah, this is definitely a heuristic that can get out of sync with
> > reality. I think in general if you have base pack A and somebody pushes
> > up B, C, and D in sequence, we're likely to roll up a single DBC (in
> > that order) pack. Further pushes E, F, G would have newer mtimes. So we
> > might get GFEDBC directly. Or we might get GFE and DBC, but the former
> > would still have a newer mtime, so we'd create GFEDBC on the next run.
> >
> > The issues come from:
> >
> >   - we are deciding what to roll up based on size. A big push might
> >     not get rolled up immediately, putting it out-of-sync with the
> >     rest of the rollups.
>
> Would it make sense to somehow detect all new packs since the last
> rollup and always include them in the rollup no matter what their size?
> That is one thing that my git-exproll script did. One of the main
> reasons to do this was because newer packs tended to look big (I was
> using bytes to determine size), and newer packs were often bigger on
> disk compared to other packs with similar objects in them (I think you
> suggested this was due to the thickening of packs on receipt). Maybe
> roll up all packs with a timestamp "new enough", no matter how big they
> are?

That works against the "geometric" part of the strategy, which is trying
to roll up in a sequence that is amortized-linear. I.e., we are not
always rolling up everything outside of the base pack, but trying to
roll up little into medium, and then eventually medium into large. If
you roll up things that are "too big", then you end up rewriting the
bytes more often, and your amount of work becomes super-linear.

Now whether that matters all that much or not is perhaps another
discussion. The current strategy is mostly to repack all-into-one with
no base, which is the worst possible case. So just about any rollup
strategy will be an improvement. ;)

-Peff
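
To make the geometric split concrete, here is a rough Python sketch
(this is not the actual repack code; the factor of 2 and the pack sizes
below are made-up examples, and "size" could just as well be an object
count as bytes): keep the largest packs as long as each one is at least
"factor" times the combined size of everything smaller, and roll the
rest up into one new pack.

def split_for_rollup(sizes, factor=2):
    """Return (rollup, keep) given a list of pack sizes.

    Keep the largest packs as long as each one is at least `factor`
    times the combined size of everything smaller than it; roll the
    rest up into a single new pack.
    """
    sizes = sorted(sizes)
    split = 0
    # Walk from the largest pack downward; the first pack that is not
    # at least `factor` times the total of everything below it gets
    # rolled up, along with everything smaller than it.
    for i in range(len(sizes) - 1, 0, -1):
        if sizes[i] < factor * sum(sizes[:i]):
            split = i + 1
            break
    return sizes[:split], sizes[split:]

# A big base pack plus a handful of small pushes: only the small packs
# get combined, and the base pack is left alone.
rollup, keep = split_for_rollup([100_000, 500, 700, 300, 2_000])
print("roll up:", rollup)   # [300, 500, 700, 2000]
print("keep:   ", keep)     # [100000]

The invariant after a rollup is that the smallest kept pack is still at
least "factor" times the size of the new combined pack, which is what
keeps the total rewriting work amortized-linear; rolling packs up purely
by mtime, with no size check, would give up that bound.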