On 7/23/2020 6:15 PM, Junio C Hamano wrote:
> It might be too late to ask this now, but how does the quality of
> the resulting combined pack ensured, wrt locality and deltification?

There are two questions here, really.

The first is: given the set of objects to pack, are we packing them as
efficiently as possible?

Since e11d86de139 (midx: teach "git multi-pack-index repack" honor
"git repack" configurations, 2020-05-10), the 'repack' subcommand
honors the configured recommendations for deltas. This includes:

(requires updating the arguments to pack-objects)
 * repack.useDeltaBaseOffset
 * repack.useDeltaIslands

(automatically respected by pack-objects)
 * repack.packKeptObjects
 * pack.threads
 * pack.depth
 * pack.window
 * pack.windowMemory
 * pack.deltaCacheSize
 * pack.deltaCacheLimit

All of these config settings allow the user to specify how hard to try
for delta compression. If they know something about their data or their
tolerance for extra CPU time during pack-objects, then they can get
better deltas by changing these values.

The second question is: how well do the deltas compress when we only
pack incrementally versus packing the entire repo?

One important factor here is how the pack-files are created. If we
expect most pack-files to come from 'git fetch' calls, then some
interesting patterns arise.

I started measuring by creating a local clone of the Linux kernel repo
starting at v5.0 and then fetching an increment of ten commits from the
first-parent history of later tags. Each fetch created a pack-file of
~300 MB relative to the base pack-file of ~1.6 GB. Collecting ten of
these in a row leads to almost 2 GB of "fetched" packs. However, keep
in mind that we didn't fetch 2 GB of data "across the wire"; instead,
the thin packs were expanded into full packs by copying the base
objects.

After running the incremental-repack step, that ~2 GB of data
re-compresses back down to one pack-file of size ~300 MB.

_Why_ did 10 pack-files, each around 300 MB, get repacked at once? It's
because there were duplicate objects across those pack-files! Recall
that the multi-pack-index repack selects packs for a batch by computing
an "expected pack size" for each one: it counts how many objects in
that pack-file are referenced by the multi-pack-index, then computes

  expected size = actual size * (num referenced objects) / (num objects)

In this case, the "base" objects that are copied between the fetches
are duplicated across these smaller pack-files, and the
multi-pack-index references each object from only one pack, so much of
each pack's content is unreferenced and its expected size falls far
below its actual ~300 MB. Thus, when the batch-size is ~300 MB, all 10
"small" packs still fit in one batch and are repacked into a new pack
that is still ~300 MB.

Now, this is still a little wasteful. That second pack has a
significant "extra space" cost. However, it comes with the bonus of
writing much less data.

Perhaps the Linux kernel repository is just too small to care about
this version of maintenance? In such a case, I can work to introduce a
'full-repack' task that is more aggressive about repacking all
pack-files. This could use the multi-pack-index repack with
--batch-size=0 to still benefit from the repack/expire trick for safe
concurrency.

Other ideas are to try repacking in other ways, such as by object type,
to maximize easy wins. For example, perhaps we repack all of the
commits and trees every time, but leave the blobs to be repacked when
we are ready to spend time on removing deltas?

I think the incremental-repack has value, even if that value is
isolated to super-huge repositories. That can be controlled by limiting
its use to the cases where an expert user configures Git to use it.
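For anyone who wants to try this by hand, here is a rough sketch of the
same strategy using the existing 'git multi-pack-index' subcommands.
The pack.* values and the 300m batch size are only illustrative numbers
I made up for this sketch, not recommendations, and the maintenance
task may sequence these steps differently:

  # Optionally tune how hard pack-objects works on deltas.
  git config pack.window 250
  git config pack.depth 50

  # Make sure a multi-pack-index covers the current pack-files.
  git multi-pack-index write

  # Combine packs whose "expected size" fits in the batch into one
  # new pack-file, rewriting the multi-pack-index to reference it.
  git multi-pack-index repack --batch-size=300m

  # Delete pack-files that no longer have any objects referenced by
  # the multi-pack-index. Concurrent readers stay safe because the
  # repack above did not delete anything itself.
  git multi-pack-index expire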
I remain open to recommendations from others with more experience in
delta compression who can suggest alternatives.

tl;dr: the incremental-repack isn't the most space-efficient thing we
can do, and that's by design.

Thanks,
-Stolee