On Thu, Nov 30, 2023 at 11:18:19AM +0100, Patrick Steinhardt wrote: > > Performing verbatim pack reuse naturally trades off between CPU time and > > the resulting pack size. In the above example, the single-pack reuse > > case produces a clone size of ~194 MB on my machine, while the > > multi-pack reuse case produces a clone size closer to ~266 MB, which is > > a ~37% increase in clone size. > > Quite exciting, and a tradeoff that may be worth it for Git hosters. I > expect that this is going to be an extreme example of the benefits > provided by your patch series -- do you by any chance also have "real" > numbers that make it possible to quantify the effect a bit better? > > No worry if you don't, I'm just curious. I don't have a great sense, no. I haven't run these patches yet in production, although would like to do so soon for internal repositories to get a better sense here. There are some performance tests at the end which try and give you a sense of at least the relative speed-up depending on how many disjoint packs you have (IIRC, we test for 1, 10, and 100 disjoint packs). > > I think there is still some opportunity to close this gap, since the > > "packing" strategy here is extremely naive. In a production setting, I'm > > sure that there are more well thought out repacking strategies that > > would produce more similar clone sizes. > > > > I considered breaking this series up into smaller chunks, but was > > unsatisfied with the result. Since this series is rather large, if you > > have alternate suggestions on better ways to structure this, please let > > me know. > > The series is indeed very involved to review. I only made it up to patch > 8/24 and already spent quite some time on it. So I'd certainly welcome > it if this was split up into smaller parts, but don't have a suggestion > as to how this should be done (also because I didn't yet read the other > 16 patches). I suppose that one way to break it up might be: pack-objects: free packing_data in more places pack-bitmap-write: deep-clear the `bb_commit` slab pack-bitmap: plug leak in find_objects() midx: factor out `fill_pack_info()` midx: implement `DISP` chunk midx: implement `midx_locate_pack()` midx: implement `--retain-disjoint` mode pack-objects: implement `--ignore-disjoint` mode repack: implement `--extend-disjoint` mode pack-bitmap: pass `bitmapped_pack` struct to pack-reuse functions pack-bitmap: simplify `reuse_partial_packfile_from_bitmap()` signature pack-bitmap: return multiple packs via `reuse_partial_packfile_from_bitmap()` pack-objects: parameterize pack-reuse routines over a single pack pack-objects: keep track of `pack_start` for each reuse pack pack-objects: pass `bitmapped_pack`'s to pack-reuse functions pack-objects: prepare `write_reused_pack()` for multi-pack reuse pack-objects: prepare `write_reused_pack_verbatim()` for multi-pack reuse pack-objects: include number of packs reused in output pack-bitmap: prepare to mark objects from multiple packs for reuse pack-objects: add tracing for various packfile metrics t/test-lib-functions.sh: implement `test_trace2_data` helper pack-objects: allow setting `pack.allowPackReuse` to "single" pack-bitmap: reuse objects from all disjoint packs t/perf: add performance tests for multi-pack reuse Then you'd have five patch series, where each series does roughly the following: 1. Preparatory clean-up. 2. Implementing the DISP chunk, as well as --retain-disjoint, without a way to generate such packs. 3. Implement a way to generate such packs, but without actually being able to reuse more than one of them. 4. Implement multi-pack reuse, but without actually reusing any packs. 5. Enable multi-pack reuse (and implement the required scaffolding to do so), and test it. That's the most sensible split that I could come up with, at least. But I still find it relatively unsatisfying for a couple of reasons: - With the exception of the last group of patches, none of the earlier series enable any new, useful behavior on their own. IOW, if we just merged the first three series and then forgot about this topic, we wouldn't have done anything useful ;-). - The fourth series (which actually implements multi-pack reuse) still remains the most complicated, and would likely be the most difficult to review. So I think you'd still have one difficult series to review, plus four other series which are slightly less difficult to review ;-). It's entirely possible that I'm just too close to these patches to see a better split, so if you (or others) have any suggestions on how to break this up, please don't hesitate to share them. > I'll review the remaining patches at a later point in time. Thanks, I'll look forward to more of your review as usual :-). Thanks, Taylor