On Mon, Oct 17, 2022 at 02:02:53PM -0400, Jeff King wrote:
> On Wed, Oct 12, 2022 at 03:05:33PM -0400, Taylor Blau wrote:
>
> > When we write a MIDX bitmap after repacking, it is possible that the
> > repository would be left in a state with both pack- and multi-pack
> > reachability bitmaps.
> >
> > This can occur, for instance, if a pack that was kept (either by
> > having a .keep file, or during a geometric repack in which it is not
> > rolled up) has a bitmap file, and the repack wrote a multi-pack
> > index and bitmap.
> >
> > When loading a reachability bitmap for the repository, the
> > multi-pack one is always preferred, so the pack-based one is
> > redundant. Let's remove it unconditionally, even if '-d' isn't
> > passed, since there is no practical reason to keep both around. The
> > patch below does just that.
>
> Yeah, this is certainly a reasonable thing to be doing. I wonder if you
> want to share the story of why the original midx-bitmap series did not
> include this patch. It is (IMHO, at least) an interesting debugging
> tale.

Sure, though note that this is a mystery that plagued me on-and-off for
a year or so before I finally figured it out ;-).

When we initially deployed MIDX bitmaps at GitHub, I held back this
patch as a recovery mechanism (i.e., if something was horribly broken
with MIDX bitmaps, we could quickly deploy a patch to stop reading
them, knowing that the existing pack bitmaps were still around).

After a while spent building up confidence in the MIDX bitmaps
implementation, I wrote something that resembles the patch here and
deployed it. No big deal, right?

Wrong. The "total time spent running `git repack`" metrics started
ticking up. But the metrics ticked up even on hosts and sites that had
not yet received the patch. Huh?

This led to many of the trace2 regions in midx.c that I sent
recently[1, 2]. But after deploying that patch and reverting it over
and over again, nothing in the trace2 metrics ended up yielding a
satisfying answer.

I spent a few days trying to debug the issue but didn't make any
progress. I confirmed that, yes, we were indeed reading MIDX bitmaps
and not relying on the pack-based bitmaps for some weird reason. And
after I confirmed that, I set the bug aside to work on other things,
and mostly forgot about it.

But in an unrelated incident, a colleague of mine noted a bug in a
GitHub-internal program called `git-get-unpacked-size`. The output of
this program is the number of bytes in non-bitmapped packs, and it's
used as a heuristic to help us pick repository networks to repack,
roughly in order of "most non-repacked bytes".

The bug in `get-unpacked-size` was that we were including the size of
the cruft pack, meaning that certain networks would get scheduled much
more often than they should have been. But in fixing that bug, I
noticed that packs in the MIDX would also get counted against our
unpacked size. So I wrote a minimal fix for the cruft pack bug, and
noted the MIDX issue for another day.

When we did come back to it (I say "we", since Victoria Dye
implemented the actual fix), I suggested that we exclude any packs
that appear in a MIDX from the unpacked size, since we only write a
MIDX after repacking a repository.
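To make the heuristic concrete, here is roughly what the fixed-up
logic looks like. This is only an illustrative Python sketch, not the
actual tool; the function and the `midx_packs` parameter (the set of
packfile basenames covered by the repository's MIDX, gathered however
you like) are made up for illustration:

    import glob
    import os

    def unpacked_size(pack_dir, midx_packs):
        """Bytes in packs that don't yet count as "repacked"."""
        total = 0
        for pack in glob.glob(os.path.join(pack_dir, "pack-*.pack")):
            base = pack[:-len(".pack")]
            if os.path.exists(base + ".mtimes"):
                continue  # cruft pack: skipping it was the first fix
            if os.path.exists(base + ".bitmap"):
                continue  # covered by a pack-level .bitmap
            if os.path.basename(pack) in midx_packs:
                continue  # the second fix: a pack in the MIDX counts
                          # as repacked even without its own .bitmap
            total += os.path.getsize(pack)
        return total

Without that last check, a pack whose redundant .bitmap we had just
deleted in favor of the MIDX bitmap fell through and inflated the
total, which is exactly the bias described next.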
When we did so, I suspected that it might inadvertently solve the
original mystery. It did, and here's why.

Individual maintenance runs did not get slower when removing redundant
pack bitmaps. But because removing the .bitmap file causes that pack
to be counted towards a repository's total unpacked size, it biases
scheduling towards larger repositories (which have bigger packs, and
therefore a more inflated overall size). So we weren't slowing
anything down, just doing slightly more expensive maintenance (on
bigger repositories) slightly more often.

This explained not only the uptick, but also why the effect was so
resistant to measurement. It also explained why we saw the same effect
when removing pack bitmaps with MIDX bitmaps deployed to only a single
site: whenever `git get-unpacked-size` ran on a replica that *did*
have the new code, the unpacked-size count would get inflated. That is
why we saw maintenance times tick up outside of our testing site.

So, that's the story :-).

[1]: https://lore.kernel.org/git/dc50527d99680fb0ff1f3240531105badaa221dc.1665612094.git.me@xxxxxxxxxxxx/
[2]: https://lore.kernel.org/git/dd5ceb1ec3c01e8a2af55130718fff0b5eaf2de0.1665612094.git.me@xxxxxxxxxxxx/

Thanks,
Taylor