Re: [PATCH] builtin/repack.c: remove redundant pack-based bitmaps

On Mon, Oct 17, 2022 at 02:02:53PM -0400, Jeff King wrote:
> On Wed, Oct 12, 2022 at 03:05:33PM -0400, Taylor Blau wrote:
>
> > When we write a MIDX bitmap after repacking, it is possible that the
> > repository would be left in a state with both pack- and multi-pack
> > reachability bitmaps.
> >
> > This can occur, for instance, if a pack that was kept (either by having
> > a .keep file, or during a geometric repack in which it is not rolled up)
> > has a bitmap file, and the repack wrote a multi-pack index and bitmap.
> >
> > When loading a reachability bitmap for the repository, the multi-pack
> > one is always preferred, so the pack-based one is redundant. Let's
> > remove it unconditionally, even if '-d' isn't passed, since there is no
> > practical reason to keep both around. The patch below does just that.
>
> Yeah, this is certainly a reasonable thing to be doing. I wonder if you
> want to share the story of why the original midx-bitmap series did not
> include this patch. It is (IMHO, at least) an interesting debugging
> tale.

Sure, though note that this is a mystery that plagued me on-and-off for
a year or so before I finally figured it out ;-).

When we initially deployed MIDX bitmaps at GitHub, I held back this
patch as a recovery mechanism (i.e., if something was horribly broken
with MIDX bitmaps, we could quickly deploy a patch to stop reading them,
knowing that the existing pack bitmaps were still around).

After a while building up confidence in the MIDX bitmaps implementation,
I wrote something that resembles the patch here and deployed it. No big
deal, right?

Wrong. The "total time spent running `git repack`" metrics started
ticking up. But the metrics ticked up even in hosts and sites that had
not yet received the patch. Huh?

This led to many of the trace2 regions in midx.c that I recently[1, 2]
sent. But after deploying that patch and reverting it over and over
again, nothing in the trace2 metrics ended up yielding a satisfying
answer.

I spent a few days trying to debug the issue but didn't make any
progress. I confirmed that, yes, we were indeed reading MIDX bitmaps and
not relying on the pack-based bitmaps for some weird reason. And after I
confirmed that, I set the bug aside to work on other things, and mostly
forgot about it.

But in an unrelated incident, a colleague of mine noted a bug in a
GitHub-internal program called `git-get-unpacked-size`. The output of
this program is the number of bytes in non-bitmapped packs, and it's
used as a heuristic to help us pick repository networks to repack
roughly in order of "most non-repacked bytes".
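In shell terms, the heuristic boils down to something like this (a
rough sketch only; the real tool is GitHub-internal, and the function
name and layout here are made up for illustration):

```shell
# Sketch of the "unpacked size" heuristic: sum the bytes of packs in a
# pack directory that do not have a corresponding .bitmap file.
unpacked_size() {
	dir=$1
	total=0
	for pack in "$dir"/*.pack; do
		# skip the unexpanded glob when there are no packs
		test -e "$pack" || continue
		# a pack with its own .bitmap counts as already "packed"
		test -f "${pack%.pack}.bitmap" && continue
		total=$((total + $(wc -c <"$pack")))
	done
	echo "$total"
}
```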

The bug in `get-unpacked-size` was that we were including the size of
the cruft pack, meaning that certain networks would get scheduled much
more often. But in fixing that bug, I noticed that packs in the MIDX
would also get counted against our unpacked size.

So I wrote the minimal fix for the cruft pack bug, and noted the MIDX
issue for another day. When we did come back to it (I say "we", since
Victoria Dye implemented the actual fix), I suggested that we ignore any
packs that appear in a MIDX from the unpacked size, since we only write
a MIDX after repacking a repository.
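The fixed version looks roughly like this (again a sketch with made-up
names; here the caller passes in the list of packs covered by the MIDX,
since the real tool reads the multi-pack-index itself):

```shell
# Sketch of the fixed heuristic: ignore packs that carry their own
# .bitmap, and also ignore any pack covered by the multi-pack index,
# since we only write a MIDX after repacking.
unpacked_size_fixed() {
	dir=$1
	midx_packs=$2	# newline-separated pack basenames in the MIDX
	total=0
	for pack in "$dir"/*.pack; do
		test -e "$pack" || continue
		# skip packs with their own pack bitmap
		test -f "${pack%.pack}.bitmap" && continue
		# skip packs already covered by the multi-pack index
		printf '%s\n' "$midx_packs" |
			grep -qx "$(basename "$pack")" && continue
		total=$((total + $(wc -c <"$pack")))
	done
	echo "$total"
}
```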

When we did so, I suspected that it might inadvertently solve the
original mystery. It did, and here's why.

Individual maintenance runs did not get slower when removing redundant
pack bitmaps. But because removing the .bitmap file causes that pack to
be counted towards a repository's total unpacked size, it biases
scheduling towards larger repositories (which have bigger packs, and
therefore a more inflated overall size). So we weren't slowing anything
down, just doing slightly more expensive maintenance (on bigger
repositories) slightly more often. This explained not only the uptick,
but why the effect was so resistant to measurement.

It also explained why we saw the same effect when removing pack bitmaps
with MIDX bitmaps deployed to a single site only: whenever `git
get-unpacked-size` ran on a replica that *did* have the new code, the
unpacked-size count would get inflated. That explains why we saw
maintenance times tick up outside of our testing site.

So, that's the story :-).

[1]: https://lore.kernel.org/git/dc50527d99680fb0ff1f3240531105badaa221dc.1665612094.git.me@xxxxxxxxxxxx/
[2]: https://lore.kernel.org/git/dd5ceb1ec3c01e8a2af55130718fff0b5eaf2de0.1665612094.git.me@xxxxxxxxxxxx/

Thanks,
Taylor


