On Tue, Mar 4, 2025 at 1:35 PM Taylor Blau <me@xxxxxxxxxxxx> wrote: > > This is a minor reroll of my series to fix a bug in freshening objects > with multiple cruft packs. > > Since last time the only updates to the series are textual changes to > the commit message, so everything functionally should be working as it > was before. > > As usual, there is a range-diff showing as much below, and the original > cover letter is as follows: > > --- > > This short series contains a fix for a bug I noticed while rolling out > multi-cruft packs (via 'git repack --max-cruft-size') within GitHub's > infrastructure. > > The series is structured as follows: > > - The first patch simplifies how 'repack' aggregates cruft packs > together when their size is below the '--max-cruft-size' or > '--max-pack-size' threshold. This simplification changes behavior > slightly, but not in a meaningful way. It occurred to me while > writing the second patch. > > - The second patch describes and fixes the main bug. The gist here is > that objects which are (a) unreachable, (b) exist in a cruft pack > being retained, and (c) were freshened to have a more recent mtime > than any existing cruft copy are unable to be freshened. > > The fix pursued in the second patch changes the rules around when we > want to retain an object via builtin/pack-objects.c::want_found_object() > when at least one cruft pack will survive the repack. > > Previously the rule was to discard any object which appears in any > surviving pack, regardless of mtime. The rule now is to only discard an > object if it appears in either (a) a non-cruft pack which will survive > the repack, or (b) a cruft pack whose mtime for that object is older > than the one we are trying to pack. > > I think that this is the right behavior, but admittedly putting this > series together hurt my brain trying to think through all of the cases. > I'm fairly confident in the testing here as I remember it being fairly > exhaustive of all interesting cases. But I'd appreciate a sanity check > from others that they too are convinced this is the right approach. > > Thanks in advance for your review! > > Taylor Blau (2): > builtin/repack.c: simplify cruft pack aggregation > builtin/pack-objects.c: freshen objects from existing cruft packs > > builtin/pack-objects.c | 118 ++++++++++++++++++++++++++++++++++------ > builtin/repack.c | 38 +------------ > packfile.c | 3 +- > packfile.h | 2 + > t/t7704-repack-cruft.sh | 105 +++++++++++++++++++++-------------- > 5 files changed, 170 insertions(+), 96 deletions(-) > > Range-diff against v1: > 1: 8564f982597 ! 1: 63ea9d4d00e builtin/repack.c: simplify cruft pack aggregation > @@ Commit message > would get combined together until the sum of their sizes was no larger > than the given max pack size. > > - There is a much simpler way to achieve this, however, which is to simply > - combine *all* cruft packs which are smaller than the threshold, > + There is a much simpler way to combine cruft packs, however, which is to > + simply combine *all* cruft packs which are smaller than the threshold, > regardless of what their sum is. With '--max-pack-size', 'pack-objects' > will split out the resulting pack into individual pack(s) if necessary > to ensure that the written pack(s) are each no larger than the provided > 2: c0c926adde2 ! 2: 7ba9054701b builtin/pack-objects.c: freshen objects from existing cruft packs > @@ Commit message > only be modified in a pruning GC, or if the threshold itself is > adjusted. > > - However, this process breaks down when we attempt to freshen an object > - packed in an earlier cruft pack that is larger than the threshold and > - thus will survive the repack. > + Prior to this patch, however, this process breaks down when we attempt > + to freshen an object packed in an earlier cruft pack, and that cruft > + pack is larger than the threshold and thus will survive the repack. > > When this is the case, it is impossible to freshen objects in cruft > - pack(s) which are larger than the threshold. This is because we avoid > - writing them in the new cruft pack entirely, for a couple of reasons. > + pack(s) when those cruft packs are larger than the threshold. This is > + because we would avoid writing them in the new cruft pack entirely, for > + a couple of reasons. > > 1. When enumerating packed objects via 'add_objects_in_unpacked_packs()' > we pass the SKIP_IN_CORE_KEPT_PACKS, which is used to avoid looping > @@ Commit message > - exists in a non-cruft pack that we are retaining, regardless of that > pack's mtime, or > > - - exists in a cruft pack with an mtime more recent than the copy we are > - debating whether or not to pack, in which case freshening would be > - redundant. > + - exists in a cruft pack with an mtime at least as recent as the copy > + we are debating whether or not to pack, in which case freshening > + would be redundant. > > To do this, keep track of whether or not we have any cruft packs in our > in-core kept list with a new 'ignore_packed_keep_in_core_has_cruft' > flag. When we end up in this new special case, we replace a call to > - 'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only > - reject objects when we have a copy in an existing cruft pack with a more > - recent mtime (in which case "freshening" would be redundant). > + 'has_object_kept_pack()' to 'want_cruft_object_mtime()', and only reject > + objects when we have a copy in an existing cruft pack with at least as > + recent an mtime as our candidate (in which case "freshening" would be > + redundant). > > Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx> > > @@ t/t7704-repack-cruft.sh: test_expect_success '--max-cruft-size with freshened ob > + > + git repack --cruft -d && > + > -+ # Make a packed copy of object $foo with a more recent > -+ # mtime. > ++ # Make an identical copy of foo stored in a pack with a more > ++ # recent mtime. > + foo="$(generate_random_blob foo $((2*1024*1024)))" && > + foo_pack="$(echo "$foo" | git pack-objects $packdir/pack)" && > + test-tool chmtime --get -100 \ > + "$packdir/pack-$foo_pack.pack" >foo.new && > + git prune-packed && > + > -+ # Make a loose copy of object $bar with a more recent > -+ # mtime. > ++ # Make a loose copy of bar, also with a more recent mtime. > + bar="$(generate_random_blob bar $((2*1024*1024)))" && > + test-tool chmtime --get -100 \ > + "$objdir/$(test_oid_to_path "$bar")" >bar.new && > > base-commit: 08bdfd453584e489d5a551aecbdcb77584e1b958 > -- > 2.49.0.rc0.57.gdb91954e186.dirty v2 looks good to me; though I'm curious if some wording improvement in the commit message might help in distinguishing between --max-cruft-size and --max-pack-size...and whether we want to provide any checks on the relative sizes of the two.