Re: [PATCH 2/2] builtin/repack.c: implement support for `--cruft-max-size`

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Sep 08, 2023 at 01:21:44PM +0200, Patrick Steinhardt wrote:
> On Thu, Sep 07, 2023 at 05:52:04PM -0400, Taylor Blau wrote:
> [snip]
> > @@ -125,17 +133,39 @@ static void mark_packs_for_deletion_1(struct string_list *names,
> >  		if (len < hexsz)
> >  			continue;
> >  		sha1 = item->string + len - hexsz;
> > -		/*
> > -		 * Mark this pack for deletion, which ensures that this
> > -		 * pack won't be included in a MIDX (if `--write-midx`
> > -		 * was given) and that we will actually delete this pack
> > -		 * (if `-d` was given).
> > -		 */
> > -		if (!string_list_has_string(names, sha1))
> > -			item->util = (void*)1;
> > +
> > +		if (pack_is_retained(item)) {
> > +			item->util = NULL;
> > +		} else if (!string_list_has_string(names, sha1)) {
> > +			/*
> > +			 * Mark this pack for deletion, which ensures
> > +			 * that this pack won't be included in a MIDX
> > +			 * (if `--write-midx` was given) and that we
> > +			 * will actually delete this pack (if `-d` was
> > +			 * given).
> > +			 */
> > +			item->util = DELETE_PACK;
> > +		}
>
> I find the behaviour of this function a tad surprising as it doesn't
> only mark a pack for deletion, but it also marks a pack as not being
> retained anymore. Shouldn't we rather:
>
>     if (pack_is_retained(item)) {
>         // Theoretically speaking we shouldn't even do this bit here as
>         // we _un_mark the pack for deletion. But at least we shouldn't
>         // be removing the `RETAIN_PACK` bit, I'd think.
>         item->util &= ~DELETE_PACK;
>     } else if (!string_list_has_string(names, sha1)) {
>         // And here we shouldn't discard the `RETAIN_PACK` bit either.
>         item->util |= DELETE_PACK;
>     }

I think the new version should address these issues. But yeah, I
definitely understand your confusion here. I think what's written in
this patch is OK, because we check only whether the `->util` field is
non-NULL before deleting, which is why we have to remove the RETAINED
bit.

But the new version looks like this instead:

    if (pack_is_retained(item))
        pack_unmark_for_deletion(item);
    else if (!string_list_has_string(names, sha1))
        pack_mark_for_deletion(item);

the RETAINED bits still stick around (pack_unmark_for_deletion() just
does `item->util &= ~DELETE_PACK`), but we don't consult them after
mark_packs_for_deletion_1() has finished executing. Instead we just
check for the existence of the DELETE_PACK bit, rather than whether or
not the whole util field is NULL.

> > @@ -799,6 +831,72 @@ static void remove_redundant_bitmaps(struct string_list *include,
> >  	strbuf_release(&path);
> >  }
> >
> > +static int existing_cruft_pack_cmp(const void *va, const void *vb)
> > +{
> > +	struct packed_git *a = *(struct packed_git **)va;
> > +	struct packed_git *b = *(struct packed_git **)vb;
> > +
> > +	if (a->pack_size < b->pack_size)
> > +		return -1;
> > +	if (a->pack_size > b->pack_size)
> > +		return 1;
> > +	return 0;
> > +}
> > +
> > +static void collapse_small_cruft_packs(FILE *in, unsigned long max_size,
>
> We might want to use `size_t` to denote file sizes instead of `unsigned
> long`.

We can safely change these to use size_t, but let's leave OPT_MAGNITUDE
alone (and treat that portion as #leftoverbits).

> > +		p = existing_cruft[i];
> > +		proposed = st_add(total_size, p->pack_size);
> > +
> > +		if (proposed <= max_size) {
> > +			total_size = proposed;
> > +			fprintf(in, "-%s\n", pack_basename(p));
> > +		} else {
> > +			retain_cruft_pack(existing, p);
> > +			fprintf(in, "%s\n", pack_basename(p));
> > +		}
>
> It's a bit funny that we re-check whether we have exceeded the maximum
> size in subsequente iterations once we hit the limit, but it arguably
> makes the logic a bit simpler.

Yeah. Those checks are all noops (IOW, once we end up in the else
branch, we'll stay there for the rest of the loop). But we don't want to
break early, because we have to call retain_cruft_pack() on everything.
In theory you could do something like:

    for (i = 0; i < existing_cruft_nr; i++) {
        size_t proposed;
        p = existing_cruft[i];
        proposed = st_add(total_size, p->pack_size);

        if (proposed <= max_size) {
            total_size = proposed;
            fprintf(in, "-%s\n", pack_basename(p));
        } else {
            break;
        }
    }

    for (; i < existing_cruft_nr; i++) {
        retain_cruft_pack(existing, existing_cruft[i]);
        fprintf(in, "%s\n", pack_basename(existing_cruft[i]));
    }

But I think that the above is slightly more error-prone than what is
written in the original patch. I have only the vaguest of preferences
towards the former, but I'm happy to change it around if you feel
strongly.

> If I understand correctly, we only collapse small cruft packs in case
> we're not expiring any objects at the same time. Is there an inherent
> reason why? I would imagine that it can indeed be useful to expire
> objects contained in cruft packs and then have git-repack(1) recombine
> whatever is left into larger packs.
>
> If the reason is basically "it's complicated" then that is fine with me,
> we can still implement the functionality at a later point in time. But
> until then I think that we should let callers know that the two options
> are incompatible with each other by producing an error when both are
> passed.

Your understanding is correct. We could try to leave existing cruft
packs alone when none of their objects are removed as a result of
pruning, but that case should be relatively rare. Another thing you
could do is handle cruft packs which have only part of their objects
being pruned by combining the non-pruned parts into a new pack.

The latter should be mostly straightforward to implement, but since
we're often ending up with very few cruft objects post-pruning, it
likely wouldn't help much.

Thanks,
Taylor



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux