Re: [PATCH v2 7/8] packfile: add kept-pack cache for find_kept_pack_entry()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Feb 17, 2021 at 12:11:00PM -0500, Jeff King wrote:
> On Wed, Feb 03, 2021 at 10:59:21PM -0500, Taylor Blau wrote:
>
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index fbd7b54d70..b2ba5aa14f 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -1225,9 +1225,9 @@ static int want_found_object(const struct object_id *oid, int exclude,
> >  		 */
> >  		unsigned flags = 0;
> >  		if (ignore_packed_keep_on_disk)
> > -			flags |= ON_DISK_KEEP_PACKS;
> > +			flags |= CACHE_ON_DISK_KEEP_PACKS;
> >  		if (ignore_packed_keep_in_core)
> > -			flags |= IN_CORE_KEEP_PACKS;
> > +			flags |= CACHE_IN_CORE_KEEP_PACKS;
>
> Why are we renaming the constants in this patch?
>
> I know I'm listed as the author, but I think this came out of some
> off-list back and forth between us. It seems like the existing constants
> would have been fine.

Yeah, they would have been fine. They were renamed because this patch
makes them only used for the kept pack cache, but I agree the existing
names are fine, too.

In any case, they make an easier-to-read diff, so I'm perfectly happy to
un-rename them ;).

> > +static void maybe_invalidate_kept_pack_cache(struct repository *r,
> > +					     unsigned flags)
> >  {
> > -	return find_one_pack_entry(r, oid, e, 0);
> > +	if (!r->objects->kept_pack_cache)
> > +		return;
> > +	if (r->objects->kept_pack_cache->flags == flags)
> > +		return;
> > +	free(r->objects->kept_pack_cache->packs);
> > +	FREE_AND_NULL(r->objects->kept_pack_cache);
> > +}
>
> OK, so we keep a single cache based on the flags, and then if somebody
> ever asks for different flags, we throw it away. That's probably OK for
> our purposes, since we wouldn't expect multiple callers within a single
> process.
>
> I wondered if it would be simpler to just keep two lists, one for
> in-core keeps and one for on-disk keeps. And then just walk over each
> list separately based on the query flags. That makes things more robust
> _and_ I think would be less code. It does mean that a pack could appear
> in both lists, though, which means we might do a lookup in it twice.
> That doesn't seem all that likely, but it is working against our goal
> here.
>
> Another option is to keep 3 caches (two separate and one combined),
> rather than flipping between them. I'm not sure if that would be less
> code or not (it gets rid of the "invalidate" function, but you do have
> to pick the right cache depending on the query flags).
>
> Yet another option is to keep a cache of any that are marked as _either_
> in core or on-disk keeps, and then decide to look up the object based on
> the query flags. Then you just pay the cost to iterate over the list and
> check the flags (which really is all this cache is helping with in the
> first place).

All interesting ideas. In this patch (and by the end of the series)
callers that use the kept pack cache never ask for the cache with a
different set of flags. IOW, there isn't a situation where a caller
would populate the in-core kept pack cache, and then suddenly ask for
both in-core and on-disk packs to be kept.

So all of this code is defensive in case that were to change, and
suddenly we'd be returning subtly wrong results. I could imagine that
being kind of a nasty bug to track down, so detecting and invalidating
the cache would make it a non-issue.

I'll note it in the commit message, though, since it's good for future
readers to be aware, too.

> I dunno. TBH, I kind of wonder if this whole patch is worth doing at
> all, giving the underwhelming performance benefit (3% on the
> pathological 1000-pack case). When I had timed this strategy initially,
> it was more like 15%. I'm not sure where the savings went in the
> interim, or if it was a timing fluke.

Yeah, I dunno. It's certainly not hurting (I don't think the extra code
is all that complex, and the savings is at least non-zero), so I'm
inclined to keep it.

> > +static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags)
> > +{
> > +	maybe_invalidate_kept_pack_cache(r, flags);
> > +
> > +	if (!r->objects->kept_pack_cache) {
> > +		struct packed_git **packs = NULL;
> > +		size_t nr = 0, alloc = 0;
> > +		struct packed_git *p;
> > +
> > +		/*
> > +		 * We want "all" packs here, because we need to cover ones that
> > +		 * are used by a midx, as well. We need to look in every one of
> > +		 * them (instead of the midx itself) to cover duplicates. It's
> > +		 * possible that an object is found in two packs that the midx
> > +		 * covers, one kept and one not kept, but the midx returns only
> > +		 * the non-kept version.
> > +		 */
> > +		for (p = get_all_packs(r); p; p = p->next) {
> > +			if ((p->pack_keep && (flags & CACHE_ON_DISK_KEEP_PACKS)) ||
> > +			    (p->pack_keep_in_core && (flags & CACHE_IN_CORE_KEEP_PACKS))) {
> > +				ALLOC_GROW(packs, nr + 1, alloc);
> > +				packs[nr++] = p;
> > +			}
> > +		}
> > +		ALLOC_GROW(packs, nr + 1, alloc);
> > +		packs[nr] = NULL;
> > +
> > +		r->objects->kept_pack_cache = xmalloc(sizeof(*r->objects->kept_pack_cache));
> > +		r->objects->kept_pack_cache->packs = packs;
> > +		r->objects->kept_pack_cache->flags = flags;
> > +	}
>
> Is there any reason not to just embed the kept_pack_cache struct inside
> the object_store? It's one less pointer to deal with. I wonder if this
> is a holdover from an attempt to have multiple caches.
>
> (I also think it would be reasonable if we wanted to hide the definition
> of the cache struct from callers, but we don't seem do to that).

Not a holdover, just designed to avoid adding too many extra fields to
the object-store. I don't feel strongly, but I do think hiding the
definition is a good idea, so I'll inline it.

> > @@ -2109,7 +2123,8 @@ int has_object_pack(const struct object_id *oid)
> >  	return find_pack_entry(the_repository, oid, &e);
> >  }
> >
> > -int has_object_kept_pack(const struct object_id *oid, unsigned flags)
> > +int has_object_kept_pack(const struct object_id *oid,
> > +			 unsigned flags)
> >  {
> >  	struct pack_entry e;
> >  	return find_kept_pack_entry(the_repository, oid, flags, &e);
>
> This seems like a stray change.

Good eyes, thanks.

>
> > diff --git a/packfile.h b/packfile.h
> > index 624327f64d..eb56db2a7b 100644
> > --- a/packfile.h
> > +++ b/packfile.h
> > @@ -161,10 +161,6 @@ int packed_object_info(struct repository *r,
> >  void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1);
> >  const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1);
> >
> > -#define ON_DISK_KEEP_PACKS 1
> > -#define IN_CORE_KEEP_PACKS 2
> > -#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS)
>
> I notice that when the constants moved, we didn't keep an equivalent of
> ALL_KEEP_PACKS. Maybe we didn't need it in the first place in patch 1?

Yeah, we didn't need it to begin with. I'll drop it accordingly.

>   BTW, I absolutely hate the complication that all of this on-disk
>   versus in-core keep distinction brings to this code. And I wondered
>   what it was really doing for us and whether we could get rid of it.
>   But I think we do need it: a common case may be to avoid using
>   --honor-pack-keep (because you don't want to deal with racy .keep
>   writes from incoming receive-pack processes), but use in-core ones for
>   something like --stdin-packs. So we do need to respect one and not the
>   other.
>
>   I do wonder if things would be simpler if pack-objects simply kept its
>   own list of "in core" packs in a separate array. But that is really
>   just another form of the same problem, I guess.

Yeah, the complexity is awfully hard to reason about, but you're right
that here it is necessary.

> -Peff

Thanks,
Taylor



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux