On Wed, Feb 03, 2021 at 10:59:21PM -0500, Taylor Blau wrote: > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index fbd7b54d70..b2ba5aa14f 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -1225,9 +1225,9 @@ static int want_found_object(const struct object_id *oid, int exclude, > */ > unsigned flags = 0; > if (ignore_packed_keep_on_disk) > - flags |= ON_DISK_KEEP_PACKS; > + flags |= CACHE_ON_DISK_KEEP_PACKS; > if (ignore_packed_keep_in_core) > - flags |= IN_CORE_KEEP_PACKS; > + flags |= CACHE_IN_CORE_KEEP_PACKS; Why are we renaming the constants in this patch? I know I'm listed as the author, but I think this came out of some off-list back and forth between us. It seems like the existing constants would have been fine. > +static void maybe_invalidate_kept_pack_cache(struct repository *r, > + unsigned flags) > { > - return find_one_pack_entry(r, oid, e, 0); > + if (!r->objects->kept_pack_cache) > + return; > + if (r->objects->kept_pack_cache->flags == flags) > + return; > + free(r->objects->kept_pack_cache->packs); > + FREE_AND_NULL(r->objects->kept_pack_cache); > +} OK, so we keep a single cache based on the flags, and then if somebody ever asks for different flags, we throw it away. That's probably OK for our purposes, since we wouldn't expect multiple callers within a single process. I wondered if it would be simpler to just keep two lists, one for in-core keeps and one for on-disk keeps. And then just walk over each list separately based on the query flags. That makes things more robust _and_ I think would be less code. It does mean that a pack could appear in both lists, though, which means we might do a lookup in it twice. That doesn't seem all that likely, but it is working against our goal here. Another option is to keep 3 caches (two separate and one combined), rather than flipping between them. I'm not sure if that would be less code or not (it gets rid of the "invalidate" function, but you do have to pick the right cache depending on the query flags). Yet another option is to keep a cache of any that are marked as _either_ in core or on-disk keeps, and then decide to look up the object based on the query flags. Then you just pay the cost to iterate over the list and check the flags (which really is all this cache is helping with in the first place). I dunno. TBH, I kind of wonder if this whole patch is worth doing at all, giving the underwhelming performance benefit (3% on the pathological 1000-pack case). When I had timed this strategy initially, it was more like 15%. I'm not sure where the savings went in the interim, or if it was a timing fluke. > +static struct packed_git **kept_pack_cache(struct repository *r, unsigned flags) > +{ > + maybe_invalidate_kept_pack_cache(r, flags); > + > + if (!r->objects->kept_pack_cache) { > + struct packed_git **packs = NULL; > + size_t nr = 0, alloc = 0; > + struct packed_git *p; > + > + /* > + * We want "all" packs here, because we need to cover ones that > + * are used by a midx, as well. We need to look in every one of > + * them (instead of the midx itself) to cover duplicates. It's > + * possible that an object is found in two packs that the midx > + * covers, one kept and one not kept, but the midx returns only > + * the non-kept version. > + */ > + for (p = get_all_packs(r); p; p = p->next) { > + if ((p->pack_keep && (flags & CACHE_ON_DISK_KEEP_PACKS)) || > + (p->pack_keep_in_core && (flags & CACHE_IN_CORE_KEEP_PACKS))) { > + ALLOC_GROW(packs, nr + 1, alloc); > + packs[nr++] = p; > + } > + } > + ALLOC_GROW(packs, nr + 1, alloc); > + packs[nr] = NULL; > + > + r->objects->kept_pack_cache = xmalloc(sizeof(*r->objects->kept_pack_cache)); > + r->objects->kept_pack_cache->packs = packs; > + r->objects->kept_pack_cache->flags = flags; > + } Is there any reason not to just embed the kept_pack_cache struct inside the object_store? It's one less pointer to deal with. I wonder if this is a holdover from an attempt to have multiple caches. (I also think it would be reasonable if we wanted to hide the definition of the cache struct from callers, but we don't seem do to that). > @@ -2109,7 +2123,8 @@ int has_object_pack(const struct object_id *oid) > return find_pack_entry(the_repository, oid, &e); > } > > -int has_object_kept_pack(const struct object_id *oid, unsigned flags) > +int has_object_kept_pack(const struct object_id *oid, > + unsigned flags) > { > struct pack_entry e; > return find_kept_pack_entry(the_repository, oid, flags, &e); This seems like a stray change. > diff --git a/packfile.h b/packfile.h > index 624327f64d..eb56db2a7b 100644 > --- a/packfile.h > +++ b/packfile.h > @@ -161,10 +161,6 @@ int packed_object_info(struct repository *r, > void mark_bad_packed_object(struct packed_git *p, const unsigned char *sha1); > const struct packed_git *has_packed_and_bad(struct repository *r, const unsigned char *sha1); > > -#define ON_DISK_KEEP_PACKS 1 > -#define IN_CORE_KEEP_PACKS 2 > -#define ALL_KEEP_PACKS (ON_DISK_KEEP_PACKS | IN_CORE_KEEP_PACKS) I notice that when the constants moved, we didn't keep an equivalent of ALL_KEEP_PACKS. Maybe we didn't need it in the first place in patch 1? BTW, I absolutely hate the complication that all of this on-disk versus in-core keep distinction brings to this code. And I wondered what it was really doing for us and whether we could get rid of it. But I think we do need it: a common case may be to avoid using --honor-pack-keep (because you don't want to deal with racy .keep writes from incoming receive-pack processes), but use in-core ones for something like --stdin-packs. So we do need to respect one and not the other. I do wonder if things would be simpler if pack-objects simply kept its own list of "in core" packs in a separate array. But that is really just another form of the same problem, I guess. -Peff