On Sat, Oct 27, 2018 at 04:04:32PM +0200, Duy Nguyen wrote:

> > Of course any cache raises questions of cache invalidation, but I think
> > we've already dealt with that for this case. When we use
> > OBJECT_INFO_QUICK, that is a sign that we want to make this kind of
> > accuracy/speed tradeoff (which does a similar caching thing with
> > packfiles).
>
> We don't care about a separate process adding more loose objects while
> index-pack is running, do we? I'm guessing we don't but just to double
> check...

Right. That's basically what QUICK means: don't bother re-examining the
repository to handle simultaneous writes, even if it means saying an
object is not there when it has recently appeared. So far it has only
applied to packs, but this is really just the same concept (just as we
would not notice a new pack arriving, we will not notice a new loose
object arriving).

> > +/* probably should be configurable? */
> > +#define LOOSE_OBJECT_CACHE_MAX 65536
>
> Yes, perhaps with gc.auto config value (multiplied by 256) as the cut
> point. If it's too big maybe just go with a bloom filter. For this
> particular case we expect like 99% of calls to miss.

I wonder, though, if we should have a maximum at all. The existing
examples I've found of this technique are:

  - mark_complete_and_common_ref(), which is trying to cover this exact
    case. It looks like it avoids adding more objects than there are
    refs, so I guess it actually has a pretty small cutoff.

  - find_short_object_filename(), which does the same thing with no
    limits. And there if we _do_ have a lot of objects, we'd still
    prefer to keep the cache.

And really, this list is pretty much equivalent to looking at a pack
.idx. The only difference is that one is mmap'd, but here we'd use the
heap. So it's not shared between processes, but otherwise the working
set size is similar.

-Peff
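
To make the technique under discussion concrete, here is a minimal
stand-alone sketch: scan each objects/xx fanout directory once, keep the
resulting names in a sorted in-heap array, and answer existence queries
with a binary search instead of a stat() per lookup. This is
illustrative only, not git's implementation; the loose_cache_* names
are invented, it assumes SHA-1 (40-hex) object names, and error
handling is pared down for brevity:

/*
 * Sketch of a loose-object name cache (hypothetical; not git code).
 * Build once per process; like OBJECT_INFO_QUICK, it will not notice
 * loose objects that appear after the scan.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct loose_cache {
	char (*oids)[41];	/* sorted 40-hex names, NUL-terminated */
	size_t nr, alloc;
};

static int cmp_oid(const void *a, const void *b)
{
	return strcmp(a, b);
}

static void loose_cache_load(struct loose_cache *c, const char *objdir)
{
	char path[4096];
	int fanout;

	for (fanout = 0; fanout < 256; fanout++) {
		DIR *dir;
		struct dirent *de;

		snprintf(path, sizeof(path), "%s/%02x", objdir, fanout);
		dir = opendir(path);
		if (!dir)
			continue;
		while ((de = readdir(dir))) {
			/* SHA-1 loose files: 38 hex chars after the fanout */
			if (strlen(de->d_name) != 38)
				continue;
			if (c->nr == c->alloc) {
				c->alloc = c->alloc ? 2 * c->alloc : 64;
				c->oids = realloc(c->oids,
						  c->alloc * sizeof(*c->oids));
			}
			snprintf(c->oids[c->nr], sizeof(c->oids[c->nr]),
				 "%02x%s", fanout, de->d_name);
			c->nr++;
		}
		closedir(dir);
	}
	qsort(c->oids, c->nr, sizeof(*c->oids), cmp_oid);
}

static int loose_cache_contains(const struct loose_cache *c, const char *hex)
{
	return bsearch(hex, c->oids, c->nr, sizeof(*c->oids), cmp_oid) != NULL;
}

int main(int argc, char **argv)
{
	struct loose_cache c = { 0 };

	if (argc != 3) {
		fprintf(stderr, "usage: %s <objects-dir> <40-hex-oid>\n",
			argv[0]);
		return 2;
	}
	loose_cache_load(&c, argv[1]);
	printf("%zu loose objects; %s is %s\n", c.nr, argv[2],
	       loose_cache_contains(&c, argv[2]) ? "present" : "absent");
	return 0;
}

Note how this matches the pack .idx comparison above: the working set
is one sorted list of object names, just allocated on the heap rather
than mmap'd, so it is per-process rather than shared.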