RE: [PATCH v3 2/7] block alloc: add lifecycle APIs for cache_entry structs

Jameson Miller <jamill@xxxxxxxxxxxxx> · Thu, 24 May 2018 14:47:58 +0000

> -----Original Message-----
> From: Junio C Hamano <jch2355@xxxxxxxxx> On Behalf Of Junio C Hamano
> Sent: Thursday, May 24, 2018 12:52 AM
> To: Jameson Miller <jamill@xxxxxxxxxxxxx>
> Cc: git@xxxxxxxxxxxxxxx; pclouds@xxxxxxxxx; jonathantanmy@xxxxxxxxxx;
> sbeller@xxxxxxxxxx; peartben@xxxxxxxxx
> Subject: Re: [PATCH v3 2/7] block alloc: add lifecycle APIs for cache_entry
> structs
> 
> Jameson Miller <jamill@xxxxxxxxxxxxx> writes:
> 
> > Add an API around managing the lifetime of cache_entry structs.
> > Abstracting memory management details behind an API will allow for
> > alternative memory management strategies without affecting all the
> > call sites.  This commit does not change how memory is allocated /
> > freed. A later commit in this series will allocate cache entries from
> > memory pools as appropriate.
> >
> > Motivation:
> > It has been observed that the time spent loading an index with a large
> > number of entries is partly dominated by malloc() calls. This change
> > is in preparation for using memory pools to reduce the number of
> > malloc() calls made when loading an index.
> >
> > This API makes a distinction between cache entries that are intended
> > for use with a particular index and cache entries that are not. This
> > enables us to use the knowledge about how a cache entry will be used
> > to make informed decisions about how to handle the corresponding
> > memory.
> 
> Yuck.  make_index_cache_entry()?
> 
> Generally we use "cache" when working on the_index without passing istate,
> and otherwise "index", which means that readers can assume that
> distim_cache_entry(...)" is a shorter and more limited way to say
> "distim_index_entry(&the_index, ...)".  Having both index and cache in the same
> name smells crazy.
> 
> If most of the alocations are for permanent kind, give it a shorter name call it
> make_cache_entry(&the_index, ...), and call the other non-permanent one with
> a longer and more cumbersome name, perhaps
> make_transient_cache_entry(...).  Avoid saying "index" in the former name, as
> the design decision this series is making to allocate memory for a cache-entry
> from a pool associated to an index_state is already seen by what its first
> parameter is.

I like this suggestion - I will make this change in the next version of this series.

> 
> > diff --git a/cache.h b/cache.h
> > index f0a407602c..204f788438 100644
> > --- a/cache.h
> > +++ b/cache.h
> > @@ -339,6 +339,29 @@ extern void remove_name_hash(struct index_state
> > *istate, struct cache_entry *ce)  extern void free_name_hash(struct
> > index_state *istate);
> >
> >
> > +/* Cache entry creation and freeing */
> > +
> > +/*
> > + * Create cache_entry intended for use in the specified index. Caller
> > + * is responsible for discarding the cache_entry with
> > + * `discard_cache_entry`.
> > + */
> > +extern struct cache_entry *make_index_cache_entry(struct index_state
> > +*istate, unsigned int mode, const unsigned char *sha1, const char
> > +*path, int stage, unsigned int refresh_options); extern struct
> > +cache_entry *make_empty_index_cache_entry(struct index_state *istate,
> > +size_t name_len);
> > +
> > +/*
> > + * Create a cache_entry that is not intended to be added to an index.
> > + * Caller is responsible for discarding the cache_entry
> > + * with `discard_cache_entry`.
> > + */
> > +extern struct cache_entry *make_transient_cache_entry(unsigned int
> > +mode, const unsigned char *sha1, const char *path, int stage); extern
> > +struct cache_entry *make_empty_transient_cache_entry(size_t
> > +name_len);
> > +
> > +/*
> > + * Discard cache entry.
> > + */
> > +void discard_cache_entry(struct cache_entry *ce);
> 
> I am not yet convinced that it is a good idea to require each istate to hold a
> separate pool.  Anything that uses unpack_trees() can do "starting from this
> src_index, perform various mergy operations and deposit the result in
> dst_index".  Sometimes the two logical indices point at the same istate,
> sometimes different.  When src and dst are different istates, the code that used
> to simply add another pointer to the same ce to the dst index now needs to
> duplicate it out of the pool associated with dst?

I did not see any instances in unpack_trees() where it copied
just the cache_entry pointer from src to dst, but I will check
again.

You are correct, all the cache_entries need to be duplicated
before being added to the destination index, which is what I
think the code already does.  We tried to make this more
explicity by converting the inline xcalloc/memcpy instances to an
actual function.

In the existing code (before this patch series), the index
implicitly "owns" its cache_entry instances. This can be seen in
the discard_index function, where the index will discard any
cache entries that it has a reference to (that are not contained
in the split index). If there is code that just copies the
pointer to an unrelated index, then this cache entry would be
freed when the source index is freed (unless the cache entry is
removed from the src index).

With memory pools, the same pattern is followed, but this
relationship is a bit more explicit.

> 
> In any case, perhaps it will become clearer why it is a good idea as we read on,
> so let's do so.

Thank you for your review so far. I look forward to your thoughts on
the overall change and if it is a change you are interested in.