Re: [PATCH 01/13] repository: introduce object parser field

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Wed, 2 May 2018 13:30:05 -0700

On Tue,  1 May 2018 14:33:51 -0700
Stefan Beller <sbeller@xxxxxxxxxx> wrote:

> Git's object access code can be thought of as containing two layers:
> the raw object store provides access to raw object content, while the
> higher level obj_hash code parses raw objects and keeps track of
> parenthood and other object relationships using 'struct object'.
> Keeping these layers separate should make it easier to find relevant
> functions and to change the implementation of one without having to
> touch the other.

I only understood this after reading the code below. I think this
paragraph can be removed; I don't see its relevance - of course we need
to store metadata about how to load objects somewhere, and caching
objects we have already loaded is a good idea: and the metadata and
cache are two separate things before and after this patch anyway.

> Add an object_parser field to 'struct repository' to prepare obj_hash
> to be handled per repository.  Callers still only use the_repository
> for now --- later patches will adapt them to handle arbitrary
> repositories.

I think this is better reworded as:

  Convert the existing global cache for parsed objects (obj_hash) into
  repository-specific parsed object caches. Existing code that uses
  obj_hash are modified to use the parsed object cache of
  the_repository; future patches will use the parsed object caches of
  other repositories.

> +struct object_parser *object_parser_new(void)
> +{
> +	struct object_parser *o = xmalloc(sizeof(*o));
> +	memset(o, 0, sizeof(*o));
> +	return o;
> +}

I'm not sure that I like this style of code (I prefer the strbuf style
with _INIT and release(), which I think is more flexible) but I don't
feel too strongly about it.

> +struct object_parser {
> +	struct object **obj_hash;
> +	int nr_objs, obj_hash_size;
> +};

object_parser is probably a bad name. I think Duy had some good
suggestions in [1].

[1] https://public-inbox.org/git/CACsJy8CgX6BME=EhiDBfMRzBOYDBNHE6Ouxv4fZC-GW7Rsi81Q@xxxxxxxxxxxxxx/

>  	/*
> -	 * Holds any information related to accessing the raw object content.
> +	 * Holds any information needed to retrieve the raw content
> +	 * of objects. The object_parser uses this to get object
> +	 * content which it then parses.
>  	 */
>  	struct raw_object_store *objects;

I think the additional sentence ("The object_parser uses this...") is
unnecessary and confusing, especially if its identity is going to be one
of storage (like "parsed_objects" implies).

> +	/*
> +	 * State for the object parser. This owns all parsed objects
> +	 * (struct object) so callers do not have to manage their
> +	 * lifetime.
> +	 */
> +	struct object_parser *parsed_objects;

Even after all the commits in this patch set, this does not store any
state for parsing. Maybe just document as:

  All objects in this repository that have been parsed. This structure
  owns all objects it references, so users of "struct object *"
  generally do not need to free them; instead, when a repository is no
  longer used, call object_parser_clear() on this structure.

(And maybe say that the freeing method on struct repository will
automatically call object_parser_clear().)