Re: [PATCH 3/3] merge-recursive: change current file dir string_lists to hashmap

Jeff King <peff@xxxxxxxx> · Tue, 29 Aug 2017 04:41:55 -0400

On Mon, Aug 28, 2017 at 02:28:29PM -0600, Kevin Willford wrote:

> The code was using two string_lists, one for the directories and
> one for the files.  The code never checks the lists independently
> so we should be able to only use one list.  The string_list also
> is a O(log n) for lookup and insertion.  Switching this to use a
> hashmap will give O(1) which will save some time when there are
> millions of paths that will be checked.

Makes sense. I think when this code was written we didn't have the
generalized hashmap.c, which is why it relies on string lists.

> diff --git a/merge-recursive.c b/merge-recursive.c
> index d47f40ea81..ef4fe5f09f 100644
> --- a/merge-recursive.c
> +++ b/merge-recursive.c
> @@ -24,6 +24,26 @@
>  #include "dir.h"
>  #include "submodule.h"
>  
> +struct path_hashmap_entry {
> +	struct hashmap_entry;
> +	char path[FLEX_ARRAY];
> +};

This seems like a good use of FLEX_ARRAY.

> +static unsigned int (*pathhash)(const char *path);
> +static int (*pathcmp)(const char *a, const char *b);

These global function pointers are sufficiently out of the ordinary that
it might be worth adding a comment about how they're intended to be
set and used.

I also suspect they could be stuffed into the merge_options struct along
with the hashmap itself, and then passed via the cmp_data pointer (which
is relatively new-ish; I won't be surprised if you wrote this patch
before that existed and forward-ported it).

> +static int path_hashmap_cmp(const void *cmp_data,
> +			    const void *entry,
> +			    const void *entry_or_key,
> +			    const void *keydata)
> +{
> +	const struct path_hashmap_entry *a = entry;
> +	const struct path_hashmap_entry *b = entry_or_key;
> +	const char *key = keydata;
> +
> +	return pathcmp(a->path, key ? key : b->path);
> +}

The hashmap interface makes its cmp functions a bit tricky to write, but
this all looks correct to me.

> @@ -642,6 +662,8 @@ static void add_flattened_path(struct strbuf *out, const char *s)
>  
>  static char *unique_path(struct merge_options *o, const char *path, const char *branch)
>  {
> +	struct path_hashmap_entry dummy;
> +	struct path_hashmap_entry *entry;
>  	struct strbuf newpath = STRBUF_INIT;
>  	int suffix = 0;
>  	size_t base_len;
> @@ -650,14 +672,17 @@ static char *unique_path(struct merge_options *o, const char *path, const char *
>  	add_flattened_path(&newpath, branch);
>  
>  	base_len = newpath.len;
> -	while (string_list_has_string(&o->current_file_set, newpath.buf) ||
> -	       string_list_has_string(&o->current_directory_set, newpath.buf) ||
> +	hashmap_entry_init(&dummy, pathhash(newpath.buf));
> +	while (hashmap_get(&o->current_file_dir_set, &dummy, newpath.buf) ||
>  	       (!o->call_depth && file_exists(newpath.buf))) {
>  		strbuf_setlen(&newpath, base_len);
>  		strbuf_addf(&newpath, "_%d", suffix++);
> +		hashmap_entry_init(&dummy, pathhash(newpath.buf));
>  	}

I think you can use hashmap_get_from_hash() here to avoid dealing with
"dummy" yourself.

> @@ -1941,8 +1966,7 @@ int merge_trees(struct merge_options *o,
>  	if (unmerged_cache()) {
>  		struct string_list *entries, *re_head, *re_merge;
>  		int i;
> -		string_list_clear(&o->current_file_set, 1);
> -		string_list_clear(&o->current_directory_set, 1);
> +		hashmap_init(&o->current_file_dir_set, path_hashmap_cmp, NULL, 512);

This hunk seems funny. string_list_clear() assumes the input is an
already-initialized string list, and we drop all of its entries. But
hashmap_init assumes we have an _uninitialized_ chunk of memory, and
throws away whatever was there.

Do we ever hit this code path when there's actually something in the
hashmap?

I'd think we would want to hashmap_init() at the same place we used to
string_list_init(). I.e., in init_merge_options.

> @@ -1978,6 +2002,8 @@ int merge_trees(struct merge_options *o,
>  		string_list_clear(re_head, 0);
>  		string_list_clear(entries, 1);
>  
> +		hashmap_free(&o->current_file_dir_set, 1);

Ah, I think this answers my question above. We only care about the
contents during this one bit of code, so we do an init/free pair. And
since there's no "clear" function for a hashmap that drops the entries
without resetting things like the comparison function, we _have_ to
call hashmap_init in the earlier hunk.

That does make me wonder if it would be sane to just declare the hashmap
local to this block, which would make its lifetime much more clear. I
guess having it in merge_options is convenient for passing it around,
though.

> @@ -2179,8 +2205,8 @@ void init_merge_options(struct merge_options *o)
> [...]

The rest of the patch looks like a straight-forward conversion. With the
understanding I have above, I think everything here is correct, but
there are a few minor bits it might be worth addressing to make the code
more clear.

-Peff