Re: [PATCH 1/2] fast-import: use struct hash_table for atom strings

Jonathan Nieder <jrnieder@xxxxxxxxx> · Fri, 1 Apr 2011 21:42:09 -0500

Hi,

David Barr wrote:

> Signed-off-by: David Barr <david.barr@xxxxxxxxxxxx>

Thanks, this is a welcome change.  But perhaps it would be nice to
explain why, here? :)

E.g., what is stored in the atom table?  Does it tend to get big?  Does
the existing code allow it to grow?  This change will allow it to grow,
right?  What is the downside to this change (if any)?

In particular, timings showing the effect on typical use and on
scalability would be interesting.

> ---
>  fast-import.c |   17 ++++++++++-------
>  1 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/fast-import.c b/fast-import.c
> index 65d65bf..0592b21 100644
> --- a/fast-import.c
> +++ b/fast-import.c
> @@ -300,9 +300,8 @@ static size_t total_allocd;
>  static struct mem_pool *mem_pool;
>  
>  /* Atom management */
> -static unsigned int atom_table_sz = 4451;
>  static unsigned int atom_cnt;
> -static struct atom_str **atom_table;
> +static struct hash_table atom_table;
>  
>  /* The .pack file being generated */
>  static unsigned int pack_id;
> @@ -680,10 +679,11 @@ static struct object_entry *find_mark(uintmax_t idnum)
>  
>  static struct atom_str *to_atom(const char *s, unsigned short len)
>  {
> -	unsigned int hc = hc_str(s, len) % atom_table_sz;
> +	unsigned int hc = hc_str(s, len);
>  	struct atom_str *c;
> +	void **pos;
>  
> -	for (c = atom_table[hc]; c; c = c->next_atom)
> +	for (c = lookup_hash(hc, &atom_table); c; c = c->next_atom)
>  		if (c->str_len == len && !strncmp(s, c->str_dat, len))
>  			return c;
>  
> @@ -691,8 +691,12 @@ static struct atom_str *to_atom(const char *s, unsigned short len)
>  	c->str_len = len;
>  	strncpy(c->str_dat, s, len);
>  	c->str_dat[len] = 0;
> -	c->next_atom = atom_table[hc];
> -	atom_table[hc] = c;
> +	c->next_atom = NULL;
> +	pos = insert_hash(hc, c, &atom_table);
> +	if (pos) {
> +		c->next_atom = *pos;
> +		*pos = c;
> +	}

If I understand correctly, this puts new atoms at the start of the
chain, just like v1.7.4-rc0~40^2 (fast-import: insert new object
entries at start of hash bucket, 2010-11-23) did for objects.  Did you
measure and find this faster, or is it just for simplicity or
consistency?  (I'd personally be fine with it either way, but it seems
prudent to ask.)
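
For illustration, a minimal standalone sketch of the pattern in the
hunk above: atoms whose hashes collide share one table entry, and a new
atom is pushed onto the front of that entry's chain.  The toy_* table
and intern() are hypothetical stand-ins, not git's hash.c; the sketch
only assumes that the insert call returns NULL when it stores into an
empty slot and otherwise returns a pointer to the existing value, which
is how the patch uses insert_hash.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical atom, mirroring the fields the patch touches. */
struct atom {
	struct atom *next_atom;
	unsigned short str_len;
	char str_dat[32];
};

/* Toy open-addressing table keyed by the full hash; never resized. */
#define TOY_SLOTS 8
struct toy_slot {
	unsigned int hash;
	void *ptr;
};
static struct toy_slot toy_table[TOY_SLOTS];

/* Find the slot holding this hash, or the empty slot where it would go. */
struct toy_slot *toy_find(unsigned int hash)
{
	unsigned int i = hash % TOY_SLOTS;

	while (toy_table[i].ptr && toy_table[i].hash != hash)
		i = (i + 1) % TOY_SLOTS;
	return &toy_table[i];
}

void *toy_lookup(unsigned int hash)
{
	return toy_find(hash)->ptr;
}

/* NULL if stored in an empty slot, else pointer to the existing value. */
void **toy_insert(unsigned int hash, void *ptr)
{
	struct toy_slot *slot = toy_find(hash);

	if (slot->ptr)
		return &slot->ptr;
	slot->hash = hash;
	slot->ptr = ptr;
	return NULL;
}

/* The hash is passed in explicitly so that collisions can be forced. */
struct atom *intern(unsigned int hc, const char *s)
{
	size_t len = strlen(s);
	struct atom *c;
	void **pos;

	/* Walk the chain of atoms that share this hash. */
	for (c = toy_lookup(hc); c; c = c->next_atom)
		if (c->str_len == len && !memcmp(s, c->str_dat, len))
			return c;

	c = calloc(1, sizeof(*c));	/* next_atom starts out NULL */
	assert(c && len < sizeof(c->str_dat));
	c->str_len = (unsigned short)len;
	memcpy(c->str_dat, s, len);

	pos = toy_insert(hc, c);
	if (pos) {
		/* Hash already present: new atom becomes the chain head. */
		c->next_atom = *pos;
		*pos = c;
	}
	return c;
}
```

Interning two different strings under the same hash leaves the second
one at the head of the chain with the first behind it, and interning
either string again returns the existing atom.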

>  	atom_cnt++;
>  	return c;
>  }
> @@ -3263,7 +3267,6 @@ int main(int argc, const char **argv)
>  
>  	alloc_objects(object_entry_alloc);
>  	strbuf_init(&command_buf, 0);
> -	atom_table = xcalloc(atom_table_sz, sizeof(struct atom_str*));
>  	branch_table = xcalloc(branch_table_sz, sizeof(struct branch*));
>  	avail_tree_table = xcalloc(avail_tree_table_sz, sizeof(struct avail_tree_content*));
>  	marks = pool_calloc(1, sizeof(struct mark_set));

We never call init_hash.  That's technically safe because atom_table
is static (so it starts out zeroed) and init_hash just zeroes out the
table, but I think I'd rather see us using it anyway or documenting in
api-hash.txt that it's safe not to use.
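
A small sketch of why skipping the initializer happens to work: an
object with static storage duration starts out with every member zero,
which is the same state an initializer that only clears the fields
would produce.  The struct layout and the toy_* names below are
assumptions modeled on the description above, not copied from hash.h.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout, for illustration only. */
struct toy_entry {
	unsigned int hash;
	void *ptr;
};

struct toy_hash_table {
	unsigned int size, nr;
	struct toy_entry *array;
};

/* Static storage duration: all members start as 0 / NULL. */
static struct toy_hash_table implicit_table;

/* Explicit initializer that only clears the fields. */
void toy_init_hash(struct toy_hash_table *t)
{
	assert(t);
	t->size = 0;
	t->nr = 0;
	t->array = NULL;
}

int toy_is_empty(const struct toy_hash_table *t)
{
	return !t->size && !t->nr && !t->array;
}

/* Returns 1 if the never-initialized static matches an initialized table. */
int implicit_matches_explicit(void)
{
	struct toy_hash_table explicit_table = { 7, 3, NULL };

	toy_init_hash(&explicit_table);
	return toy_is_empty(&implicit_table) &&
	       toy_is_empty(&explicit_table);
}
```

The equivalence holds only while the initializer does nothing beyond
zeroing, which is the argument for either calling it anyway or
documenting the guarantee.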

Looks good.  Will queue to give it some testing.