Re: [PATCH/RFC 1/5] add a hashtable implementation that supports O(1) removal

Karsten Blees <karsten.blees@xxxxxxxxx> · Mon, 23 Sep 2013 11:16:45 +0200

Sorry for the delay, I've been on vacation...

Am 12.09.2013 01:56, schrieb Junio C Hamano:
> Karsten Blees <karsten.blees@xxxxxxxxx> writes:
> 
>> +#define FNV32_BASE ((unsigned int) 0x811c9dc5)
>> +#define FNV32_PRIME ((unsigned int) 0x01000193)
>> + ...
>> +static inline unsigned int bucket(const hashmap *map, const hashmap_entry *key)
>> +{
>> +	return key->hash & (map->tablesize - 1);
>> +}
> 
> As tablesize would hopefully be reasonably small, not worrying about
> platforms' "unsigned int" being 64-bit (in which case it would be
> more appropriate to compute with FNV64_PRIME) should be fine.
> 
>> +static inline hashmap_entry **find_entry(const hashmap *map,
>> +		const hashmap_entry *key)
>> +{
>> +	hashmap_entry **e = &map->table[bucket(map, key)];
>> +	while (*e && !entry_equals(map, *e, key))
>> +		e = &(*e)->next;
>> +	return e;
>> +}
> 
> (mental note) This finds the location the pointer to the entry is
> stored, not the entry itself.
> 

Will rename to find_entry_ptr to make this clear

>> +void *hashmap_get(const hashmap *map, const void *key)
>> +{
>> +	return *find_entry(map, key);
>> +}
> 
> ... which is consistent with this, and more importantly, it is
> crucial for hashmap_remove()'s implementation, because...
> 
>> +void *hashmap_remove(hashmap *map, const void *key)
>> +{
>> +	hashmap_entry *old;
>> +	hashmap_entry **e = find_entry(map, key);
>> +	if (!*e)
>> +		return NULL;
>> +
>> +	/* remove existing entry */
>> +	old = *e;
>> +	*e = old->next;
> 
> ... this wants to update the linked list in place.
> 
> Looking good.
> 
> I however wonder if the singly linked linear chain is a really good
> alternative for the access pattern of the hashes we use, though.

I don't think it will make a big difference in lookup performance, especially with good hash codes (such as the first bytes of a sha1). In theory, chaining should be slightly faster, because entries are strictly confined to their buckets (i.e. no extraneous entries need to be traversed when looking up an entry). With uniform hash distribution, chaining should require (1 + loadfactor) comparisons to find an entry, while open addressing requires (1/(1 - loadfactor)) [1]. I'll add a performance test for lookups, though.

[1] http://en.wikipedia.org/wiki/Hash_table#Performance_analysis

> Do we really want to trigger growth on the bucket load factor, not the
> length of the longest chain, for example?
> 

With good hashes and a load factor < 1, the longest 'chain' is 1. We only get chains in case of collisions, which however cannot necessarily be resolved by resizing. E.g. in the worst case, all entries have the same hash code, which deforms the hash table into a long linked list in a single bucket. Resizing won't change that.

An alternative would be to resize on the number of used buckets instead of total entries. I.e. with exceptionally bad hash codes and lots of collisions, we at least don't waste memory by making the table unnecessarily large. However, I don't think this is worth the extra effort.

>> +	old->next = NULL;
>> +
>> +	/* fix size and rehash if appropriate */
>> +	map->size--;
>> +	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
>> +		map->size * HASHMAP_SHRINK_AT < map->tablesize)
>> +		rehash(map, map->tablesize >> HASHMAP_GROW);
> 
> Please align the first two lines so that the first non-whitespace on
> the second line of the condition part of the "if" statement
> (i.e. 'm') aligns with the first non-whitespace inside the '(' open
> parenthesis (i.e. 'm').
> 

Will do.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html