Re: [PATCH v2 0/5] New hash table implementation

Karsten Blees <karsten.blees@xxxxxxxxx> · Thu, 26 Sep 2013 16:38:12 +0200

Am 24.09.2013 13:16, schrieb Tay Ray Chuan:
> Hi Karsten,
> 
> On Tue, Sep 24, 2013 at 5:50 PM, Karsten Blees <karsten.blees@xxxxxxxxx> wrote:
>>
>>         |       add        |  get 100% hits  |    get 10% hits
>>         |  hash  | hashmap | hash  | hashmap |  hash   | hashmap
>> --------+--------+---------+-------+---------+---------+--------
>> FNV     | 14.815 |   2.345 | 3.059 |   1.642 |   4.085 |   0.976
>> FNV  x2 | 14.409 |   2.706 | 2.888 |   1.959 |   3.905 |   1.393
>> i       |  7.432 |   1.593 | 1.364 |   1.142 | 413.023 |   0.589
>> i    x2 |  9.169 |   1.866 | 1.427 |   1.163 |   0.757 |   0.670
>> i/10    |  1.800 |   1.555 | 5.365 |   6.465 |  32.918 |   1.052
>> i/10 x2 |  1.892 |   1.555 | 5.386 |   6.474 |   1.123 |   1.206
>>
>> Tests can be reproduced with 'time echo "perfhash[map] <method> 1000" | ./test-hashmap', see test-hashmap.c for definition of method flags.
> 
> I'm not sure if I'm reading the numbers right, but they look impressive!
> 

The numbers are for 100 million additions / lookups (1,000 rounds á 100,000 entries). Considering everything else that happens in git, the hash table performance should be insignificant, though.

> If it's not too much trouble, could you put together an API document,
> along the lines of Documentation/technical/api-hash.txt?

Yes, I had already planned to port the documentation to asciidoc. Although in my experience, API documentation in the header file tends to better stay in sync with code changes (but this only makes real sense with extraction tools such as doxygen).

> I could give
> a stab at replacing patience and histogram diff's hash implementation
> with yours.
> 

Open addressing (i.e. distributing conflicting entries to other buckes) *may* be faster *if* all data fits into the table (i.e. no pointers to the data are used). Scanning such a table (without following pointers) has very high locality and thus may benefit from accessing fewer CPU cache lines. The patience implementation seems to fall into this category (although the entry struct is fairly large, and it also uses the *2 trick to defeat bad hash codes (which wouldn't be necessary with chaining)).

Both patience and histogram use preallocated, fixed-size hash tables, and thus won't benefit from faster inserts (the 'add' performance numbers are for dynamically resized hash tables).

So, converting patience/histogram is probably not worth the trouble for performance reasons alone. If it also simplifies the algorithms and/or reduces memory usage - fine.

Ciao,
Karsten
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html