Hi Glen,

On 10/12/2021 22:52, Glen Choo wrote:
> Philip Oakley <philipoakley@iee.email> writes:
>
>> Recently I saw a report [1] on a new theoretical result about managing
>> hash maps which get nearly 'full', which beats Knuth's limit formula.
>> The full paper is at [2].
>>
>> As I understand it, the method adds the gravestone entries early during
>> hash collisions to avoid clumping of such collision insertions, rather
>> than always having to enter the collision list at the end. This keeps
>> the available slots relatively randomly spaced.
>>
>> It feels like the old random bus arrival problem, where the average wait
>> for the next bus is identical to the average time since the last bus,
>> which is the same as the average bus interval (thus 1 + 1 = 1), and the
>> technique maintains that advantageous perception.
>>
>> Given Git's use of hashes, it sounds like it could have uses, assuming
>> the theory pans out. I've not yet gone through the paper itself [2] but
>> hope springs eternal.
>>
>> Philip
>>
>> [1] S. Nadis and M. I. of Technology, "Theoretical breakthrough could
>> boost data storage."
>> https://techxplore.com/news/2021-11-theoretical-breakthrough-boost-storage.html
>> (accessed Nov. 18, 2021).
>>
>> [2] M. A. Bender, B. C. Kuszmaul, and W. Kuszmaul, "Linear Probing
>> Revisited: Tombstones Mark the Death of Primary Clustering,"
>> arXiv:2107.01250 [cs, math], Jul. 2021, Accessed: Nov. 18, 2021.
>> [Online]. Available: http://arxiv.org/abs/2107.01250
>
> Very interesting, thanks for sharing. I haven't read the full paper
> either, but this is an interesting result.
>
> It seems that this result is limited to hashmaps with an approximately
> equal number of insertions and deletions.
>
> From [1]:
>
>   They found that for applications where the number of insertions and
>   deletions stays about the same—and the amount of data added is roughly
>   equal to that removed—linear-probing hash tables can operate at high
>   storage capacities without sacrificing speed.
>
> and [2]:
>
>   ...We then turn our attention to sequences of operations that contain
>   deletions, and show that the tombstones left behind by those deletions
>   have a primary-anti-clustering effect, that is, they have a tendency
>   to speed up future insertions
>
> Do we have any such use cases?

I know that we use hash maps, but I haven't followed their actual usage
in the various optimisations. Obviously we use hash naming of objects,
but that's generally a red herring, I think, unless we are
over-abbreviating the hash so that it's no longer unique (which could be
happening somewhere).

I suspect that some of the hosting providers may be more interested from
a file-system perspective, as I think we just pass the object store
problems to the FS.

Then again, all the mono-repo and partial-checkout corporate users are
likely to be interested, especially if this unblocks some historical
misunderstanding about the limits and how to handle them.
--
Philip
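
P.S. For anyone who hasn't met the tombstone idea before, here is a
minimal, self-contained sketch of the mechanism the paper analyses:
deletions in a linear-probing table leave a tombstone marker, and a
later insertion may reuse the first tombstone on its probe path instead
of lengthening the run, which is what works against primary clustering.
This is only my illustration of the general technique, not the paper's
"graveyard hashing" schedule and not Git's hashmap.[ch]; the names
(lp_insert, lp_remove) and the fixed 64-slot table are placeholders for
the sketch.

#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 64                  /* illustrative fixed size */

enum slot_state { EMPTY, OCCUPIED, TOMBSTONE };

struct slot {
	enum slot_state state;
	uint32_t key;
};

static struct slot table[TABLE_SIZE];  /* zero-initialised => all EMPTY */

static uint32_t slot_of(uint32_t key)
{
	return (key * 2654435761u) % TABLE_SIZE;   /* multiplicative hash */
}

/* Insert a key, reusing the first tombstone met on the probe path. */
static int lp_insert(uint32_t key)
{
	uint32_t i = slot_of(key);
	int first_tombstone = -1;
	unsigned int probes;

	for (probes = 0; probes < TABLE_SIZE; probes++, i = (i + 1) % TABLE_SIZE) {
		if (table[i].state == OCCUPIED) {
			if (table[i].key == key)
				return 0;              /* already present */
		} else if (table[i].state == TOMBSTONE) {
			if (first_tombstone < 0)
				first_tombstone = i;   /* remember it, keep probing */
		} else {                               /* EMPTY ends the probe run */
			if (first_tombstone >= 0)
				i = first_tombstone;   /* reuse the dead slot */
			table[i].state = OCCUPIED;
			table[i].key = key;
			return 1;
		}
	}
	if (first_tombstone >= 0) {                    /* no EMPTY slot left */
		table[first_tombstone].state = OCCUPIED;
		table[first_tombstone].key = key;
		return 1;
	}
	return -1;                                     /* table truly full */
}

/* Removal leaves a tombstone so later probes are not cut short. */
static int lp_remove(uint32_t key)
{
	uint32_t i = slot_of(key);
	unsigned int probes;

	for (probes = 0; probes < TABLE_SIZE; probes++, i = (i + 1) % TABLE_SIZE) {
		if (table[i].state == EMPTY)
			return 0;                      /* not found */
		if (table[i].state == OCCUPIED && table[i].key == key) {
			table[i].state = TOMBSTONE;
			return 1;
		}
	}
	return 0;
}

int main(void)
{
	uint32_t k;

	for (k = 0; k < 40; k++)       /* fill to ~62% load */
		lp_insert(k);
	for (k = 0; k < 40; k += 2)    /* delete half, leaving tombstones */
		lp_remove(k);
	for (k = 100; k < 120; k++)    /* new insertions can land in them */
		lp_insert(k);
	printf("done\n");
	return 0;
}

The paper's claim, as I read the abstract, is about when and how those
tombstones should be left in place (or even seeded deliberately) rather
than compacted away, so that the insert/delete mix above keeps probe
lengths short at high load factors.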