Re: [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

"Daniel P. Berrange" <berrange@xxxxxxxxxx> · Tue, 15 Aug 2017 15:11:25 +0100

On Tue, Aug 15, 2017 at 10:02:38AM -0400, John Ferlan wrote:
> 
> 
> On 08/15/2017 08:01 AM, Peter Krempa wrote:
> > On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
> >> On 08/03/2017 04:50 AM, Peter Krempa wrote:
> > 
> > [ trimmed off-topic part ]
> > 
> >> NB: I didn't dig into the qemumonitorjsontest algorithm, but...
> >>
> >> While going through the common object changes, I ended up looking at and
> >> thinking about the hash table algorithms, initially it was just the
> >> "grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful)
> >> since as soon as 1 bucket chain exceeds 8 elements, the table is resized
> >> by 8 unless the new size is 8 * 2048 - at which time no matter how full
> >> a bucket is we don't resize the table
> > 
> > Resizing the table is very expensive, so it makes sense to scale it
> > aggresively. At some point though it would take too much memory.
> > 
> > If some place in the code expects massive hash table, it can set the
> > hash table to > 2048 manually.
> > 
> 
> Right - avoiding excessive or unnecessary resizing is something I was
> considering. One thing that was at the back of my mind is "somewhat
> recent" discussions about 10s of thousands of disk/volumes. IIRC it's
> been discussed on qemu list as it relates to libguestfs.
> 
> Because volume names tend to be less random than a UUID, I was looking
> at how the existing algorithms operate when you have say "disk00001"
> through "disk99999" w/r/t bucket layout and length of chains when we
> reach the maximum bucket count.

For any sensible hash algorithm, such a "predictable" naming convention
is not a realk problem - indeed if it were a problem it would be considered
a security vulnerability these days.

While murmurhash is not a cryptographic hash, a single bit difference
in names should still produce a very different hashcode value. If we
switch to siphash though, which is a cryptographic hash, we will have
a strong guarantee of a completely different hash code for such names.

So again I don't think bucket size is vs primes is important here - the
choice of hash function more directly determines the collision resistance
we have.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list