Re: [PATCH] Replace hashing algorithm with murmurhash

Eric Blake <eblake@xxxxxxxxxx> · Wed, 18 Jan 2012 11:34:16 -0700

On 01/18/2012 09:23 AM, Daniel P. Berrange wrote:
> From: "Daniel P. Berrange" <berrange@xxxxxxxxxx>
> 
> Recent discussions have illustrated the potential for DOS attacks
> with the hash table implementations used by most languages and
> libraries.
> 
>    https://lwn.net/Articles/474912/
> 
> libvirt has an internal hash table impl, and uses hash tables for
> a variety of purposes. The hash key generation code is pretty
> simple and thus not strongly collision resistant.
> 
> This patch replaces the current libvirt hash key generator with
> the (public domain) Murmurhash3 code. In addition every hash
> table now gets a random seed value which is used to perturb the
> hashing code. This should make it impossible to mount any
> practical attack against libvirt hashing code.
> 
> * src/Makefile.am: Add virhashcode.[ch]
> * src/util/util.c: Make virRandom() return a fixed 32 bit
>   integer value.
> * src/util/hash.c, src/util/hash.h, src/util/cgroup.c: Replace
>   hash code generation with a call to virHashCodeGen()
> * src/util/virhashcode.h, src/util/virhashcode.c: Add a new
>   virHashCodeGen() API using the Murmurhash3 algorithm.

Looks like a good idea, but I have to NAK this version of the patch and
ask for a v2.  Meanwhile, I wonder if gnulib should adopt the
murmurhash3 algorithm (but that's a different story for a different list).

> @@ -1636,9 +1637,10 @@ cleanup:
>  }
>  
>  
> -static unsigned long virCgroupPidCode(const void *name)
> +static int32_t virCgroupPidCode(const void *name, int32_t seed)
>  {
> -    return (unsigned long)name;
> +    unsigned long pid = (unsigned long)name;
> +    return virHashCodeGen(&pid, sizeof(pid), seed);

I'm not sure if it matters, but you could shorten these two lines to
just one:

return virHashCodeGen(&name, sizeof(unsigned long), seed);

> @@ -101,11 +95,10 @@ static void virHashStrFree(void *name)
>      VIR_FREE(name);
>  }
>  
> -
>  static unsigned long
>  virHashComputeKey(virHashTablePtr table, const void *name)
>  {
> -    unsigned long value = table->keyCode(name);
> +    uint32_t value = table->keyCode(name, table->seed);
>      return (value % table->size);

If, on a 64-bit machine, table->size ever grows larger than uint32_t,
then we've lost some hashing abilities compared to what the old unsigned
long hash gave us.  Conversely, if table-size ever grows that large,
we're burning an awful lot of memory in the first place, and a hash
collision at that point is the least of our worries :)  I think this is
okay, although you may want to consider changing virHashComputeKey to
return uint32_t.

> +++ b/src/util/hash.h
> @@ -55,12 +55,15 @@ typedef int (*virHashSearcher) (const void *payload, const void *name,
>  /**
>   * virHashKeyCode:
>   * @name: the hash key
> + * @seed: random seed
>   *
> - * Compute the hash code corresponding to the key @name
> + * Compute the hash code corresponding to the key @name, using
> + * @seed to perturb the hashing algorithm
>   *
>   * Returns the hash code
>   */
> -typedef unsigned long (*virHashKeyCode)(const void *name);
> +typedef int32_t (*virHashKeyCode)(const void *name,
> +                                  int32_t seed);

Umm, you used uint32_t in virHashComputeKey, but int32_t here.  While I
don't _think_ signed values will bite us due to sign-extension in 64-bit
operations introducing a bias, I'd rather make this signature:

typedef uint32_t (*virHashKeyCode)(const void *name, uint32_t seed);

and adjust the callers to comply with unsigned everywhere.

> +++ b/src/util/util.c
> @@ -2115,7 +2115,7 @@ int virRandomInitialize(unsigned int seed)
>      return 0;
>  }
>  
> -int virRandom(int max)
> +int32_t virRandom(int32_t max)

Ouch.  This means you only get 31 bits of randomness for
virRandom(INT32_MAX).  But you _want_ 32 bits of randomness, especially
since you are storing _virHashTable.seed as uint32_t.

>  {
>      int32_t ret;
>  
> @@ -2123,7 +2123,7 @@ int virRandom(int max)
>      random_r(&randomData, &ret);
>      virMutexUnlock(&randomLock);
>  
> -    return (int) ((double)max * ((double)ret / (double)RAND_MAX));
> +    return (int32_t) ((double)max * ((double)ret / (double)RAND_MAX));

Furthermore, this calculation leaves gaps on platforms where RAND_MAX is
only the POSIX-mandated minimum of 32767, or 15 bits (thankfully, at
least Linux uses RAND_MAX of INT_MAX for 31 bits).  Furthermore, this
computation is non-uniform for all but a handful of numbers: if you
request a max larger than RAND_MAX, then there are some values you
cannot get back; if you request a max smaller than RAND_MAX but which is
not a power of 2, then floating point division followed by rounding to
integers will cause some numbers to be selected more frequently than others.

I also note that all existing clients of virRandom() passed in a power
of 2 (your use of virRandom(INT_MAX) was the first instance of someone
using less than a power of two; and even then, you really did want a
power of 2).  Since we want to avoid non-uniform distribution, it makes
sense to _require_ that you request a max as a power of 2 (that is, you
are requesting a certain quantity of random bits), and proceed to do
everything without the use of any floating point math.  Technically, it
is also possible to write a capped random number generator for any range
[0,non-power-of-2), by repeatedly querying the generator for random bits
between [0,next-power-of-2) and discarding any queries that fall in the
range [cap,power-of-2); but no need to add that complexity if we promise
to always request a power-of-2 random bits in the first place.

That is, I propose that we ditch virRandom, and instead implement a new
function:

/* Return an evenly distributed random number between [0,2^nbits), where
nbits must be in the range (0,64]. */
uint64_t virRandomBits(int nbits)
{
    int bits_per_iter = count_one_bits(RAND_MAX);
    uint64_t ret = 0;
    int32_t bits;
    /* This algorithm requires that RAND_MAX == 2^n-1 for some n;
       gnulib's random_r meets this property. */
    verify(((RAND_MAX + 1U) & RAND_MAX) == 0);

    virMutexLock(&randomLock);

    while (nbits > bits_per_iter) {
        random_r(&randomData, &bits);
        ret = (ret << bits_per_iter) | (bits & RAND_MAX);
        nbits -= bits_per_iter;
    }

    random_r(&randomData, &bits);
    ret = (ret << nbits) | (bits & ((1 << nbits) - 1));

    virMutexUnlock(&randomLock);
    return ret;
}

where existing calls to virRandom(256) will now be virRandomBits(8),
virRandom(1024*1024*1024) will now be virRandomBits(30), and your code
would use uint32_t seed = virRandom(32).

> +#include "virhashcode.h"
> +
> +static uint32_t rotl(uint32_t x, int8_t r)
> +{
> +    return (x << r) | (x >> (32 - r));
> +}

Gnulib provides the bitrotate module, which provides "bitrotate.h" and
an implementation of this function that is guaranteed to be as efficient
as possible when compiled by gcc (possibly by using compiler-builtins to
access raw assembly on architectures that have builtin rotate
instructions).  I'd suggest using that rather than rolling your own.

> +
> +/* slower than original but is endian neutral and handles platforms that
> + * do only aligned reads */
> +__attribute__((always_inline))
> +static uint32_t getblock(const uint8_t *p, int i)
> +{
> +    uint32_t r;
> +    size_t size = sizeof(uint32_t);
> +
> +    memcpy(&r, &p[i * size], size);
> +
> +    return le32toh(r);

<endian.h>, and thus le32toh(), is not yet standardized (although POSIX
will be adding it in the future), nor is it currently provided by
gnulib.  We'd have to get that fixed first.

-- 
Eric Blake   eblake@xxxxxxxxxx    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

Attachment:
signature.asc

Description: OpenPGP digital signature
--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list