Re: [PATCH 3/3] tests: deterministichash: Make hash tables arch-independent

John Ferlan <jferlan@xxxxxxxxxx> · Tue, 15 Aug 2017 10:02:38 -0400

On 08/15/2017 08:01 AM, Peter Krempa wrote:
> On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
>> On 08/03/2017 04:50 AM, Peter Krempa wrote:
> 
> [ trimmed off-topic part ]
> 
>> NB: I didn't dig into the qemumonitorjsontest algorithm, but...
>>
>> While going through the common object changes, I ended up looking at and
>> thinking about the hash table algorithms, initially it was just the
>> "grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful)
>> since as soon as 1 bucket chain exceeds 8 elements, the table is resized
>> by 8 unless the new size is 8 * 2048 - at which time no matter how full
>> a bucket is we don't resize the table
> 
> Resizing the table is very expensive, so it makes sense to scale it
> aggresively. At some point though it would take too much memory.
> 
> If some place in the code expects massive hash table, it can set the
> hash table to > 2048 manually.
> 

Right - avoiding excessive or unnecessary resizing is something I was
considering. One thing that was at the back of my mind is "somewhat
recent" discussions about 10s of thousands of disk/volumes. IIRC it's
been discussed on qemu list as it relates to libguestfs.

Because volume names tend to be less random than a UUID, I was looking
at how the existing algorithms operate when you have say "disk00001"
through "disk99999" w/r/t bucket layout and length of chains when we
reach the maximum bucket count.

Still even with that, resizing a table just because one hash bucket
chain goes above 8 just seemed unnecessary. Still a work in progress
though - I may come up with nothing. It's the classic power struggle of
time vs. space.

>> The next thing I considered was using a prime number as the table bucket
>> size and it seems by just doing that will cause qemumonitorjsontest to
> 
> What would be the benefit of such change? I don't really see how prime
> numbers would improve anything performance-wise.
> 

Nothing performance wise directly that comes to mind. I see that while
composing Michal and Daniel have responded.

FWIW: My hacking has altered virhashtest.c to generate 1000's of UUIDs
and "elemN" names in order to check if/when one or the other has a
bucket full problem with the existing algorithms.  It's been a couple
weeks since I was looking at the data - so results are not fresh in mind
especially since I was on PTO last week.

>> fail. Easy enough to reproduce by changing the virHashCreate call in
>> qemuBlockNodeNameGetBackingChain to use 53 instead of 50 (even 49 causes
>> failure). It doesn't matter if the test source also adjusts the initial
>> hash size.
>>
>> Of course this makes sense given that virHashCodeGen is the backend of
>> the virHashComputeKey algorithm that uses the return value %
>> table->size. So if "size" changes, then obviously order changes.
> 
> As you've said that is expected. I don't quite follow what is your
> point here.
> 

I was pointing out that by changing the size you get different ordered
results, nothing more, nothing less. It just so happens that 50 returns
this set of results, but 53 is different.

>> While I get the code bloat conundrum, but if the order doesn't matter
>> then I wonder what other variables could change that would affect this
>> test and whether we really want to be constantly altering the mocks
>> "just" to get the right answer for this test.
> 
> The order depends on the result of the hasing function, the number of
> buckets and the order you've added the entries. All of those is
> expected when dealing with a hash table.
> 
> After this last patch, you are guaranteed to have the hash function
> return deterministic results (even on big endian-boxes).
> 

Change the size to 53, rebuild, and run the test. It'll take a few
minutes, but I think you'd see a failure.

> The hash size is not really altered very often since you can't really
> see a measurable performance benefit in most cases. This is also
> strenghtened by the fact, that the hash function is randomized by a
> random number, so in real usage you won't be able to deterministically
> cause hash collisions.

True... at 50 with 4 elements you won't see a resize. You could possibly
have a table size of 1 in this instance and be fine, but that only works
well for the test! It's more problematic for real world because a table
size of 1 grows to 8, 64, etc.  Some would say the minimum table size
should be at least 8 if not the next prime, e.g. 11.

In the long run, UUID's are random, but "#blockN" values are far less
random - it's the human randomness factor. No one is going to create 100
randomly named domain names or volume names, so factoring in the less
than randomly named elements is what I've been pondering.

> 
> In the test code, the cases are added in a deterministic order. Since
> the hashing function in the tests is deterministic as well, the only
> thing that would influence the ordering is the bucket count. We don't
> modify them too often.
> 
> I don't see what would make us alter mocks "just" to get the right
> answer here. The reason for this patch was, that the hash function is
> using bit operations and then treating the result as integers, which
> obviously does not work deterministically between big and little endian
> arches. Other than that, the test is now fully deterministic.
> 
> Peter
> 

Understood - until someone changes the size of the table to be random or
less deterministic, then there's no problem.  But if that happens,
there's a problem.

John

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list