On 08/15/2017 08:01 AM, Peter Krempa wrote: > On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote: >> On 08/03/2017 04:50 AM, Peter Krempa wrote: > > [ trimmed off-topic part ] > >> NB: I didn't dig into the qemumonitorjsontest algorithm, but... >> >> While going through the common object changes, I ended up looking at and >> thinking about the hash table algorithms, initially it was just the >> "grow" algorithm as I think it is a bit "aggressive" (perhaps wasteful) >> since as soon as 1 bucket chain exceeds 8 elements, the table is resized >> by 8 unless the new size is 8 * 2048 - at which time no matter how full >> a bucket is we don't resize the table > > Resizing the table is very expensive, so it makes sense to scale it > aggresively. At some point though it would take too much memory. > > If some place in the code expects massive hash table, it can set the > hash table to > 2048 manually. > Right - avoiding excessive or unnecessary resizing is something I was considering. One thing that was at the back of my mind is "somewhat recent" discussions about 10s of thousands of disk/volumes. IIRC it's been discussed on qemu list as it relates to libguestfs. Because volume names tend to be less random than a UUID, I was looking at how the existing algorithms operate when you have say "disk00001" through "disk99999" w/r/t bucket layout and length of chains when we reach the maximum bucket count. Still even with that, resizing a table just because one hash bucket chain goes above 8 just seemed unnecessary. Still a work in progress though - I may come up with nothing. It's the classic power struggle of time vs. space. >> The next thing I considered was using a prime number as the table bucket >> size and it seems by just doing that will cause qemumonitorjsontest to > > What would be the benefit of such change? I don't really see how prime > numbers would improve anything performance-wise. > Nothing performance wise directly that comes to mind. I see that while composing Michal and Daniel have responded. FWIW: My hacking has altered virhashtest.c to generate 1000's of UUIDs and "elemN" names in order to check if/when one or the other has a bucket full problem with the existing algorithms. It's been a couple weeks since I was looking at the data - so results are not fresh in mind especially since I was on PTO last week. >> fail. Easy enough to reproduce by changing the virHashCreate call in >> qemuBlockNodeNameGetBackingChain to use 53 instead of 50 (even 49 causes >> failure). It doesn't matter if the test source also adjusts the initial >> hash size. >> >> Of course this makes sense given that virHashCodeGen is the backend of >> the virHashComputeKey algorithm that uses the return value % >> table->size. So if "size" changes, then obviously order changes. > > As you've said that is expected. I don't quite follow what is your > point here. > I was pointing out that by changing the size you get different ordered results, nothing more, nothing less. It just so happens that 50 returns this set of results, but 53 is different. >> While I get the code bloat conundrum, but if the order doesn't matter >> then I wonder what other variables could change that would affect this >> test and whether we really want to be constantly altering the mocks >> "just" to get the right answer for this test. > > The order depends on the result of the hasing function, the number of > buckets and the order you've added the entries. All of those is > expected when dealing with a hash table. > > After this last patch, you are guaranteed to have the hash function > return deterministic results (even on big endian-boxes). > Change the size to 53, rebuild, and run the test. It'll take a few minutes, but I think you'd see a failure. > The hash size is not really altered very often since you can't really > see a measurable performance benefit in most cases. This is also > strenghtened by the fact, that the hash function is randomized by a > random number, so in real usage you won't be able to deterministically > cause hash collisions. True... at 50 with 4 elements you won't see a resize. You could possibly have a table size of 1 in this instance and be fine, but that only works well for the test! It's more problematic for real world because a table size of 1 grows to 8, 64, etc. Some would say the minimum table size should be at least 8 if not the next prime, e.g. 11. In the long run, UUID's are random, but "#blockN" values are far less random - it's the human randomness factor. No one is going to create 100 randomly named domain names or volume names, so factoring in the less than randomly named elements is what I've been pondering. > > In the test code, the cases are added in a deterministic order. Since > the hashing function in the tests is deterministic as well, the only > thing that would influence the ordering is the bucket count. We don't > modify them too often. > > I don't see what would make us alter mocks "just" to get the right > answer here. The reason for this patch was, that the hash function is > using bit operations and then treating the result as integers, which > obviously does not work deterministically between big and little endian > arches. Other than that, the test is now fully deterministic. > > Peter > Understood - until someone changes the size of the table to be random or less deterministic, then there's no problem. But if that happens, there's a problem. John -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list