Re: [PATCH] hashmap: hashmap_get_next passes through keydata as well

Stefan Beller <sbeller@xxxxxxxxxx> · Sat, 13 May 2017 07:06:35 -0700

On Sat, May 13, 2017 at 1:50 AM, Jeff King <peff@xxxxxxxx> wrote:
> On Fri, May 12, 2017 at 01:02:44PM -0700, Stefan Beller wrote:
>
>> The 'keydata' may be of value in the underlying compare function to decide
>> if the given two entries are the same.
>
> I had to scratch my head over this for a minute, because there isn't
> really any motivating example of what you're trying to do.
>
> I think I figured it out, but I have a feeling it is violating the
> intent of the "keydata" parameter.  That parameter is typically used not
> as a pointer to arbitrary auxiliary data, but as a trick for finding a
> hash entry without having to allocate a struct for it.

Yes, I was violating the intent exactly as you describe. I'll adapt my patches
accordingly.

I do not really buy into the trick though, because when the overhead of
allocating a 'key' struct filling in the key parts only leaving out the value
is so much more expensive than giving the key via this keydata argument,
then there are serious issues with your data structure IMHO.
Example:

  struct {
    struct hashmap_entry ent;
    int actual key;
    char value[4096*1024]
  } key_plus_value;

Now when you want to look up a specific key, you don't want to allocate
such a struct on the stack, as we'd be wasting 4M memory of the stack
trashing the caches. However the neat part about this struct is that the
key part is (a) totally separate from the values, and (b) comes before
any value part, such that

  struct {
    struct hashmap_entry ent;
    int actual key;
  } key_only;

  struct key_only = {{NULL, 0}, 42};
  struct key_plus_value *match = hashmap_get(&map, &key, NULL);

works, I would think?

>
> So generally, I'd think two entries in the table should be able to be
> compared on their own merits, even if no keydata is available. Without
> that property, any internal operations in the hashmap can't actually do
> an entry comparison (e.g., a table resize that needs to rehash the
> entries).
>
> It works out in the current code because the chaining is purely linear,
> and doesn't care about order. So we can rehash and just stick the
> elements into a new list. But if it were switched out for a different
> data structure (e.g., a tree), then the hashmap code would need to be
> able to compare elements.

Note that most compare functions do not return an order, but only
a boolean [no]match, so putting it into an ordered tree could only
rely on the hash that we already know without aid from the compare function.
Of course we could fix our compare functions before doing such a
refactoring, but I want to point out how involved that would be.

>
> I don't think we have any particular plans for such a change, but I
> wonder if we should avoid encouraging callers to rely on the current
> implementation.

After a night of sleep it is easy to fix my code to behave as the API
intended. Yesterday I could not see how to fix it.

Thanks,
Stefan

>
> -Peff