Re: [RFC PATCH v2] khash_test.c : add unit tests

Glen Choo <chooglen@xxxxxxxxxx> · Thu, 29 Jun 2023 17:53:16 -0700

Siddharth Singh <siddhartth@xxxxxxxxxx> writes:

> - rewrote the code keeping coding style guidelines in mind.

I noticed several style issues in the patch. I'll avoid commenting on
those here so that we can focus on the other, big questions.

> I'm also very new to writing unit tests in C and git codebase in general, so I'll appreciate constructive feedback and discussion..

Okay. I think this is a good chance for the ML to decide on what sorts
of unit test practices we want too.

> What are the unit test cases to include in khash.h?
> What are the other types of tests that would be useful for khash.h?

For a third party library, I mostly agree with upthread commenters that
testing the correctness of the _library_ code isn't that valuable on its
own, but unit tests are also useful for demonstrating how the code
should be used. In a lot of projects, unit tests are a pseudo-official
demonstration of how an interface should be used, and that's sometimes
better than seeing the code in production, because unit tests are
typically a lot simpler and easier to reason about.

As an aside, this is something that a unit test framework will do _much
better_ than test-tool, because it makes it easy to read and write these
idtiomatic demos without worrying about how the test-tool executable
will be used.

I think that if we consider these two factors together (correctness and
developer education), the fact that we don't have great documentation of
khash.h in-tree, and that the external docs aren't that comprehensive
either, I think it would be useful to exercise each of the khash
'functions' to show how they work.

As Taylor mentioned upthread, I think the most valuable thing to test is
the way our code uses khash.h (not the library code itself). IOW:

  KHASH_INIT(oid_set, struct object_id, int, 0, oidhash_by_value, oideq_by_value)
  KHASH_INIT(oid_map, struct object_id, void *, 1, oidhash_by_value, oideq_by_value)
  KHASH_INIT(oid_pos, struct object_id, int, 1, oidhash_by_value, oideq_by_value)

So we would focus on oidhash_by_value and oideq_by_value in particular,
since that's provided by us. It would be very useful to verify that
these functions are doing exactly what we think they should be doing,
and we could do this with _every_ khash function to really explore the
API surface.

In fact... I suspect that they _not_ doing what we think they should,
or at the very least, might be displaying some very unintuitive
behavior. object_id contains both the hash value _and_ the hash
algorithm that generated it. If we have the same hash value, but
generated by different algorithms, are they considered the same equal or
not? To me, the intuitive answer is "no", but oidhash_by_value() only
considers the hash value and ignores the algorithm altogether! I think
that testing collisions between SHA1 and SHA256 values would be very
useful.

> +/*
> +  These tests are for base assumptions; if they fails, library is broken
> +*/
> +int hash_string_succeeds() {
> +  const char *str = "test_string";
> +  khint_t value = __ac_X31_hash_string(str);
> +  return value == 0xf1a6305e;
> +}
> +
> +int hash_string_with_non_alphabets_succeeds() {
> +  const char *str = "test_string #2";
> +  khint_t value = __ac_X31_hash_string(str);
> +  return value == 0xfa970771;
> +}

For a unit test, it is a best practice to test the external interface
that callers are depending on, and to avoid implementation
details/internals. If we do the latter, we end up with brittle tests
that break on non-important changes. E.g. we definitely should not be
testing the exact value of the hash function. The khash library
maintainer could decide to change the hash function, and we should not
be broken by that.

For khash.h, it appears that that anything prepended with "__" is meant
to be internal, and everything else is external. E.g.
__ac_X31_hash_string is certainly not meant to be used directly by end
users., especially since there's a human-friendly macro for it:

  #define kh_str_hash_func(key) __ac_X31_hash_string(key)

However, because something is external, doesn't mean we need to be
testing its implementation - we still shouldn't be testing the
return value of kh_str_hash_func(). We may want to test certain
_properties_ of the external function (e.g. does it provide good
collision resistance?), but in this case, I think we can trust that what
the library author wrote is a good default.

> +KHASH_INIT(str, const char *, int *, 1, kh_str_hash_func, kh_str_hash_equal);

As mentioned earlier, I think testing oid_[set|map|pos] would be much
more useful, and we should try to demonstrate each of the
macros/functions and how they interact, e.g. if I kh_put_*, I should
expect to be able to kh_get_* afterwards.

> +
> +int initialized_hashmap_pointer_not_null() {
> +  kh_str_t *hashmap = kh_init_str();
> +
> +  if(hashmap != NULL){
> +    free(hashmap);
> +    return 1;
> +  }
> +  return 0;
> +}

I don't think this is a useful test. It would be extremely weird for the
return value to be NULL in regular usage.

> +int put_key_into_hashmap_once_succeeds() {
> +  int ret, value;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if(!ret)
> +    return 0;
> +  return pos != 0;
> +}

I don't think it makes sense to only check the return value, we should
check that it had the intended effect - that we can read back the value.
There is a later test that does this, so I'd prefer we remove this.

Comment on the test framework, not the test: I think it makes sense to
assert that "ret" is nonzero, but this "if..return" way of asserting
makes it impossible to tell what has broken. It also adds a lot of
cognitive overhead when trying to parse the assertion. I'll echo Phillip
Wood's comments and say that I think this is definitely not a framework
I want to use.

> +int put_same_key_into_hashmap_twice_fails() {
> +  int first_return_value, second_return_value, value;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  kh_put_str(hashmap, "test_key", &first_return_value);
> +  kh_put_str(hashmap, "test_key", &second_return_value);
> +  return !second_return_value;
> +}

This test seems reasonable.

> +int put_value_into_hashmap_once_succeeds() {
> +  int ret, value = 14;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  return *kh_value(hashmap, pos) == value;
> +}

Checking the inserted value makes sense to me I don't really understand
khash.h, but is this the recommended way to get values back? kh_value()
looks like just a small macro around an internal array, so we are just
checking that *&value == value. I would have expected that we would call
kh_get to get "pos", then call kh_value on that.

Also, do we have to kh_put_str("some_key") and then immediately set it
again with kh_key(pos)? That seems odd, and I don't see other callers
doing that all the time.

Since this is the second time I see "if (!ret)", I should ask whether
this is meant as an assertion (checking the correctness of the code), or
for flow control (the code might return either 0 or 1, and I need to
handle both). If it is the former, I think it's unnecessary to
assert on it multiple times. The latter does not belong in a unit test -
we should already _know_ what we want to be returned. Btw, it is a huge
flaw in the test framework that both look exactly the same, so I'll say
again that I really don't like it.

> +int put_same_value_into_hashmap_twice_fails() {
> +  int first_return_value, second_return_value, value = 14;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &first_return_value);
> +  if (!first_return_value)
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  kh_put_str(hashmap, "test_key", &second_return_value);
> +  return !second_return_value;
> +}

I don't see how this is different from the previous test that tested for
collisions.

> +int get_existing_kv_from_hashmap_succeeds() {
> +  int ret, value =14;
> +  int expected = value;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  return *kh_value(hashmap, kh_get_str(hashmap, "test_key")) == expected;
> +}

Isn't this identical to put_value_into_hashmap_once_succeeds()?

Also, this assertion is harder to read than necessary because we have
both "expected" and "value" - the different names suggest that they are
different, but they're really the same. Let's just use one of them.

> +int get_nonexisting_kv_from_hashmap_fails() {
> +  int value = 14;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  return !kh_get_str(hashmap, "test_key");
> +}

Looks reasonable.

> +int deletion_from_hashmap_sets_flag() {
> +  int ret, value = 14;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  if(!kh_exist(hashmap, pos))
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  kh_del_str(hashmap, pos);
> +  return !kh_exist(hashmap, pos);
> +}

What is the "flag" being set? If it is the khash-internal flag that
marks that an entry is deleted, that's an unimportant implementation
detail - the important thing is that the value is deleted. Let's just
rename this to "deletion_works" or something.

> +
> +int deletion_from_hashmap_updates_size() {
> +  int ret, value = 14, current_size;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  current_size = hashmap->size;
> +  kh_del_str(hashmap, pos);
> +  return hashmap->size + 1 == current_size;
> +}

The interface is kh_size(), so let's use that instead of referencing the
.size member correctly.

> +int deletion_from_hashmap_does_not_update_kv() {
> +  int ret, value = 14, current_size;
> +  khint_t pos;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos) = xstrdup("test_key");
> +  kh_value(hashmap, pos) = &value;
> +  kh_del_str(hashmap, pos);
> +  return !strcmp(kh_key(hashmap, pos),"test_key");
> +}

Interesting! If we kh_del_(pos) but read it back with kh_key(), we can
still read it back. This has some implications on how khash should be
used - we should be extremely careful if we pass around a returned value
from kh_get_ or kh_put_ because we don't know if it's deleted or not.
Instead, we should prefer kh_get_(). Very useful demonstration.

> +int update_value_after_deletion_succeeds() {
> +  int ret, value1 = 14, value2 = 15;
> +  khint_t pos1, pos2;
> +  kh_str_t *hashmap = kh_init_str();
> +  // adding the kv to the hashmap
> +  pos1 = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos1) = xstrdup("test_key");
> +  kh_value(hashmap, pos1) = &value1;
> +  // delete the kv from the hashmap
> +  kh_del_str(hashmap, pos1);
> +  // adding the same key with different value
> +  pos2 = kh_put_str(hashmap, "test_key", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos2) = xstrdup("test_key");
> +  kh_value(hashmap, pos2) = &value2;
> +  // check if the value is different
> +  return *kh_value(hashmap, kh_get_str(hashmap, "test_key")) == value2;
> +}

Looks reasonable.

> +int hashmap_size_matches_number_of_added_elements() {
> +  int ret, value1 = 14, value2 = 15, value3 = 16;
> +  khint_t pos1, pos2, pos3;
> +  kh_str_t *hashmap = kh_init_str();
> +  pos1 = kh_put_str(hashmap, "test_key1", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos1) = xstrdup("test_key1");
> +  kh_value(hashmap, pos1) = &value1;
> +  pos2 = kh_put_str(hashmap, "test_key2", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos2) = xstrdup("test_key2");
> +  kh_value(hashmap, pos2) = &value2;
> +  pos3 = kh_put_str(hashmap, "test_key3", &ret);
> +  if (!ret)
> +    return 0;
> +  kh_key(hashmap, pos3) = xstrdup("test_key3");
> +  kh_value(hashmap, pos3) = &value3;
> +  return kh_size(hashmap) == 3;
> +}

Looks reasonable.

I would also exercise kh_end, kh_begin, kh_clear, kh_foreach and
kh_foreach_value

kh_destroy might be useful to test too, in case there are oddities
there, but it's probably not that interesting

kh_n_buckets doesn't seem useful to test - it looks too specific to the
impkementation. kh_resize looks like an implementation detail that we
shouldn't test either.