Re: [PATCH v2 3/3] libselinux: performance optimization for duplicate detection

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2023/2/25 22:32, wanghuizhao wrote:
  /*
+ * hash calculation and key comparison of hash table
+ */
+
+static unsigned int symhash(hashtab_t h, const_hashtab_key_t key)
+{
+       const struct chkdups_key *k = (const struct chkdups_key *)key;
+       const char *p = NULL;
+       size_t size;
+       unsigned int val = 0;
+
+       size = strlen(k->regex);
+       for (p = k->regex; ((size_t) (p - k->regex)) < size; p++)
+               val =
+                       ((val << 4) | ((val >> (8 * sizeof(unsigned int) - 4)) +
+                       k->mode)) ^ (*p);
v1 added k->mode after the bit-wise or (probably changed by the added
parenthesis due to the compiler warning).
Using

      (((val << 4) | (val >> (8 * sizeof(unsigned int) - 4))) + k->mode) ^ (*p);

gives a slightly better spread (tested against refpolicy (master)):

     nodups_spec:  6062 entries and 606/606 buckets used, longest chain length 21
vs
     nodups_spec:  6062 entries and 606/606 buckets used, longest chain length 20

And for a adjusted hashtable size (see below):

     nodups_spec:  6062 entries and 3807/6062 buckets used, longest
chain length 6
vs
     nodups_spec:  6062 entries and 3815/6062 buckets used, longest
chain length 6


That's good advice. I compared the two hash calculations with my own test set.

Using

     (((val << 4) | (val >> (8 * sizeof(unsigned int) - 4))) + k->mode) ^ (*p);

The hash result is better.

root ~ # semodule -i myapp1.pp
nodups_specs: 5002 entries and 500/500 buckets used, longest chain length 16
root ~ # semodule -i myapp2.pp
nodups_specs: 10002 entries and 1000/1000 buckets used, longest chain length 20
root ~ # semodule -i myapp3.pp
nodups_specs: 20002 entries and 1539/2000 buckets used, longest chain length 20

adjusted hashtable size:

root ~ # semodule -i myapp1.pp
nodups_specs: 5002 entries and 3550/5002 buckets used, longest chain length 3
root ~ # semodule -i myapp2.pp
nodups_specs: 10002 entries and 7060/10002 buckets used, longest chain length 4
root ~ # semodule -i myapp3.pp
nodups_specs: 20002 entries and 13454/20002 buckets used, longest chain length 4

If I use the hash calculation of the patch v2, the result is as follows:

root ~ # semodule -i myapp1.pp
nodups_specs: 5002 entries and 500/500 buckets used, longest chain length 16
root ~ # semodule -i myapp2.pp
nodups_specs: 10002 entries and 1000/1000 buckets used, longest chain length 22
root ~ # semodule -i myapp3.pp
nodups_specs: 20002 entries and 1310/2000 buckets used, longest chain length 22

adjusted hashtable size:

root ~ # semodule -i myapp1.pp
nodups_specs: 5002 entries and 3551/5002 buckets used, longest chain length 4
root ~ # semodule -i myapp2.pp
nodups_specs: 10002 entries and 5602/10002 buckets used, longest chain length 5
root ~ # semodule -i myapp3.pp
nodups_specs: 20002 entries and 8975/20002 buckets used, longest chain length 6


The result is obvious that the hash calculation of the patch v1 is better. Thank you very much for your review and I'll fix it in the next patch version.


(((val << 4) | (val >> (8 * sizeof(unsigned int) - 4))) + k->mode) ^ (*p);

I found that this hash calculation would cause a problem. In a scenario like this:

(1) regex: /tmp/something  mode: 0

(2) regex: /tmp/something  mode: 123

In the original checking logic, the two are judged as conflicting. However, in the new function implementation, `mode` is added to the hash calculation. As a result, they may have different hash values, so duplicates cannot be detected. I will fix this bug in the next patch.




[Index of Archives]     [Selinux Refpolicy]     [Linux SGX]     [Fedora Users]     [Fedora Desktop]     [Yosemite Photos]     [Yosemite Camping]     [Yosemite Campsites]     [KDE Users]     [Gnome Users]

  Powered by Linux