Hi,

On 6/27/2024 10:15 PM, Hou Tao wrote:
> Hi,
>
> On 6/27/2024 11:34 AM, Alexei Starovoitov wrote:
>> On Tue, Jun 25, 2024 at 9:30 PM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> On 6/26/2024 10:06 AM, Alexei Starovoitov wrote:
>>>> On Mon, Jun 24, 2024 at 7:12 AM Hou Tao <houtao@xxxxxxxxxxxxxxx> wrote:
>>>>> Hi,
>>>>>
>>>>> Sorry to resurrect the old thread to continue the discussion of APIs for
>>>>> qp-trie.
>>>>>
> SNIP
>>>>> (3) libbpf API
>>>>>
>>>>> Add bpf_map__get_next_sized_key() to the high-level APIs.
>>>>>
>>>>> LIBBPF_API int bpf_map__get_next_sized_key(const struct bpf_map *map,
>>>>>                                            const void *cur_key,
>>>>>                                            size_t cur_key_sz,
>>>>>                                            void *next_key,
>>>>>                                            size_t *next_key_sz);
>>>>>
>>>>> Add bpf_map_update_sized_elem()/bpf_map_lookup_sized_elem()/
>>>>> bpf_map_delete_sized_elem()/bpf_map_get_next_sized_key() to the
>>>>> low-level APIs. These APIs already handle the case in which the map
>>>>> has a variable-size value, so there will be no need to add other new
>>>>> APIs to support that case.
>>>>>
>>>>> LIBBPF_API int bpf_map_update_sized_elem(int fd, const void *key,
>>>>>                                          size_t key_sz,
>>>>>                                          const void *value, size_t value_sz,
>>>>>                                          __u64 flags);
>>>>> LIBBPF_API int bpf_map_lookup_sized_elem(int fd, const void *key,
>>>>>                                          size_t key_sz,
>>>>>                                          void *value, size_t *value_sz,
>>>>>                                          __u64 flags);
>>>>> LIBBPF_API int bpf_map_delete_sized_elem(int fd, const void *key,
>>>>>                                          size_t key_sz,
>>>>>                                          __u64 flags);
>>>>> LIBBPF_API int bpf_map_get_next_sized_key(int fd,
>>>>>                                           const void *key, size_t key_sz,
>>>>>                                           void *next_key,
>>>>>                                           size_t *next_key_sz);
>>>> I don't like this approach.
>>>> It looks messy to me and solves only one specific case where the
>>>> key/value is a blob of bytes.
>>>> In other words it's taking the api back to pre-BTF days when everything
>>>> was an opaque blob.
>>> I see.
>>>> I think we need a new dynptr-like object that is composable with other
>>>> types, so that the user can say that the key is:
>>>>
>>>> struct map_key {
>>>>         long foo;
>>>>         dynptr_like array;
>>>>         int bar;
>>>> };
>>>>
>>>> I'm not sure whether the existing bpf_dynptr fits exactly, but it's
>>>> close enough.
>>>> Such a dynptr_like object should be usable as a string,
>>>> and a map should allow two such strings:
>>>>
>>>> struct map_key {
>>>>         dynptr_like file_name;
>>>>         dynptr_like dir;
>>>> };
>>>>
>>>> and the BTF for such a map should distinguish it as two strings
>>>> and not as a single blob of bytes.
>>>> The observability of bpf maps with bpftool should be able to print it.
>>>>
>>>> The use of such an api will look the same from a bpf prog and from
>>>> user space. A bpf prog can do:
>>>>
>>>> struct map_key key;
>>>> bpf_dynptr_from_whatever(&key.file_name, ...);
>>>> bpf_dynptr_from_whatever(&key.dir, ...);
>>>> bpf_map_lookup_elem(map, &key);
>>>>
>>>> and similar from user space.
>>>> bpf_dynptr_user will be a struct with a size and a pointer.
>>>> The existing sys_bpf commands will stay as-is.
>>>> The user space will do:
>>>>
>>>> struct map_key {
>>>>         bpf_dynptr_user file_name;
>>>>         bpf_dynptr_user dir;
>>>> } key;
>>>>
>>>> key.dir.size = 1000;
>>>> key.dir.ptr = malloc(1000);
>>>> ...
>>>> bpf_map_lookup_elem(&key); // existing syscall cmd
>>>>
>>>> In this case sizeof(struct map_key) == sizeof(bpf_dynptr_user) * 2 == 32,
>>>> both for a bpf prog and for user space.
>>> It seems the idea could be implemented through both hash-table and qp-trie.
>> yes. I think so.
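To make the prog-side flow concrete, a lookup with the existing
bpf_dynptr_from_mem() helper could look roughly like the sketch below. It is
only a sketch: the map definition and SEC() are elided, the names are
illustrative, and the verifier does not accept a dynptr inside a map key
upstream today.

struct map_key {
        struct bpf_dynptr file_name;
        struct bpf_dynptr dir;
};

/* buffers holding the variable-sized parts of the key */
char name_buf[64], dir_buf[64];

static void *lookup_by_dynptr_key(void *map, __u32 name_len, __u32 dir_len)
{
        struct map_key key = {};

        /* wrap each buffer into a dynptr embedded in the key */
        bpf_dynptr_from_mem(name_buf, name_len, 0, &key.file_name);
        bpf_dynptr_from_mem(dir_buf, dir_len, 0, &key.dir);

        /* existing helper; the map would see two variable-sized fields */
        return bpf_map_lookup_elem(map, &key);
}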
>>> For hash-table, firstly we need to keep the offset of each dynptr_like
>>> object. During an update operation, we need to calculate the hash for each
>>> dynptr_like object and combine these hashes into a new hash. During
>>> lookup, we need to compare each dynptr_like object alone to check
>>> whether or not it is the same as the target element.
>> yep.
>> We already have the btf_field/btf_record infra that describes the map value.
>> It can be used to describe the map key just as well.
>> The tricky part would be to make the hashing and comparison of dynptr-s
>> quick enough without hurting the common use case of a traditional hash map.
>>
>>> For qp-trie, we also need to keep the offset of each dynptr_like
>>> object. During an update operation, we should marshal the passed key into
>>> a plain blob and save the plain blob in the qp-trie. During lookup, we
>>> don't marshal the input key; instead we look up the qp-trie by using each
>>> field in the map key step-wise.
>> yes. exactly.
>>
>>> However, for the get_next_key operation, we need to unmarshal the plain
>>> blob into a dynptr_like object.
>> hmm. I haven't thought about get_next_key for qp-trie.
>> A bit tricky indeed.
>>
>>> For the two hypothetical implementations above, I think the lookup
>>> performance of the hash-table may be better than that of qp-trie and its
>>> memory usage will not be bad, so I prefer to support dynptr_like objects
>>> in the hash map key first. WDYT?
>> I'd actually do it with qp-trie first.
>> bpftrace is using strings as keys a lot and some scripts use a "multi key"
>> too (like string plus int).
>> Currently bpftrace sizes up the hash key_size to the max and it's not
>> efficient.
>> I think qp-trie with multi-key support would work very well for that use
>> case.
>>
>> iirc the early version of qp-trie was limited to nul-terminated strings
>> as a key, but I believe it's possible to tweak the algorithm to support
>> a binary key. Then what you are describing above for lookup/update of a
>> key will work nicely.
>> I also suspect that lookup will be much faster in qp-trie compared
>> to a hash table, hence that's what I would implement first.
> The first version of qp-trie already supported using arbitrary bytes as
> the key [1]. The lookup performance comparison between qp-trie and
> hash-table varies according to the benchmark results in the early
> patch-set [1]. For normal strings (e.g., strings in BTF or kallsyms),
> hash-table performs better. I will try to work out a hacked version of
> both hash-table and qp-trie to compare the lookup performance first.

I have hacked the bpf verifier (mainly check_stack_range_initialized()) to
support a variable-sized key for both qp-trie and hash-table. For now, only
one bpf_dynptr is allowed in the key. I have also updated the benchmark in
the qp-trie patch-set [1] to compare the lookup performance between the
normal hash-table, the hash-table with a variable-sized key (namely
dynkey-htab), and qp-trie. The key used for both dynkey-htab and qp-trie is
shown below:

struct test_key {
        __u64 cookie;
        struct bpf_dynptr_user desc;
};
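From user space such a key is passed through the existing syscall commands.
A minimal sketch of a lookup, assuming bpf_dynptr_user is just a pointer
plus a size as described above (the final layout may well differ):

#include <string.h>
#include <linux/types.h>
#include <bpf/bpf.h>

/* assumed user-space counterpart of the in-kernel dynptr */
struct bpf_dynptr_user {
        void *ptr;
        __u32 size;
};

struct test_key {
        __u64 cookie;
        struct bpf_dynptr_user desc;
};

static int lookup_str(int map_fd, __u64 cookie, const char *str, void *value)
{
        struct test_key key = {};

        key.cookie = cookie;
        key.desc.ptr = (void *)str;
        key.desc.size = strlen(str) + 1;

        /* existing libbpf wrapper for the BPF_MAP_LOOKUP_ELEM command */
        return bpf_map_lookup_elem(map_fd, &key, value);
}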
For randomly generated dynptr keys, when the max length of ->desc is big
(e.g., >128), the performance of qp-trie is the best: it is 20%~80% faster
than dynkey-htab. I am not sure why dynkey-htab is slower; it may be due to
the overhead of the hash function or to hash collisions. The performance of
dynkey-htab is within about 5% of the normal hash-table when the max length
of ->desc is 128, and when the max length of ->desc is 512, dynkey-htab is
about 20% faster than the normal hash-table, as shown below:

| max length of desc | qp-trie | dynkey-hash-tab | normal hash-tab |
| --- | --- | --- | --- |
| 64   | 7.5 M/s | 7.1 M/s | 8.3 M/s |
| 128  | 6.7 M/s | 5.3 M/s | 5.3 M/s |
| 256  | 4.9 M/s | 3.4 M/s | 3.2 M/s |
| 512  | 3.5 M/s | 2.1 M/s | 1.8 M/s |
| 1024 | 2.5 M/s | 1.4 M/s | 1.1 M/s |
| 2048 | 1.7 M/s | 0.9 M/s | 0.6 M/s |

When using strings from BTF, kallsyms or the Alexa top 1M sites as the
dynptr in test_key, the performance of qp-trie is about 40% or more slower
than both dynkey-htab and the normal hash-table. The mean length of the
strings in these input files is about 17/24/15 respectively. There is no
big difference between the lookup performance of dynkey-htab and the normal
hash-table; the small gap that remains may come from the implementation of
dynkey-htab, which invokes jhash three times to get the final hash value
(see the sketch at the end of this mail).

| input | qp-trie | dynkey-hash-tab | normal hash-tab |
| --- | --- | --- | --- |
| BTF      | 4.6 M/s | 7.3 M/s | 7.4 M/s |
| kallsyms | 4.7 M/s | 6.5 M/s | 6.5 M/s |
| top 1M   | 2.4 M/s | 4.4 M/s | 4.3 M/s |

The following is the detailed output of these benchmarks:

(1) Randomly generated keys (max/mean/stddev length of str in test_key:
256/128/74)

$ sudo ./bench qp-trie-lookup --entries 100000 -a
Setting up benchmark 'qp-trie-lookup'...
generate 100000 random keys
str length: max 256 mean 128 stdev 74
Benchmark 'qp-trie-lookup' started.
Iter 0 ( 26.974us): hits 4.743M/s ( 4.743M/prod), drops 0.009M/s, total operations 4.753M/s
Iter 1 ( 23.165us): hits 4.957M/s ( 4.957M/prod), drops 0.009M/s, total operations 4.967M/s
Iter 2 (-22.390us): hits 4.994M/s ( 4.994M/prod), drops 0.010M/s, total operations 5.004M/s
Iter 3 ( -1.854us): hits 5.003M/s ( 5.003M/prod), drops 0.010M/s, total operations 5.013M/s
Iter 4 ( -0.583us): hits 4.997M/s ( 4.997M/prod), drops 0.010M/s, total operations 5.007M/s
Iter 5 ( 44.078us): hits 5.001M/s ( 5.001M/prod), drops 0.010M/s, total operations 5.011M/s
Iter 6 (-36.684us): hits 5.003M/s ( 5.003M/prod), drops 0.010M/s, total operations 5.013M/s
Summary: hits 4.993 ± 0.018M/s ( 4.993M/prod), drops 0.010 ± 0.000M/s, total operations 5.002 ± 0.018M/s

$ sudo ./bench dynkey-htab-lookup --entries 100000 -a
Setting up benchmark 'dynkey-htab-lookup'...
generate 115980 random keys
str length: max 256 mean 129 stdev 74
Benchmark 'dynkey-htab-lookup' started.
Iter 0 ( 23.154us): hits 3.167M/s ( 3.167M/prod), drops 0.006M/s, total operations 3.173M/s
Iter 1 ( 7.390us): hits 3.317M/s ( 3.317M/prod), drops 0.006M/s, total operations 3.324M/s
Iter 2 ( 5.382us): hits 3.321M/s ( 3.321M/prod), drops 0.007M/s, total operations 3.327M/s
Iter 3 (-16.280us): hits 3.337M/s ( 3.337M/prod), drops 0.007M/s, total operations 3.344M/s
Iter 4 ( 6.161us): hits 3.326M/s ( 3.326M/prod), drops 0.006M/s, total operations 3.333M/s
Iter 5 ( 36.514us): hits 3.328M/s ( 3.328M/prod), drops 0.006M/s, total operations 3.334M/s
Iter 6 ( 3.131us): hits 3.322M/s ( 3.322M/prod), drops 0.007M/s, total operations 3.329M/s
Summary: hits 3.325 ± 0.007M/s ( 3.325M/prod), drops 0.006 ± 0.000M/s, total operations 3.332 ± 0.007M/s

$ sudo ./bench htab-lookup --entries 100000 -a
Setting up benchmark 'htab-lookup'...
generate 100000 random keys
str length: max 256 mean 128 stdev 74
Benchmark 'htab-lookup' started.
Iter 0 ( 24.011us): hits 3.155M/s ( 3.155M/prod), drops 0.007M/s, total operations 3.161M/s
Iter 1 ( 5.609us): hits 3.270M/s ( 3.270M/prod), drops 0.007M/s, total operations 3.277M/s
Iter 2 ( 1.621us): hits 3.258M/s ( 3.258M/prod), drops 0.007M/s, total operations 3.264M/s
Iter 3 ( -5.675us): hits 3.260M/s ( 3.260M/prod), drops 0.007M/s, total operations 3.267M/s
Iter 4 ( -5.918us): hits 3.270M/s ( 3.270M/prod), drops 0.007M/s, total operations 3.276M/s
Iter 5 ( 58.144us): hits 3.273M/s ( 3.273M/prod), drops 0.007M/s, total operations 3.280M/s
Iter 6 (-50.080us): hits 3.254M/s ( 3.254M/prod), drops 0.007M/s, total operations 3.261M/s
Slab: 49.674 MiB
Summary: hits 3.264 ± 0.008M/s ( 3.264M/prod), drops 0.007 ± 0.000M/s, total operations 3.271 ± 0.008M/s

(2) Strings in BTF

$ sudo ./bench qp-trie-lookup --file btf.txt -a
Setting up benchmark 'qp-trie-lookup'...
item 115980
str length: max 71 mean 17 stdev 8
Benchmark 'qp-trie-lookup' started.
Iter 0 ( 41.758us): hits 4.570M/s ( 4.570M/prod), drops 0.000M/s, total operations 4.570M/s
Iter 1 ( -4.925us): hits 4.651M/s ( 4.651M/prod), drops 0.000M/s, total operations 4.651M/s
Iter 2 (-16.757us): hits 4.667M/s ( 4.667M/prod), drops 0.000M/s, total operations 4.667M/s
Iter 3 ( 6.480us): hits 4.672M/s ( 4.672M/prod), drops 0.000M/s, total operations 4.672M/s
Iter 4 ( -8.465us): hits 4.670M/s ( 4.670M/prod), drops 0.000M/s, total operations 4.670M/s
Iter 5 ( 45.533us): hits 4.669M/s ( 4.669M/prod), drops 0.000M/s, total operations 4.669M/s
Iter 6 (-28.153us): hits 4.674M/s ( 4.674M/prod), drops 0.000M/s, total operations 4.674M/s
Summary: hits 4.667 ± 0.008M/s ( 4.667M/prod), drops 0.000 ± 0.000M/s, total operations 4.667 ± 0.008M/s

$ sudo ./bench dynkey-htab-lookup --file btf.txt -a
Setting up benchmark 'dynkey-htab-lookup'...
item 115980
str length: max 71 mean 17 stdev 8
Benchmark 'dynkey-htab-lookup' started.
Iter 0 ( 18.721us): hits 7.144M/s ( 7.144M/prod), drops 0.000M/s, total operations 7.144M/s
Iter 1 ( 5.173us): hits 7.317M/s ( 7.317M/prod), drops 0.000M/s, total operations 7.317M/s
Iter 2 (-10.935us): hits 7.297M/s ( 7.297M/prod), drops 0.000M/s, total operations 7.297M/s
Iter 3 ( 12.811us): hits 7.330M/s ( 7.330M/prod), drops 0.000M/s, total operations 7.330M/s
Iter 4 (-13.239us): hits 7.314M/s ( 7.314M/prod), drops 0.000M/s, total operations 7.314M/s
Iter 5 ( 51.453us): hits 7.308M/s ( 7.308M/prod), drops 0.000M/s, total operations 7.308M/s
Iter 6 (-40.370us): hits 7.324M/s ( 7.324M/prod), drops 0.000M/s, total operations 7.324M/s
Summary: hits 7.315 ± 0.012M/s ( 7.315M/prod), drops 0.000 ± 0.000M/s, total operations 7.315 ± 0.012M/s

$ sudo ./bench htab-lookup --file btf.txt -a
Setting up benchmark 'htab-lookup'...
item 115980
str length: max 71 mean 17 stdev 8
Benchmark 'htab-lookup' started.
Iter 0 ( 24.738us): hits 7.223M/s ( 7.223M/prod), drops 0.000M/s, total operations 7.223M/s
Iter 1 ( 18.379us): hits 7.352M/s ( 7.352M/prod), drops 0.000M/s, total operations 7.352M/s
Iter 2 (-23.881us): hits 7.419M/s ( 7.419M/prod), drops 0.000M/s, total operations 7.419M/s
Iter 3 ( -0.607us): hits 7.438M/s ( 7.438M/prod), drops 0.000M/s, total operations 7.438M/s
Iter 4 ( 7.016us): hits 7.410M/s ( 7.410M/prod), drops 0.000M/s, total operations 7.410M/s
Iter 5 ( 31.897us): hits 7.403M/s ( 7.403M/prod), drops 0.000M/s, total operations 7.403M/s
Iter 6 (-14.837us): hits 7.428M/s ( 7.428M/prod), drops 0.000M/s, total operations 7.428M/s
Slab: 22.310 MiB
Summary: hits 7.408 ± 0.030M/s ( 7.408M/prod), drops 0.000 ± 0.000M/s, total operations 7.408 ± 0.030M/s

> [1]:
> https://lore.kernel.org/bpf/20220726130005.3102470-1-houtao1@xxxxxxxxxx/
>
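P.S. The three jhash invocations mentioned earlier work roughly as in the
sketch below: one call for the fixed-size part of the key, one for the bytes
behind the dynptr, and one to fold the intermediate hashes together. This is
a simplified illustration of the idea, not the exact code of the hack.

#include <linux/jhash.h>

static u32 dynkey_hash(const void *fixed, u32 fixed_size,
                       const void *dyn, u32 dyn_size)
{
        u32 h[2];

        h[0] = jhash(fixed, fixed_size, 0);  /* e.g. test_key.cookie */
        h[1] = jhash(dyn, dyn_size, 0);      /* bytes behind ->desc */

        return jhash(h, sizeof(h), 0);       /* fold the two hashes */
}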