Re: [PATCH v2 bpf-next 1/4] bpf: Add bloom filter map implementation

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Mon, 20 Sep 2021 16:21:09 -0700

On Mon, Sep 20, 2021 at 3:52 PM Joanne Koong <joannekoong@xxxxxx> wrote:
>
> My previous email replied to Alexei's email before I saw Andrii's new
> email, so please
> feel free to disregard my previous email.

Never got that reply of yours. Alexei's email arrived a few hours
after I've already replied to you. It was a time warp anomaly :)

>
> On 9/20/21 1:58 PM, Andrii Nakryiko wrote:
>
> > On Fri, Sep 17, 2021 at 6:08 PM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> >> On Mon, Sep 13, 2021 at 09:04:30PM -0700, Joanne Koong wrote:
> >>> +
> >>> +/* For bloom filter maps, the next 4 bits represent how many hashes to use.
> >>> + * The maximum number of hash functions supported is 15. If this is not set,
> >>> + * the default number of hash functions used will be 5.
> >>> + */
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_1 = (1U << 13),
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_2 = (1U << 14),
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_3 = (1U << 15),
> >>> +     BPF_F_BLOOM_FILTER_HASH_BIT_4 = (1U << 16),
> >> The bit selection is unintuitive.
> >> Since key_size has to be zero may be used that instead to indicate the number of hash
> >> functions in the rare case when 5 is not good enough?
> > Hm... I was initially thinking about proposing something like that,
> > but it felt a bit ugly at the time. But now thinking about this a bit
> > more, I think this would be a bit more meaningful if we change the
> > terminology a bit. Instead of saying that Bloom filter has values and
> > no keys, we actually have keys and no values. So all those bytes that
> > are hashed are treated as keys (which is actually how sets are
> > implemented on top of maps, where you have keys and no values, or at
> > least the value is always true).
> >
> > So with that we'll have key/key_size to specify number of bytes that
> > needs to be hashed (and it's type info). And then we can squint a bit
> > and say that number of hashes are specified by value_size, as in
> > values are those nr_hash bits that we set in Bloom filter.
> >
> > Still a bit of terminology stretch, but won't necessitate those
> > specialized fields just for Bloom filter map. But if default value is
> > going to be good enough for most cases and most cases won't need to
> > adjust number of hashes, this seems to be pretty clean to me.
>
> With having bloom filter map keys instead of values,  I think this would
> lead to messier code in the kernel for handling map_lookup_elem
> and map_update_elem calls, due to the fact that the bloom filter map
> is a non-associative map and the current APIs for non-associative map types
> (peek_elem/push_elem/pop_elem) all have the map data as the value and
> not the key.
>
> For example, for map_update_elem, the API from the eBPF program side is
>
> int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64
> flags);
>
> This would require us to either
>
> a) Add some custom logic in syscall.c so that we bypass the
> copy_from_bpfptr call on
> bloom filter map values (necessary because memcpying 0 bytes still
> requires the src pointer
> to be valid), which would allow us to pass in a NULL value
>
> b) Add a new function like
>
> int (*map_push_key)(struct bpf_map *map, void *key, u64 flags)
>
> that eBPF programs can call instead of map_update_elem.
>
> or
>
> c) Try to repurpose the existing map_push_elem API by passing in the key
> instead of the value,
> which would lead to inconsistent use of the API
>
> I think if we could change the non-associative map types (currently only
> stack maps and queue
> maps, I believe) to have their data be a key instead of a value, and
> have the pop/peek APIs use
> keys instead of values, then this would be cleaner, since we could then
> just use the existing peek/pop
> APIs.

I don't think we can change existing map APIs anymore, unfortunately.

>
> >> Or use inner_map_fd since there is no possibility of having an inner map in bloomfilter.
> >> It could be a union:
> >>      __u32   max_entries;    /* max number of entries in a map */
> >>      __u32   map_flags;      /* BPF_MAP_CREATE related
> >>                               * flags defined above.
> >>                               */
> >>      union {
> >>         __u32  inner_map_fd;   /* fd pointing to the inner map */
> >>         __u32  nr_hash_funcs;  /* or number of hash functions */
> >>      };
> > This works as well. A bit more Bloom filter-only terminology
> > throughout UAPI and libbpf, but I'd be fine with that as well.
> >
> Great, it looks like this is the consensus - I will go with this option
> for v3!
> >>      __u32   numa_node;      /* numa node */
> >>
> >>> +struct bpf_bloom_filter {
> >>> +     struct bpf_map map;
> >>> +     u32 bit_array_mask;
> >>> +     u32 hash_seed;
> >>> +     /* If the size of the values in the bloom filter is u32 aligned,
> >>> +      * then it is more performant to use jhash2 as the underlying hash
> >>> +      * function, else we use jhash. This tracks the number of u32s
> >>> +      * in an u32-aligned value size. If the value size is not u32 aligned,
> >>> +      * this will be 0.
> >>> +      */
> >>> +     u32 aligned_u32_count;
> >> what is the performance difference?
> >> May be we enforce 4-byte sized value for simplicity?
> > This might be a bit too surprising, especially if keys are just some
> > strings, where people might not expect that it has to 4-byte multiple
> > size. And debugging this without extra tooling (like retsnoop) is
> > going to be nightmarish.
> >
> > If the performance diff is huge and that if/else logic is
> > unacceptable, we can also internally pad with up to 3 zero bytes and
> > include those into the hash.
> I think the if/else logic is unavoidable if we support non 4-byte
> aligned value sizes,
> unless we are okay with truncating any remainder bytes of non 4-byte
> aligned values
> and stipulating that a bloom filter map value size has to be greater
> than 4 bytes (these
> conditions would allow us to use jhash2 for every value without an
> if/else check). If we
> internally pad, we will have to pad on every update and lookup, which
> would also
> require an if/else.
> Thanks for the comments and reviews, Alexei and Andrii. They are much
> appreciated!

I don't think truncation is an option. And I also forgot that we don't
really store values, so there is nothing to pad, really. So yeah, I'd
keep it as is, especially if that is not expensive (which I assume
it's not). As I mentioned before, that logic can be encapsulated in a
dedicated helper function and reused in a few places.