Re: [PATCH bpf-next v6 0/2] bpf: Add a generic bits iterator

Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> · Mon, 29 Apr 2024 09:27:55 -0700

On Sun, Apr 28, 2024 at 6:47 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Sat, Apr 27, 2024 at 12:51 AM Andrii Nakryiko
> <andrii.nakryiko@xxxxxxxxx> wrote:
> >
> > On Thu, Apr 25, 2024 at 10:05 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > >
> > > On Fri, Apr 26, 2024 at 2:15 AM Andrii Nakryiko
> > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Apr 24, 2024 at 10:37 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> > > > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > > > > > > > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > > > > > > > iteration of the bits from a given address and a given number of bits.
> > > > > > > >
> > > > > > > > - bpf_iter_bits_new
> > > > > > > >   Initialize a new bits iterator for a given memory area. Due to the
> > > > > > > >   limitation of bpf memalloc, the max number of bits to be iterated
> > > > > > > >   over is (4096 * 8).
> > > > > > > > - bpf_iter_bits_next
> > > > > > > >   Get the next bit in a bpf_iter_bits
> > > > > > > > - bpf_iter_bits_destroy
> > > > > > > >   Destroy a bpf_iter_bits
> > > > > > > >
> > > > > > > > The bits iterator can be used in any context and on any address.
> > > > > > > >
> > > > > > > > Changes:
> > > > > > > > - v5->v6:
> > > > > > > >   - Add positive tests (Andrii)
> > > > > > > > - v4->v5:
> > > > > > > >   - Simplify test cases (Andrii)
> > > > > > > > - v3->v4:
> > > > > > > >   - Fix endianness error on s390x (Andrii)
> > > > > > > >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > > > > > > > - v2->v3:
> > > > > > > >   - Optimization for u64/u32 mask (Andrii)
> > > > > > > > - v1->v2:
> > > > > > > >   - Simplify the CPU number verification code to avoid the failure on s390x
> > > > > > > >     (Eduard)
> > > > > > > > - bpf: Add bpf_iter_cpumask
> > > > > > > >   https://lwn.net/Articles/961104/
> > > > > > > > - bpf: Add new bpf helper bpf_for_each_cpu
> > > > > > > >   https://lwn.net/Articles/939939/
> > > > > > > >
> > > > > > > > Yafang Shao (2):
> > > > > > > >   bpf: Add bits iterator
> > > > > > > >   selftests/bpf: Add selftest for bits iter
> > > > > > > >
> > > > > > > >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> > > > > > > >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> > > > > > > >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> > > > > > > >  3 files changed, 249 insertions(+)
> > > > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> > > > > > > >
> > > > > > > > --
> > > > > > > > 2.39.1
> > > > > > > >
> > > > > > >
> > > > > > > It appears that the test case failed on s390x when the data is
> > > > > > > a u32 value because we need to set the higher 32 bits.
> > > > > > > will analyze it.
> > > > > > >
> > > > > >
> > > > > > Hey Yafang, did you get a chance to debug and fix the issue?
> > > > > >
> > > > >
> > > > > Hi Andrii,
> > > > >
> > > > > Apologies for the delay; I recently returned from an extended holiday.
> > > > >
> > > > > The issue stems from the limitations of bpf_probe_read_kernel() on
> > > > > s390 architecture. The attachment provides a straightforward example
> > > > > to illustrate this issue. The observed results are as follows:
> > > > >
> > > > >     Error: #463/1 verifier_probe_read/probe read 4 bytes
> > > > >     8897 run_subtest:PASS:obj_open_mem 0 nsec
> > > > >     8898 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > > >     8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > > >     8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
> > > > >
> > > > >     Error: #463/2 verifier_probe_read/probe read 8 bytes
> > > > >     8903 run_subtest:PASS:obj_open_mem 0 nsec
> > > > >     8904 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > > >     8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > > >     8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
> > > > >
> > > > > More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
> > > > >
> > > > > Should we consider this behavior of bpf_probe_read_kernel() as
> > > > > expected, or is it something that requires fixing?
> > > > >
> > > >
> > > > I might be missing something, but there is nothing wrong with
> > > > bpf_probe_read_kernel() behavior. In "read 4" case you are overwriting
> > > > only upper 4 bytes of u64, so lower 4 bytes are garbage. In "read 8"
> > > > you are reading (upper) 4 bytes of garbage from uninitialized
> > > > data_dst.
> > >
> > > The issue doesn't lie with the dst but rather with the src. Even after
> > > initializing the destination, the operation still fails. You can find
> >
> > Are you sure the operation "fails"? If it would fail, you'd get a
> > negative error code, but you are getting zero. Which actually makes
> > sense.
> >
> > I think you are just getting confused by big endianness of s390x, and
> > there is nothing wrong with bpf_probe_read_kernel().
> >
> > In both of your tests (I pasted your code below, it would be better if
> > you did it in your initial emails) you end up with 0x200 in *upper* 32
> > bits (on big endian) and lower bits are zeros. And __retval thing is
> > 32-bit (despite BPF program returning long), so this return value is
> > truncated to *lower* 32-bits, which are, expectedly, zeroes.
>
> Thank you for clarifying. The presence of the 32-bit __retval led to
> my misunderstanding :(
>
> >
> > So I think everything works as expected, but your tests (at least)
> > don't handle the big-endian arch well.
>
> The issue arises when the dst and src have different sizes, causing
> bpf_probe_read_kernel_common() to handle them poorly on big-endian
> machines. To address this, we need to calculate the offset for
> copying, as demonstrated by the following
>
>    bpf_probe_read_kernel_common(&kit->bits_copy + offset, size,
> unsafe_ptr__ign);
>
> One might wonder why this calculation is not incorporated directly
> into the implementation of bpf_probe_read_kernel_common() ?

Let's stop talking about
bpf_probe_read_kernel/bpf_probe_read_kernel_common doing anything
wrong or not handling anything wrong. It's not, it's correct. Your
code is *using* it wrong, and that's what we'll have to fix. The
contract of bpf_probe_read_kernel is in terms of memory addresses of
source/destination *bytes* and the *common* size of both source and
destination. bpf_probe_read_kernel() cannot know that you are passing
a pointer to int or long or whatever, it's just a void * pointer. So
if you are writing bytes over long, you need to take care of adjusting
offsets to accommodate big-endian.

It's a distraction to talk about bpf_probe_read_kernel() here (but
it's useful to understand its contract).

>
> >
> > __description("probe read 4 bytes")
> > __success __retval(0x200)
> > long probe_read_4(void)
> > {
> >     int data = 0x200;
> >     long data_dst = 0;
> >     int err;
> >
> >     err = bpf_probe_read_kernel(&data_dst, 4, &data);
> >     if (err)
> >         return err;
> >
> >     return data_dst;
> > }
> >
> > SEC("syscall")
> > __description("probe read 8 bytes")
> > __success __retval(0x200)
> > long probe_read_8(void)
> > {
> >     int data = 0x200;
> >     long data_dst = 0;
> >     int err;
> >
> >     err = bpf_probe_read_kernel(&data_dst, 8, &data);
> >     if (err)
> >         return err;
> >
> >     return data_dst;
> >
> > }
> >
> > > more details in the following link:
> > > https://github.com/kernel-patches/bpf/pull/6882. It appears that
> > > bpf_probe_read_kernel() encounters difficulties when dealing with
> > > non-long-aligned source addresses.
> > >
> > > >
> > > > So getting back to iter implementation. Make sure you are
> > > > zero-initializing that u64 value you are reading into?
> > > >
> > >
> > > It has been zero-initialized:
> > >
> > > + kit->nr_bits = 0;
> > > + kit->bits_copy = 0;
> > >
> >
> > ok, then the problem is somewhere else, but it doesn't seem to be in
> > bpf_probe_read_kernel(). I'm forgetting what was the original test
> > failure for your patch set, but please double check again, taking into
> > account the big endianness of s390x.
> >
>
> If we aim to make it compatible with s390, we need to introduce some
> constraints regarding the bits iteration.
>
> 1. We must replace nr_bits with size:
>
>   bpf_iter_bits_new(struct bpf_iter_bits *it, const void
> *unsafe_ptr__ign, u32 size)
>
> 2. The size must adhere to alignment requirements:
>
>         if (size <= sizeof(u64)) {
>                 int offset = IS_ENABLED(CONFIG_S390) ? sizeof(u64) - size : 0;
>
>                 switch (size) {
>                 case 1:
>                 case 2:
>                 case 4:
>                 case 8:
>                         break;
>                 default:
>                         return -EINVAL;
>                 }
>
>                 err = bpf_probe_read_kernel_common(((char
> *)&kit->bits_copy) + offset, size, unsafe_ptr__ign);
>                 if (err)
>                         return -EFAULT;
>
>                 kit->size = size;
>                 kit->bit = -1;
>                 return 0;
>         }
>
>         /* Not long-aligned */
>         if (size & (sizeof(unsigned long) - 1))
>                 return -EINVAL;
>
>         ....
>
> Does this meet your expectations?
>

I don't think you need to add any restrictions or change anything to
be byte-sized. You just need to calculate byte offset and do a bit
shift to place N bits from the mask into upper N bits of long on
big-endian. You might need to do some masking even for little endian
as well.

Which is why I suggested investing in simple but thorough tests. Write
down a few bit mask patterns of variable length (not just
multiple-of-8 sizes) and think about the sequence of bits that should
be returned. And then codify that.

Then check that both little- and big-endian arches work correctly.

This is all a bit tricky because kernel's find_next_bit() makes the
assumption that bit masks are long-based, while this bits iterator
protocol is based on bit sizes, which are not necessarily multiples of
8 bits. So after you copy memory as bytes, you might need to clear
(and shift, for big endian) some bits. Use simple but non-symmetrical
bit masks to test everything.

> --
> Regards
>
>
>
> Yafang