Re: [PATCH bpf-next v6 0/2] bpf: Add a generic bits iterator

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Apr 30, 2024 at 12:28 AM Andrii Nakryiko
<andrii.nakryiko@xxxxxxxxx> wrote:
>
> On Sun, Apr 28, 2024 at 6:47 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> >
> > On Sat, Apr 27, 2024 at 12:51 AM Andrii Nakryiko
> > <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > On Thu, Apr 25, 2024 at 10:05 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, Apr 26, 2024 at 2:15 AM Andrii Nakryiko
> > > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Apr 24, 2024 at 10:37 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > > >
> > > > > > On Thu, Apr 25, 2024 at 8:34 AM Andrii Nakryiko
> > > > > > <andrii.nakryiko@xxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Thu, Apr 11, 2024 at 6:51 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > On Thu, Apr 11, 2024 at 9:11 PM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > Three new kfuncs, namely bpf_iter_bits_{new,next,destroy}, have been
> > > > > > > > > added for the new bpf_iter_bits functionality. These kfuncs enable the
> > > > > > > > > iteration of the bits from a given address and a given number of bits.
> > > > > > > > >
> > > > > > > > > - bpf_iter_bits_new
> > > > > > > > >   Initialize a new bits iterator for a given memory area. Due to the
> > > > > > > > >   limitation of bpf memalloc, the max number of bits to be iterated
> > > > > > > > >   over is (4096 * 8).
> > > > > > > > > - bpf_iter_bits_next
> > > > > > > > >   Get the next bit in a bpf_iter_bits
> > > > > > > > > - bpf_iter_bits_destroy
> > > > > > > > >   Destroy a bpf_iter_bits
> > > > > > > > >
> > > > > > > > > The bits iterator can be used in any context and on any address.
> > > > > > > > >
> > > > > > > > > Changes:
> > > > > > > > > - v5->v6:
> > > > > > > > >   - Add positive tests (Andrii)
> > > > > > > > > - v4->v5:
> > > > > > > > >   - Simplify test cases (Andrii)
> > > > > > > > > - v3->v4:
> > > > > > > > >   - Fix endianness error on s390x (Andrii)
> > > > > > > > >   - zero-initialize kit->bits_copy and zero out nr_bits (Andrii)
> > > > > > > > > - v2->v3:
> > > > > > > > >   - Optimization for u64/u32 mask (Andrii)
> > > > > > > > > - v1->v2:
> > > > > > > > >   - Simplify the CPU number verification code to avoid the failure on s390x
> > > > > > > > >     (Eduard)
> > > > > > > > > - bpf: Add bpf_iter_cpumask
> > > > > > > > >   https://lwn.net/Articles/961104/
> > > > > > > > > - bpf: Add new bpf helper bpf_for_each_cpu
> > > > > > > > >   https://lwn.net/Articles/939939/
> > > > > > > > >
> > > > > > > > > Yafang Shao (2):
> > > > > > > > >   bpf: Add bits iterator
> > > > > > > > >   selftests/bpf: Add selftest for bits iter
> > > > > > > > >
> > > > > > > > >  kernel/bpf/helpers.c                          | 120 +++++++++++++++++
> > > > > > > > >  .../selftests/bpf/prog_tests/verifier.c       |   2 +
> > > > > > > > >  .../selftests/bpf/progs/verifier_bits_iter.c  | 127 ++++++++++++++++++
> > > > > > > > >  3 files changed, 249 insertions(+)
> > > > > > > > >  create mode 100644 tools/testing/selftests/bpf/progs/verifier_bits_iter.c
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > 2.39.1
> > > > > > > > >
> > > > > > > >
> > > > > > > > It appears that the test case failed on s390x when the data is
> > > > > > > > a u32 value because we need to set the higher 32 bits.
> > > > > > > > will analyze it.
> > > > > > > >
> > > > > > >
> > > > > > > Hey Yafang, did you get a chance to debug and fix the issue?
> > > > > > >
> > > > > >
> > > > > > Hi Andrii,
> > > > > >
> > > > > > Apologies for the delay; I recently returned from an extended holiday.
> > > > > >
> > > > > > The issue stems from the limitations of bpf_probe_read_kernel() on
> > > > > > s390 architecture. The attachment provides a straightforward example
> > > > > > to illustrate this issue. The observed results are as follows:
> > > > > >
> > > > > >     Error: #463/1 verifier_probe_read/probe read 4 bytes
> > > > > >     8897 run_subtest:PASS:obj_open_mem 0 nsec
> > > > > >     8898 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > > > >     8899 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > > > >     8900 run_subtest:FAIL:659 Unexpected retval: 2817064 != 512
> > > > > >
> > > > > >     Error: #463/2 verifier_probe_read/probe read 8 bytes
> > > > > >     8903 run_subtest:PASS:obj_open_mem 0 nsec
> > > > > >     8904 run_subtest:PASS:unexpected_load_failure 0 nsec
> > > > > >     8905 do_prog_test_run:PASS:bpf_prog_test_run 0 nsec
> > > > > >     8906 run_subtest:FAIL:659 Unexpected retval: 0 != 512
> > > > > >
> > > > > > More details can be found at:  https://github.com/kernel-patches/bpf/pull/6872
> > > > > >
> > > > > > Should we consider this behavior of bpf_probe_read_kernel() as
> > > > > > expected, or is it something that requires fixing?
> > > > > >
> > > > >
> > > > > I might be missing something, but there is nothing wrong with
> > > > > bpf_probe_read_kernel() behavior. In "read 4" case you are overwriting
> > > > > only upper 4 bytes of u64, so lower 4 bytes are garbage. In "read 8"
> > > > > you are reading (upper) 4 bytes of garbage from uninitialized
> > > > > data_dst.
> > > >
> > > > The issue doesn't lie with the dst but rather with the src. Even after
> > > > initializing the destination, the operation still fails. You can find
> > >
> > > Are you sure the operation "fails"? If it would fail, you'd get a
> > > negative error code, but you are getting zero. Which actually makes
> > > sense.
> > >
> > > I think you are just getting confused by big endianness of s390x, and
> > > there is nothing wrong with bpf_probe_read_kernel().
> > >
> > > In both of your tests (I pasted your code below, it would be better if
> > > you did it in your initial emails) you end up with 0x200 in *upper* 32
> > > bits (on big endian) and lower bits are zeros. And __retval thing is
> > > 32-bit (despite BPF program returning long), so this return value is
> > > truncated to *lower* 32-bits, which are, expectedly, zeroes.
> >
> > Thank you for clarifying. The presence of the 32-bit __retval led to
> > my misunderstanding :(
> >
> > >
> > > So I think everything works as expected, but your tests (at least)
> > > don't handle the big-endian arch well.
> >
> > The issue arises when the dst and src have different sizes, causing
> > bpf_probe_read_kernel_common() to handle them poorly on big-endian
> > machines. To address this, we need to calculate the offset for
> > copying, as demonstrated by the following
> >
> >    bpf_probe_read_kernel_common(&kit->bits_copy + offset, size,
> > unsafe_ptr__ign);
> >
> > One might wonder why this calculation is not incorporated directly
> > into the implementation of bpf_probe_read_kernel_common() ?
>
> Let's stop talking about
> bpf_probe_read_kernel/bpf_probe_read_kernel_common doing anything
> wrong or not handling anything wrong. It's not, it's correct. Your
> code is *using* it wrong, and that's what we'll have to fix. The
> contract of bpf_probe_read_kernel is in terms of memory addresses of
> source/destination *bytes* and the *common* size of both source and
> destination. bpf_probe_read_kernel() cannot know that you are passing
> a pointer to int or long or whatever, it's just a void * pointer. So
> if you are writing bytes over long, you need to take care of adjusting
> offsets to accommodate big-endian.
>
> It's a distraction to talk about bpf_probe_read_kernel() here (but
> it's useful to understand its contract).
>
> >
> > >
> > > __description("probe read 4 bytes")
> > > __success __retval(0x200)
> > > long probe_read_4(void)
> > > {
> > >     int data = 0x200;
> > >     long data_dst = 0;
> > >     int err;
> > >
> > >     err = bpf_probe_read_kernel(&data_dst, 4, &data);
> > >     if (err)
> > >         return err;
> > >
> > >     return data_dst;
> > > }
> > >
> > > SEC("syscall")
> > > __description("probe read 8 bytes")
> > > __success __retval(0x200)
> > > long probe_read_8(void)
> > > {
> > >     int data = 0x200;
> > >     long data_dst = 0;
> > >     int err;
> > >
> > >     err = bpf_probe_read_kernel(&data_dst, 8, &data);
> > >     if (err)
> > >         return err;
> > >
> > >     return data_dst;
> > >
> > > }
> > >
> > > > more details in the following link:
> > > > https://github.com/kernel-patches/bpf/pull/6882. It appears that
> > > > bpf_probe_read_kernel() encounters difficulties when dealing with
> > > > non-long-aligned source addresses.
> > > >
> > > > >
> > > > > So getting back to iter implementation. Make sure you are
> > > > > zero-initializing that u64 value you are reading into?
> > > > >
> > > >
> > > > It has been zero-initialized:
> > > >
> > > > + kit->nr_bits = 0;
> > > > + kit->bits_copy = 0;
> > > >
> > >
> > > ok, then the problem is somewhere else, but it doesn't seem to be in
> > > bpf_probe_read_kernel(). I'm forgetting what was the original test
> > > failure for your patch set, but please double check again, taking into
> > > account the big endianness of s390x.
> > >
> >
> > If we aim to make it compatible with s390, we need to introduce some
> > constraints regarding the bits iteration.
> >
> > 1. We must replace nr_bits with size:
> >
> >   bpf_iter_bits_new(struct bpf_iter_bits *it, const void
> > *unsafe_ptr__ign, u32 size)
> >
> > 2. The size must adhere to alignment requirements:
> >
> >         if (size <= sizeof(u64)) {
> >                 int offset = IS_ENABLED(CONFIG_S390) ? sizeof(u64) - size : 0;
> >
> >                 switch (size) {
> >                 case 1:
> >                 case 2:
> >                 case 4:
> >                 case 8:
> >                         break;
> >                 default:
> >                         return -EINVAL;
> >                 }
> >
> >                 err = bpf_probe_read_kernel_common(((char
> > *)&kit->bits_copy) + offset, size, unsafe_ptr__ign);
> >                 if (err)
> >                         return -EFAULT;
> >
> >                 kit->size = size;
> >                 kit->bit = -1;
> >                 return 0;
> >         }
> >
> >         /* Not long-aligned */
> >         if (size & (sizeof(unsigned long) - 1))
> >                 return -EINVAL;
> >
> >         ....
> >
> > Does this meet your expectations?
> >
>
> I don't think you need to add any restrictions or change anything to
> be byte-sized. You just need to calculate byte offset and do a bit
> shift to place N bits from the mask into upper N bits of long on
> big-endian. You might need to do some masking even for little endian
> as well.
>
> Which is why I suggested investing in simple but thorough tests. Write
> down a few bit mask patterns of variable length (not just
> multiple-of-8 sizes) and think about the sequence of bits that should
> be returned. And then codify that.
>
> Then check that both little- and big-endian arches work correctly.
>
> This is all a bit tricky because kernel's find_next_bit() makes the
> assumption that bit masks are long-based, while this bits iterator
> protocol is based on bit sizes, which are not necessarily multiples of
> 8 bits. So after you copy memory as bytes, you might need to clear
> (and shift, for big endian) some bits. Use simple but non-symmetrical
> bit masks to test everything.
>

The reason for using 'size' instead of 'nr_bits' lies in the nature of
the pointer 'unsafe_ptr__ign', which is of type void *. Due to this
ambiguity in the type, the 'size' parameter serves to indicate the
size of the data being accessed. For instance, if the type is u32,
then the 'size' parameter corresponds to sizeof(u32)

    u32 src = 0x100;
    bpf_for_each(bits, bit, &src, sizeof(u32));

This approach shields the user of the bits iterator from concerns
about endianness, simplifying usage.

Conversely, if we were to use 'nr_bits', the user would need to
account for endianness themselves. In other words, the user would be
responsible for calculating the offset of 'src'.For instance, on a
big-endian machine, if the 'src' is of type u64 and the user only want
to iterate over 32 bits, the code would appear as follows:

    u64 src = 0x100;
    bpf_for_each(bits, bit, ((void *)&src) + (sizeof(u64) - (32 + 7) / 8), 32);

It may indeed be less user-friendly. However, I can make the
adjustment as you suggested, given that it aligns with the pattern
observed in bpf_probe_read_kernel_common().

--
Regards
Yafang





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux