Re: Page faults in tracepoint caused by aliased pointer

Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> · Tue, 13 Feb 2024 01:33:59 +0100

On Tue, 13 Feb 2024 at 01:21, Yan Zhai <yan@xxxxxxxxxxxxxx> wrote:
>
> On Mon, Feb 12, 2024 at 5:52 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Mon, Feb 12, 2024 at 3:42 PM Kumar Kartikeya Dwivedi
> > <memxor@xxxxxxxxx> wrote:
> > >
> > > On Tue, 13 Feb 2024 at 00:34, Alexei Starovoitov
> > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, Feb 12, 2024 at 3:16 PM Ignat Korchagin <ignat@xxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > [288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted 6.6.16+ #10
> > > >
> > > > ...
> > > > > [288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
> > > > > [288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
> > > > >
> > > > > And Jakub (CCed here) reproduced it on 6.8.0-rc2+
> > > >
> > > > I suspect something is broken in your kernels.
> > > > The trace above is in the generic copy_from_kernel_nofault(),
> > > > so one should be able to crash the kernel without any bpf.
> > > >
> > > > We have this in selftests/bpf:
> > > > __weak noinline struct file *bpf_testmod_return_ptr(int arg)
> > > > {
> > > >         static struct file f = {};
> > > >
> > > >         switch (arg) {
> > > >         case 1: return (void *)EINVAL;          /* user addr */
> > > >         case 2: return (void *)0xcafe4a11;      /* user addr */
> > > >         case 3: return (void *)-EINVAL;         /* canonical, but invalid */
> > > >         case 4: return (void *)(1ull << 60);    /* non-canonical and invalid */
> > > >         case 5: return (void *)~(1ull << 30);   /* trigger extable */
> > > >         case 6: return &f;                      /* valid addr */
> > > >         case 7: return (void *)((long)&f | 1);  /* kernel tricks */
> > > >         default: return NULL;
> > > >         }
> > > > }
> > > > where we check that the extables set up by the JIT for bpf progs work correctly.
> > > > You should see the kernel crash just by running the bpf selftests.
> > >
> > > I agree, this appears unrelated to BPF since it is happening when
> > > using copy_from_kernel_nofault (which should jump to the Efault
> > > label instead of oopsing), but I think it's not specific to some
> > > custom kernel. I can reproduce it on my dev machine on top of bpf-next
> > > as well, and another machine with Ubuntu's generic 6.5 kernel for
> > > 24.04. And I think Ignat tried it on the mainline 6.8-rc2 as well.
> >
> copy_from_kernel_nofault is called in Jakub's reproducer, but in the
> panic we hit in production, the JITed code dumped by bpftool seems to
> do a direct memory access. Will faults from such instructions also be
> caught correctly?
>

Yep, since faults in both cases end up in the page fault handler.
Once the fix pointed out by Alexei is applied, it should address both scenarios.

> Yan
>
> > Then it must be the vsyscall address issue that this series fixes:
> > https://patchwork.kernel.org/project/netdevbpf/patch/20240202103935.3154011-3-houtao@xxxxxxxxxxxxxxx/
> >
> > We're still waiting on x86 maintainers to ack them.