Re: [PATCH 1/2 v2] kprobe: Do not use uaccess functions to access kernel memory that can fault

Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> · Fri, 22 Feb 2019 18:28:53 -0800

On Fri, Feb 22, 2019 at 04:08:59PM -0800, Linus Torvalds wrote:
> On Fri, Feb 22, 2019 at 3:56 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > It will preserve existing bpf_probe_read() behavior on x86.
> 
> ... but that's the worst possible situation.
> 
> It appears that people haven't understood that kernel and user
> addresses are distinct, and may have written programs that are
> fundamentally buggy.
> 
> And we _want_ to make it clear that they are buggy on x86-64, exactly
> because x86-64 is the one that gets the most testing - by far.
> 
> So if x86-64 continues working for buggy programs, then that only
> means that those bugs never get fixed.
> 
> It would be much better to try to get those things fixed, and make the
> x86-64 implementation stricter, exactly so that people end up
> _realizing_ that they can't just think "a pointer is a pointer, and
> the context doesn't matter".
>
> From a pure functional safety standpoint, I thought bpf already knew
> what kind of a pointer it had?

when bpf verifier knows the type of pointer it allows direct access to it.
That's the case for skb, socket, packet data, hash maps, arrays, stack, etc.
Networking progs cannot call bpf_probe_read().
It's available to tracing progs only and their goal is to walk the kernel and
user memory with addresses that cannot be statically verified
at program load time.
We are working on adding type information (BTF) to vmlinux.
Soon we'll be able to tell the type of every kernel function argument.
Right now arg1 and arg2 in a kprobed function are just u64 pt_regs->di, si.
Soon we'll be able to precisely identify their C type.

I completely agree on the direction that x86 is the architecture that
sets an example and users need to learn the difference in pointers.
I only disagree on timing.
Right now users don't have an alternative.
In our repo I counted ~400 calls to bpf_probe_read and about 10 times more
'indirect' calls. What's happening with 'indirect' calls is BCC toolchain
using clang to automatically replace struct_a->field_foo access with
bpf_probe_read(struct_a + offsetof(typeof(struct_a), field_foo));

If we had __user vs __kernel annotation available to clang we could have
taught BCC to replace this '->' dereference with appropriate kernel vs user
helper. Also we need to teach GCC to recognize __user and store into dwarf,
so we can take it further into BTF and verify later.

Also I think disallowing bpf_probe_read() to read user pointer will not
make a desired teaching effect. It will only cause painful debugging to folks
when their progs will stop working. It's better to remove bpf_probe_read()
completely.
imo the process of teaching the users of kernel vs user pointer difference
needs to be gradual.
First we introduce bpf_probe_kernel_read and bpf_probe_user_read and
introduce clang/gcc tooling to catch the mistakes.
Going over this 400+ places and manually grepping kernel sources
for __user keyword is not a great proposal if we want to keep those users.
Once we have this working we can remove bpf_probe_read() altogether.
Rejecting bpf prog at load time is a clear signal that user has to fix it
(instead of changing run-time behavior).
When the verifier gets even smarter it could potentially replace prob_read
with probe_kernel_read and probe_user_read when it has that type info.

imo this kernel release should finish as-is and in the next cycle we can
change probe_kernel_read() to reject user address, have temporary
workaround in bpf_probe_read() with probe_kernel_read+get_user hack,
introduce new bpf helpers, new tooling and eventually remove
buggy bpf_probe_read.