On Sat, Feb 22, 2020 at 09:40:10AM +0100, Thomas Gleixner wrote: > Alexei, > > Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> writes: > > On Thu, Feb 20, 2020 at 09:45:18PM +0100, Thomas Gleixner wrote: > >> The assumption that only programs attached to perf NMI events can deadlock > >> on memory allocators is wrong. Assume the following simplified callchain: > >> */ > >> - if (prog->type == BPF_PROG_TYPE_PERF_EVENT) { > >> + if ((is_tracing_prog_type(prog->type)) { > > > > This doesn't build. > > I assumed the typo somehow sneaked in and proceeded, but it broke > > a bunch of tests: > > Summary: 1526 PASSED, 0 SKIPPED, 54 FAILED > > One can argue that the test are unsafe and broken. > > We used to test all those tests with and without prealloc: > > map_flags = 0; > > run_all_tests(); > > map_flags = BPF_F_NO_PREALLOC; > > run_all_tests(); > > Then 4 years ago commit 5aa5bd14c5f866 switched hashmap to be no_prealloc > > always and that how it stayed since then. We can adjust the tests to use > > prealloc with tracing progs, but this breakage shows that there could be plenty > > of bpf users that also use BPF_F_NO_PREALLOC with tracing. It could simply > > be because they know that their kprobes are in a safe spot (and kmalloc is ok) > > and they want to save memory. They could be using large max_entries parameter > > for worst case hash map usage, but typical load is low. In general hashtables > > don't perform well after 50%, so prealloc is wasting half of the memory. Since > > we cannot control where kprobes are placed I'm not sure what is the right fix > > here. It feels that if we proceed with this patch somebody will complain and we > > would have to revert, but I'm willing to take this risk if we cannot come up > > with an alternative fix. > > Having something which is known to be broken exposed is not a good option > either. > > Just assume that someone is investigating a kernel issue. BOFH who is > stuck in the 90's uses perf, kprobes and tracepoints. Now he goes on > vacation and the new kid in the team decides to flip that over to BPF. > So now instead of getting information he deadlocks or crashes the > machine. > > You can't just tell him, don't do that then. It's broken by design and > you really can't tell which probes are safe and which are not because > the allocator calls out into whatever functions which might look > completely unrelated. > > So one way to phase this out would be: > > if (is_tracing()) { > if (is_perf() || IS_ENABLED(RT)) > return -EINVAL; > WARN_ONCE(.....) > } > > And clearly write in the warning that this is dangerous, broken and > about to be forbidden. Hmm? Yeah. Let's start with WARN_ONCE and verbose(env, "dangerous, broken") so the users see it in the verifier log and people who maintain servers (like kernel-team-s in fb, goog, etc) see it as well in their dmesg logs. So the motivation will be on both sides. Then in few kernel releases we can flip it to disable. Or we'll find a way to make it work without pre-allocating.