On Fri, May 31, 2024 at 11:33:31AM +0200, Vlastimil Babka wrote: >Incomplete, help needed from ftrace/kprobe and bpf folks. > >As previously mentioned by myself [1] and others [2] the functions >designed for error injection can bring visible overhead in fastpaths >such as slab or page allocation, because even if nothing hooks into them >at a given moment, they are noninline function calls regardless of >CONFIG_ options since commits 4f6923fbb352 ("mm: make should_failslab >always available for fault injection") and af3b854492f3 >("mm/page_alloc.c: allow error injection"). > >Live patching their callsites has been also suggested in both [1] and >[2] threads, and this is an attempt to do that with static keys that >guard the call sites. When disabled, the error injection functions still >exist and are noinline, but are not being called. Any of the existing >mechanisms that can inject errors should make sure to enable the >respective static key. I have added that support to some of them but >need help with the others. > >- the legacy fault injection, i.e. CONFIG_FAILSLAB and > CONFIG_FAIL_PAGE_ALLOC is handled in Patch 1, and can be passed the > address of the static key if it exists. The key will be activated if the > fault injection probability becomes non-zero, and deactivated in the > opposite transition. This also removes the overhead of the evaluation > (on top of the noninline function call) when these mechanisms are > configured in the kernel but unused at the moment. > >- the generic error injection using kretprobes with > override_function_with_return is handled in patch 2. The > ALLOW_ERROR_INJECTION() annotation is extended so that static key > address can be passed, and the framework controls it when error > injection is enabled or disabled in debugfs for the function. > >There are two more users I know of but am not familiar enough to fix up >myself. I hope people that are more familiar can help me here. > >- ftrace seems to be using override_function_with_return from > #define ftrace_override_function_with_return but I found no place > where the latter is used. I assume it might be hidden behind more > macro magic? But the point is if ftrace can be instructed to act like > an error injection, it would also have to use some form of metadata > (from patch 2 presumably?) to get to the static key and control it. > > If ftrace can only observe the function being called, maybe it > wouldn't be wrong to just observe nothing if the static key isn't > enabled because nobody is doing the fault injection? > >- bpftrace, as can be seen from the example in commit 4f6923fbb352 > description. I suppose bpf is already aware what functions the > currently loaded bpf programs hook into, so that it could look up the > static key and control it. Maybe using again the metadata from patch 2, > or extending its own, as I've noticed there's e.g. BTF_ID(func, > should_failslab) > >Now I realize maybe handling this at the k(ret)probe level would be >sufficient for all cases except the legacy fault injection from Patch 1? >Also wanted to note that by AFAIU by using the static_key_slow_dec/inc >API (as done in patches 1/2) should allow all mechanisms to coexist >naturally without fighting each other on the static key state, and also >handle the reference count for e.g. active probes or bpf programs if >there's no similar internal mechanism. > >Patches 3 and 4 implement the static keys for the two mm fault injection >sites in slab and page allocators. For a quick demonstration I've run a >VM and the simple test from [1] that stresses the slab allocator and got I took a look into [1] and I see some data like "1.43% plus the overhead in its caller", but not clearly find which test cases are. Sorry for my unfamiliarity, would you mind giving more words on the cases? >this time before the series: > >real 0m8.349s >user 0m0.694s >sys 0m7.648s > >with perf showing > > 0.61% nonexistent [kernel.kallsyms] [k] should_failslab.constprop.0 > 0.00% nonexistent [kernel.kallsyms] [k] should_fail_alloc_page ▒ > >And after the series > >real 0m7.924s >user 0m0.727s >sys 0m7.191s > Maybe add the percentage here would be more helpful. >and the functions gone from perf report. > >There might be other such fault injection callsites in hotpaths of other >subsystems but I didn't search for them at this point. > >[1] https://lore.kernel.org/all/6d5bb852-8703-4abf-a52b-90816bccbd7f@xxxxxxx/ >[2] https://lore.kernel.org/all/3j5d3p22ssv7xoaghzraa7crcfih3h2qqjlhmjppbp6f42pg2t@kg7qoicog5ye/ > >Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx> >--- >Vlastimil Babka (4): > fault-inject: add support for static keys around fault injection sites > error-injection: support static keys around injectable functions > mm, slab: add static key for should_failslab() > mm, page_alloc: add static key for should_fail_alloc_page() > > include/asm-generic/error-injection.h | 13 ++++++++++- > include/asm-generic/vmlinux.lds.h | 2 +- > include/linux/error-injection.h | 9 +++++--- > include/linux/fault-inject.h | 7 +++++- > kernel/fail_function.c | 22 +++++++++++++++--- > lib/error-inject.c | 6 ++++- > lib/fault-inject.c | 43 ++++++++++++++++++++++++++++++++++- > mm/fail_page_alloc.c | 3 ++- > mm/failslab.c | 2 +- > mm/internal.h | 2 ++ > mm/page_alloc.c | 11 ++++++--- > mm/slab.h | 3 +++ > mm/slub.c | 10 +++++--- > 13 files changed, 114 insertions(+), 19 deletions(-) >--- >base-commit: 1613e604df0cd359cf2a7fbd9be7a0bcfacfabd0 >change-id: 20240530-fault-injection-statickeys-66b7222e91b7 > >Best regards, >-- >Vlastimil Babka <vbabka@xxxxxxx> > -- Wei Yang Help you, Help me