On June 10, 2024 11:20:21 AM PDT, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: >On Mon, 10 Jun 2024 at 05:02, Borislav Petkov <bp@xxxxxxxxx> wrote: >> >> I think we should accept patches using this only when there really is >> a good, perf reason for doing so. Not "I wanna use this fance shite in >> my new driver just because...". > >Absolutely. > >So for example, if the code could possibly be a module, it's never >going to be able to use runtime constants. > >If the code doesn't show up as "noticeable percentage of kernel time >on real loads", it will not be a valid use for runtime constants. > >The reason I did __d_lookup_rcu() is that I have optimized that >function to hell and back before, and it *still* showed up at 14% of >kernel time on my "empty kernel build" benchmark. And the constant >load was a noticeable - but not dominant - part of that. > >And yes, it shows up that high because it's all D$ misses, and the >machine I tested on has more CPU cores than cache, so it's all kinds >of broken. But the point ends up being that __d_lookup_rcu() is just >very very hot on loads that just do a lot of 'stat()' calls (and such >loads exist and aren't just microbenchmarks). > >I have other functions I see in the 5%+ range of kernel overhead on >real machines, but they tend to be things like clear_page(), which is >another kind of issue entirely. > >And yes, the benchmarks I run are odd ("why would anybody care about >an empty kernel build?") but somewhat real to me (since I do builds >between every pull even when they just change a couple of files). > >And yes, to actually even see anything else than the CPU security >issues on x86, you need to build without debug support, and without >retpolines etc. So my profiles are "fake" in that sense, because they >are the best case profiles without a lot of the horror that people >enable. > >Others will have other real benchmarks, which is why I do think we'd >end up with more uses of this. But I would expect a handful, not >"hundreds". > >I could imagine some runtime constant in the core networking socket >code, for example. Or in some scheduler thing. Or kernel entry code. > >But not ever in a driver or a filesystem, for example. Once you've >gotten that far off the core code path, the "load a variable" overhead >just isn't relevant any more. > > Linus So I would also strongly suggest that we make the code fault if it is executed unpatched if there is no fallback. My original motivation for making the code as complex as I did was to handle both the init and noninit cases, and to handle things other than mov (for the mov case, it is quite trivial to allow for either a memory reference or an immediate, of course, but I was trying to optimize all kinds of operations, of which shifts were by far the worst. Almost certainly a mistake on my part.) For that case it also works just fine in modules, since alternatives does the patching there already. The way I did it was to introduce a new alternative patching type (a concept that didn't exist yet, and again I made the mistake of not simply doing that part separately first) using the address of the variable in question (in the ro_after_init segment) as the alternatives source buffer. Much has changed since I hacked on this, a lot of which makes this whole concept *much* easier to implement. We could even make it completely transparent at the programmer level by using objtool or a linker plugin to detect loads from a certain segment and tag them for alternatives patching (which would, however, require adding padding to some kinds of instructions, notably arbitrary 64-bit immediates.) The one case that gets really nasty for automatic conversation is the case of an arbitrary 64-bit value used in a computation, like: __patchable long b; long a; /* ... */ a += b; The compiler can generate: addq b(%rip),%rax ... which would require a temporary register to convert to an immediate. On the other hand: __patchable int b; long a; /* ... */ a += b; ... would generate ... movslq b(%rip),%rbx addq %rbx,%rax ... which can be compacted down to a single instruction: addq $bimm,%rax ... so *really* optimizing this could be rather tricky without some kind of compiler support (native or via plug-in) even. The most likely option would be to generate object code as if immediate instructions are used, but with relocations that indicate "value of" rather than "address of" the variable being referenced. Postprocessing either at the assembly level or object code level (i.e. objtool) can then substitute the prepatch sequence and alternatives metadata. (If it is a custom plug-in then postprocessing is presumably not necessary, of course.)