On Thu, 11 Aug 2022 at 17:10, Imran Khan <imran.f.khan@xxxxxxxxxx> wrote:
>
> Hello Marco,
>
> On 11/8/22 11:21 pm, Marco Elver wrote:
> > On Thu, 11 Aug 2022 at 12:07, <vbabka@xxxxxxx> wrote:
> > [...]
> >>> new flag SLAB_SKIP_KFENCE, it also can serve a dual purpose, where
> >>> someone might want to explicitly opt out by default and pass it to
> >>> kmem_cache_create() (for whatever reason; not that we'd encourage
> >>> that).
> >>
> >> Right, not being able to do that would be a downside (although it
> >> should be possible even with opt-in to add an opt-out cache flag that
> >> would just make sure the opt-in flag is not set even if eligible by
> >> global defaults).
> >
> > True, but I'd avoid all this unnecessary complexity if possible.
> >
> >>> I feel that the real use cases for selectively enabling caches for
> >>> KFENCE are very narrow, and a design that introduces lots of
> >>> complexity elsewhere, just to support this feature, cannot be
> >>> justified (which is why I suggested the simpler design here back in
> >>> https://lore.kernel.org/lkml/CANpmjNNmD9z7oRqSaP72m90kWL7jYH+cxNAZEGpJP8oLrDV-vw@xxxxxxxxxxxxxx/)
> >>
> >> I don't mind strongly either way, just a suggestion to consider.
> >
> > While switching the semantics of the flag from opt-out to opt-in is
> > just as valid, I'm more comfortable with the opt-out flag: the rest of
> > the logic can stay the same, and we're aware of the fact that changing
> > cache coverage by KFENCE shouldn't be something that needs to be done
> > manually.
> >
> > My main point is that opting out of (or in to) only a few select
> > caches should be a rarely used feature, and accordingly it should be
> > as simple as possible. Honestly, I still don't quite see the point of
> > it, and my solution would be to just increase the KFENCE pool,
> > increase the sample rate, or decrease the "skip covered threshold%".
> > But in the case described by Imran, perhaps a running machine is
> > having trouble, and limiting the caches to be analyzed by KFENCE might
> > be worthwhile if a more aggressive configuration doesn't yield
> > anything (and then there's of course KASAN, but I recognize it's not
> > always possible to switch kernels and run the same workload with it).
> >
> > The use case for the proposed change is definitely when an admin or
> > kernel dev is starting to debug a problem. KFENCE wasn't designed for
> > that (vs. deployment at scale, discovery of bugs). As such, I'm having
> > a hard time seeing how useful this feature will really be, but given
> > that the current implementation is simple, having it might actually
> > help a few people.
> >
> > Imran, just to make sure my assumptions here are right, have you had
> > success debugging an issue in this way? Can you elaborate on what
> > "certain debugging scenarios" you mean (an admin debugging something,
> > or a kernel dev, production fleet, or test machine)?
> >
>
> I have not used KFENCE in this way because, as of now, we don't have
> such newer kernels in the production fleet, but I can cite a couple of
> instances where using slub_debug for a few selected slabs helped me
> locate an issue on a production system where KASAN or even full
> slub_debug was not feasible. Apologies in advance if I am elaborating
> more than you asked for :).

This is very useful to understand the use case.
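(As an aside, for readers of the archive: "slub_debug for a few selected
slabs" refers to the per-cache form of the boot parameter. A sketch of
what that looks like, using the kmalloc-32 cache from the first case
below:

    slub_debug=FZPU,kmalloc-32

i.e. sanity checks (F), red zoning (Z), poisoning (P), and user tracking
(U), but only for the named cache rather than system-wide.)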
> In one case a freed struct mutex was being used later on, and by that
> time the same address had been given to a kmalloc-32 object. The issue
> was appearing more frequently if one enforced some cgroup memory
> limitation, resulting in the fork of a task exiting prematurely. From
> the vmcore we could see that the mutex, or more specifically
> task_struct.futex_exit_mutex, was in bad shape, and eventually using
> slub_debug for kmalloc-32 pointed to the issue.
>
> Another case involved a mem_cgroup corruption which was causing a
> system crash but was giving list corruption warnings beforehand. Since
> the list corruption warnings were coming from the cgroup subsystem, the
> corresponding objects were in doubt. Enabling slub_debug for kmalloc-4k
> helped in locating the actual corruption.
>
> Admittedly, both of the above issues were the result of backporting
> mistakes, but nonetheless they happened on production systems where
> very few debugging options were available.
>
> By "certain debugging scenarios" I meant such cases where some initial
> data (from the production fleet) like a vmcore or kernel debug messages
> can give some pointer towards which slab objects could be wrong, and
> then we would use this feature (along with further tuning like
> increasing the sampling frequency or pool size, if needed/possible) to
> pinpoint the actual issue. The idea is that limiting KFENCE to a few
> slabs will increase the probability of catching the issue even if we
> are not able to tweak the pool size.
>
> Please let me know if it sounds reasonable or if I missed something
> from your query.

Thanks for the elaboration on use cases - agreed that in a few scenarios
this feature can help increase the probability of debugging an issue.

Reviewed-by: Marco Elver <elver@xxxxxxxxxx>

With minor suggestions:

> +SLAB_ATTR(skip_kfence);
> +

^ Unnecessary blank line between SLAB_ATTR and #endif.

> +#endif
> +

And the patch title should be something like "kfence: add sysfs
interface to disable kfence for selected slabs" (to follow the format
"<subsys>: <thing changed>").
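P.S.: To make the intended workflow concrete, usage would presumably
look something like the following; the attribute name and sysfs path
follow this patch and SLUB's /sys/kernel/slab layout, and kmalloc-32 is
just an example cache:

    # Opt a cache out of KFENCE coverage at runtime:
    echo 1 > /sys/kernel/slab/kmalloc-32/skip_kfence

Since the flag is opt-out, repeating this for every cache *except* the
suspect ones effectively limits KFENCE to the latter - possibly combined
with the more aggressive KFENCE configuration mentioned above, e.g. on
the kernel command line (sample interval in milliseconds, coverage skip
threshold in percent):

    kfence.sample_interval=10 kfence.skip_covered_thresh=50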