On Tue, 2019-01-29 at 19:46 +0000, Christopher Lameter wrote:
> On Tue, 29 Jan 2019, Miles Chen wrote:
>
> > a) classic slub issue. e.g., use-after-free, redzone overwritten. It's
> > more efficient to report a issue as soon as slub detects it. (comparing
> > to monitor the log, set a breakpoint, and re-produce the issue). With
> > the coredump file, we can analyze the issue.
>
> What usually happens is that the systems fails with a strange error
> message. Then the system is rebooted using slub_debug options and the
> issue is reproduced yielding more information about the problem.
>
> Then you run the scenario again with additional debugging in the subsystem
> that caused the problem.

Thanks for your comments and patience. I now understand the difference
between us. I usually enable CONFIG_SLUB_DEBUG=y and CONFIG_SLUB_DEBUG_ON=y,
set slub_debug by default, and run all tests that way (eng mode). I do not
hit an issue first and then set up slub_debug to reproduce it.
CONFIG_SLUB_DEBUG is disabled for products.

> So you are already reproducing the issue because you need to activate
> debugging to get more information. Doing it for the 3rd time is not that
> much more difficult.
>
> None of your modifications will be active in a production kernel.
> slub_debug must be activated to use it and thus you are already
> reproducing the issue.
>
> > b) memory corruption issues caused by h/w write. e.g., memory
> > overwritten by a DMA engine. Memory corruptions may or may not related
> > to the slab cache that reports any error. For example: kmalloc-256 or
> > dentry may report the same errors. If we can preserve the the coredump
> > file without any restore/reset processing in slub, we could have more
> > information of this memory corruption.
>
> If debugging is active then reporting will include the accurate slab cache
> affected. The memory layout is already changing when you enable the
> existing debugging code. None of your code runs without that and thus is
> cannot add a coredump for the prod case without debugging.

I usually set slub_debug by default and get the coredump file.

> > c) memory corruption issues caused by unstable h/w. e.g., bit flipping
> > because of xxxx DRAM die or applying new power settings. It's hard to
> > re-produce this kind of issue and it much easier to tell this kind of
> > issue in the coredump file without any restore/reset processing.
>
> But then you patch does not help in this situation because the code has to
> be enabled by special slub debug options.
>
> >
> > Users can set the option by slub_debug. We can still have the original
> > behavior(keep the system alive) if the option is not set. We can turn on
> > the option when we need the coredump file. (with panic_on_warn is set,
> > of course).
>
> I think we would need to turn on debugging by default and have your patch
> for this to make sense. We already reproducing the issue multiple times
> for debugging. This patch does not change that.
>

Yes. I turn on debugging by default. Does that make sense now?

Thanks again for your comments.