On 1/29/20 7:08 PM, Grzegorz Halat wrote: > Memory management subsystem performs various checks at runtime, > if an inconsistency is detected then such event is being logged and kernel > continues to run. While debugging such problems it is helpful to collect > memory dump as early as possible. Currently, there is no easy way to panic > kernel when such error is detected. > > It was proposed[1] to panic the kernel if panic_on_oops is set but this > approach was not accepted. One of alternative proposals was introduction of > a new sysctl. > > Add a new sysctl - panic_on_inconsistent_mm. If the sysctl is set then the > kernel will be crashed when an inconsistency is detected by memory > management. This currently means panic when bad page or bad PTE > is detected(this may be extended to other places in MM). I wonder, should enabling the sysctl also effectively convert VM_WARN... to VM_BUG... ? > Another use case of this sysctl may be in security-wise environments, > it may be more desired to crash machine than continue to run with > potentially damaged data structures. > > Changes since v1 [2]: > - rename the sysctl to panic_on_inconsistent_mm > - move the sysctl from kernel to vm table > - print modules in print_bad_pte() only before calling panic > > [1] https://lore.kernel.org/linux-mm/1426495021-6408-1-git-send-email-borntraeger@xxxxxxxxxx/ > [2] https://lore.kernel.org/lkml/20200127101100.92588-1-ghalat@xxxxxxxxxx/ > > Signed-off-by: Grzegorz Halat <ghalat@xxxxxxxxxx> > --- > Documentation/admin-guide/sysctl/vm.rst | 14 ++++++++++++++ > include/linux/kernel.h | 1 + > kernel/sysctl.c | 9 +++++++++ > mm/memory.c | 8 ++++++++ > mm/page_alloc.c | 4 +++- > 5 files changed, 35 insertions(+), 1 deletion(-) > > diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst > index 64aeee1009ca..57f7926a64b8 100644 > --- a/Documentation/admin-guide/sysctl/vm.rst > +++ b/Documentation/admin-guide/sysctl/vm.rst > @@ -61,6 +61,7 @@ Currently, these files are in /proc/sys/vm: > - overcommit_memory > - overcommit_ratio > - page-cluster > +- panic_on_inconsistent_mm > - panic_on_oom > - percpu_pagelist_fraction > - stat_interval > @@ -741,6 +742,19 @@ extra faults and I/O delays for following faults if they would have been part of > that consecutive pages readahead would have brought in. > > > +panic_on_inconsistent_mm > +======================== > + > +Controls the kernel's behaviour when inconsistency is detected > +by memory management code, for example bad page state or bad PTE. > + > +0: try to continue operation. > + > +1: panic immediately. > + > +The default value is 0. > + > + > panic_on_oom > ============ > > diff --git a/include/linux/kernel.h b/include/linux/kernel.h > index 0d9db2a14f44..b3bd94c558ab 100644 > --- a/include/linux/kernel.h > +++ b/include/linux/kernel.h > @@ -518,6 +518,7 @@ extern int oops_in_progress; /* If set, an oops, panic(), BUG() or die() is in > extern int panic_timeout; > extern unsigned long panic_print; > extern int panic_on_oops; > +extern int panic_on_inconsistent_mm; > extern int panic_on_unrecovered_nmi; > extern int panic_on_io_nmi; > extern int panic_on_warn; > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 70665934d53e..a9733311e3a1 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -1303,6 +1303,15 @@ static struct ctl_table vm_table[] = { > .extra1 = SYSCTL_ZERO, > .extra2 = &two, > }, > + { > + .procname = "panic_on_inconsistent_mm", > + .data = &panic_on_inconsistent_mm, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = SYSCTL_ZERO, > + .extra2 = SYSCTL_ONE, > + }, > { > .procname = "panic_on_oom", > .data = &sysctl_panic_on_oom, > diff --git a/mm/memory.c b/mm/memory.c > index 45442d9a4f52..b29a18077a6a 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -71,6 +71,7 @@ > #include <linux/dax.h> > #include <linux/oom.h> > #include <linux/numa.h> > +#include <linux/module.h> > > #include <trace/events/kmem.h> > > @@ -88,6 +89,8 @@ > #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. > #endif > > +int panic_on_inconsistent_mm __read_mostly; > + > #ifndef CONFIG_NEED_MULTIPLE_NODES > /* use the per-pgdat data instead for discontigmem - mbligh */ > unsigned long max_mapnr; > @@ -543,6 +546,11 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, > vma->vm_ops ? vma->vm_ops->fault : NULL, > vma->vm_file ? vma->vm_file->f_op->mmap : NULL, > mapping ? mapping->a_ops->readpage : NULL); > + > + if (panic_on_inconsistent_mm) { > + print_modules(); > + panic("Bad page map detected"); > + } > dump_stack(); > add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); > } > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index d047bf7d8fd4..a20cd3ece5ba 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -643,9 +643,11 @@ static void bad_page(struct page *page, const char *reason, > if (bad_flags) > pr_alert("bad because of flags: %#lx(%pGp)\n", > bad_flags, &bad_flags); > - dump_page_owner(page); > > + dump_page_owner(page); > print_modules(); > + if (panic_on_inconsistent_mm) > + panic("Bad page state detected"); > dump_stack(); > out: > /* Leave bad fields for debug, except PageBuddy could make trouble */ >