Re: [PATCH] exit: Put an upper limit on how often we can oops

Solar Designer <solar@xxxxxxxxxxxx> · Wed, 9 Nov 2022 17:19:04 +0100

On Tue, Nov 08, 2022 at 11:38:22AM -0800, Kees Cook wrote:
> On Tue, Nov 08, 2022 at 09:24:40AM -0800, Kees Cook wrote:
> > On Mon, Nov 07, 2022 at 10:48:20PM +0100, Jann Horn wrote:
> > > On Mon, Nov 7, 2022 at 10:15 PM Solar Designer <solar@xxxxxxxxxxxx> wrote:
> > > > On Mon, Nov 07, 2022 at 09:13:17PM +0100, Jann Horn wrote:
> > > > > +oops_limit
> > > > > +==========
> > > > > +
> > > > > +Number of kernel oopses after which the kernel should panic when
> > > > > +``panic_on_oops`` is not set.
> > > >
> > > > Rather than introduce this separate oops_limit, how about making
> > > > panic_on_oops (and maybe all panic_on_*) take the limit value(s) instead
> > > > of being Boolean?  I think this would preserve the current behavior at
> > > > panic_on_oops = 0 and panic_on_oops = 1, but would introduce your
> > > > desired behavior at panic_on_oops = 10000.  We can make 10000 the new
> > > > default.  If a distro overrides panic_on_oops, it probably sets it to 1
> > > > like RHEL does.
> > > >
> > > > Are there distros explicitly setting panic_on_oops to 0?  If so, that
> > > > could be a reason to introduce the separate oops_limit.
> > > >
> > > > I'm not advocating one way or the other - I just felt this should be
> > > > explicitly mentioned and decided on.
> > > 
> > > I think at least internally in the kernel, it probably works better to
> > > keep those two concepts separate? For example, sparc has a function
> > > die_nmi() that uses panic_on_oops to determine whether the system
> > > should panic when a watchdog detects a lockup.
> > 
> > Internally, yes, the kernel should keep "panic_on_oops" to mean "panic
> > _NOW_ on oops?" but I would agree with Solar -- this is a counter as far
> > as userspace is concerned. "Panic on Oops" after 1 oops, 2 oopses, etc.
> > I would like to see this for panic_on_warn too, actually.
> 
> Hm, in looking at this more closely, I think it does make sense as you
> already have it. The count is for the panic_on_oops=0 case, so even in
> userspace, trying to remap that doesn't make a bunch of sense. So, yes,
> let's keep this as-is.

I don't follow your logic there - maybe you got confused?  Yes, as
proposed, the count applies when panic_on_oops=0, but that's just weird:
first you request no panic with panic_on_oops=0, then you override that
with oops_limit=10000.  I think it is more natural to request
panic_on_oops=10000 in one step.  Also, I think it is more natural to
preserve panic_on_oops=0's meaning of no panic on Oops.
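
To make the comparison concrete, here is a rough userspace sketch of the
two semantics.  This is not kernel code: the function names are made up
for illustration, and treating oops_limit=0 as "no limit" is an
assumption of the sketch, not something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Single-knob variant (hypothetical): panic_on_oops is a count.
 *   0 = never panic, 1 = panic on the first oops (today's behavior),
 *   N = panic once N oopses have occurred.
 */
bool should_panic_single_knob(unsigned long panic_on_oops,
                              unsigned long oops_count)
{
    return panic_on_oops != 0 && oops_count >= panic_on_oops;
}

/*
 * Two-knob variant (hypothetical model of the patch as discussed):
 * panic_on_oops stays Boolean, and oops_limit applies only when
 * panic_on_oops is 0.  Assumption: oops_limit == 0 disables the limit.
 */
bool should_panic_two_knobs(bool panic_on_oops,
                            unsigned long oops_limit,
                            unsigned long oops_count)
{
    if (panic_on_oops)
        return true;
    return oops_limit != 0 && oops_count >= oops_limit;
}

/* Small self-check showing where the two variants agree and differ. */
void example_checks(void)
{
    /* Both variants panic on the first oops when panic_on_oops=1. */
    assert(should_panic_single_knob(1, 1));
    assert(should_panic_two_knobs(true, 10000, 1));

    /* Single knob: panic_on_oops=0 never panics, whatever the count. */
    assert(!should_panic_single_knob(0, 10000));

    /* Two knobs: panic_on_oops=0 still panics once oops_limit is hit. */
    assert(should_panic_two_knobs(false, 10000, 10000));
}
```

In the single-knob variant, panic_on_oops=0 keeps meaning "no panic on
Oops"; in the two-knob variant, it only means that until oops_limit is
reached.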

To me, about the only reason to introduce the override is if we want to
literally override a distro's explicit default of panic_on_oops=0.

Alexander