Re: [PATCH v2] sched/all: Change all BUG_ON() instances in the scheduler to WARN_ON_ONCE()

Mel Gorman <mgorman@xxxxxxx> · Tue, 16 Aug 2022 10:40:56 +0100

On Mon, Aug 15, 2022 at 03:12:19PM -0700, John Hubbard wrote:
> On 8/15/22 07:41, Mel Gorman wrote:
> > For the rest, I didn't see obvious recovery paths that would allow the
> > system to run predictably. Any of them firing will have unpredictable
> > consequences (e.g. move_queued_task firing would be fun if it was a per-cpu
> > kthread). Depending on which warning triggers, the remaining life of the
> > system may be very short but maybe long enough to be logged even if system
> > locks up shortly afterwards.
> 
> Hi Mel,
> 
> Are you basically saying, "WARN_ON*() seems acceptable in mm/, because
> we can at least get the problem logged before it locks up, probably"?
> 

I don't consider this to be a subsystem-specific guideline and I am certainly
not suggesting that mm/ is the gold standard but my understanding was
"If a warning is recoverable in a sensible fashion then do so". A warning
is always bad, but it may be possible for the system to continue long
enough for the warning to be logged or an admin to schedule a reboot at
a controlled time. I think Ingo is doing the right thing with this patch
to avoid an unnecessary panic but for at least some of them, a partial
recovery is possible so why not do it seeing as it's been changed anyway?

The outcome of a warning is very variable, it might be a coding error where
state is unexpected but it can be reset to a known state, maybe some data
is leaked that if it persists we eventually OOM, maybe a particular feature
will no longer work as expected (e.g. load balancing doesn't balance load
due to a warning) or maybe the whole system will fail completely in the
near future, possibly before an error can be logged. In the scheduler
side, a warning may mean that a task never runs again and another means
that a task is in the wrong scheduling class which means ... no idea,
but it's not good if it's a deadline task that gets manipulated by the
normal scheduling class instead.

For mm/, take three examples

1. compaction.c warning. PFN ranges are unexpected, reset them.
   Compaction is less effective and opportunities were missed but the system
   will not blow up. Maybe at worst a hugetlb allocation would fail when it
   might otherwise have succeeded or maybe SLUB degrades because it can't
   use a high-order page for a cache that could reside in an order-0 page
   instead. It varies.

2. Warning in in filemap_unaccount_folio -- possible loss of unwritten
   data on normal filesystems. Bad, might cause inconsistency for state but
   maybe it'll be dirtied again and it'll recover eventually.  It's certainly
   not good if that data got lost forever.

3. The warning in isolate_movable_page is bad, page flags state is
   either corrupted by a driver or maybe there was a bit flip error but a
   page may be migrated unexpectedly leading to Who Knows What, corruption
   of a page that gets freed and reused by another process? It's much worse
   than compaction restoring its internal state of where it is scanning
   but maybe not a complete disaster.

> Or are you implying that we should instead introduce and use some new
> PANIC_ON() or VM_PANIC_ON() set of macros that would allow proper "log
> then reboot" behavior?
> 

No, not as a general rule. VM_PANIC_ON would suggest it's under DEBUG_VM
which is not always set and I do not see how either PANIC_ON or VM_PANIC_ON
could be coded in such a way that it always gets logged because it depends
on the context the bad condition occurred.

I also don't think the kernel could conditionally warn or panic for a
specific instance because if an admin wants a panic on a warning (can be
very useful for a crash dump), then panic_on_warn is available.

Either way, PANIC_ON or VM_PANIC_ON is outside the scope of this patch.

My understanding was that a panic() should only be used in the case where the
kernel is screwed and if tries to continue, the outcome will be much worse
or it's early in boot and the error condition means it'll be impossible
to boot anyway.

-- 
Mel Gorman
SUSE Labs