Re: Serial console is causing system lock-up

Sergey Senozhatsky <sergey.senozhatsky.work@xxxxxxxxx> · Thu, 14 Mar 2019 19:30:45 +0900

On (03/13/19 09:43), John Ogness wrote:
> I don't understand how you can think "print or die trying" is replaced
> with another "print or die trying".

Sorry, let me explain. In some contexts CPUs which are spinning on
prb_lock don't do anything else. A careful placement of

        touch_softlockup_watchdog_sync();
        clocksource_touch_watchdog();
        rcu_cpu_stall_reset();
        touch_nmi_watchdog();

keeps the watchdogs away, yes, but that doesn't mean that we are not
sitting on a time bomb. Think of RCU, for instance. We keep rcu_cpu_stall
silent and things can look OK, but that doesn't mean that RCU is OK in
reality; spinning CPUs may hold off grace periods, and memory that is
queued to be freed after a grace period just piles up. So now a
relatively simple issue - raid checksum mismatch in this particular
case - has the potential to turn into an OOM. Quadratic CPU
serialisation doesn't scale. Throw enough reporting CPUs at it and we
may get very close to some big problems. Does this make sense?
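
To put a rough number on the "quadratic" part, here is a toy model. It
is only a sketch under stated assumptions, not kernel code: every CPU
wants to emit one message at the same moment, one lock serialises them,
and printing one message costs one time unit. The function name is
made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model (an assumption, not kernel code): ncpus CPUs each want to
 * emit one message at the same moment and a single lock serialises
 * them. Printing one message costs one time unit, so the CPU that
 * gets the lock k-th has already spun for k units.
 */
uint64_t total_spin_units(uint64_t ncpus)
{
	uint64_t total = 0;

	for (uint64_t k = 0; k < ncpus; k++)
		total += k;	/* k earlier lock holders, one unit each */

	/* Closed form of the same sum: N * (N - 1) / 2. */
	assert(total == ncpus * (ncpus - 1) / 2);
	return total;
}
```

In this model 8 CPUs burn 28 units spinning and 64 CPUs burn 2016, so
the time wasted grows with the square of the number of reporting CPUs
while the amount of useful output grows only linearly.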

This bug report demonstrates that we can have N CPUs reporting warns
simultaneously. And I think that people would want pr_warns and
WARN_ONs to be printed as emergency level messages (it sort of sounds
reasonable; I understand that you have a different opinion on this).

And what I'm thinking is that *probably* we can have a somewhat less
radical approach - the system is not always doomed when it WARNs us -
and a somewhat more "best effort" one. *Maybe* we don't need to apply
full serialisation all the time. *Maybe* full serialisation can be
applied only when we see that we are about to run out of free space in
logbuf. Or maybe we can start dynamically resizing the logbuf. And so
on.
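
The "serialise only when logbuf is nearly full" idea could look
roughly like the check below. This is a hedged userspace sketch: the
function name, the macro and the one-quarter threshold are all
hypothetical and chosen only for illustration, and it says nothing
about how the free space would actually be tracked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical knob: serialise once less than 1/4 of logbuf is free. */
#define SERIALISE_FREE_DIVISOR 4

/*
 * Decide whether writers should fall back to full serialisation.
 * The name and the threshold are illustrative assumptions only.
 */
bool logbuf_needs_serialisation(size_t free_bytes, size_t total_bytes)
{
	assert(total_bytes > 0 && free_bytes <= total_bytes);
	return free_bytes < total_bytes / SERIALISE_FREE_DIVISOR;
}
```

With a 4096 byte buffer this stays on the best-effort path while at
least 1024 bytes are free and switches to full serialisation below
that.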

> By the way, Sergey, I appreciate your skepticism.

Sorry, John. I know I'm a PITA.

	-ss
