Re: Serial console is causing system lock-up

Sergey Senozhatsky <sergey.senozhatsky.work@xxxxxxxxx> · Tue, 12 Mar 2019 11:32:31 +0900

On (03/07/19 15:21), John Ogness wrote:
> > John, sorry to ask this, does new printk() design always provide
> > latency guarantees good enough for PREEMPT_RT?
> 
> Yes, because it is assumed that emergency messages will never occur for
> a correctly running system.
>
[..]
> Obviously as soon as any emergency message appears, an _unacceptable_
> latency occurs. But that is considered OK because the system is no
> longer running correctly and it is worth the price to pay to get those
> messages with such high reliability.

OK, so what *I'm learning* from this bug report:

10) WARN/ERR messages do not necessarily tell us that the stability of the
    system was jeopardized. The system can "run correctly" and be
    "perfectly healthy".

20) We can have N CPUs reporting issues simultaneously. Even in production.
    Such patterns exist in the kernel.

30) The "reporting part" - printk()->call_console_drivers() - can be the
    slowest one.

  In this particular case, given that Mikulas saw dropped messages,
  checksum calculation was significantly faster than call_console_drivers().
  Now, suppose we have new printk, and suppose we have CPUs A B C D, each of
  them reports a checksum error:

  A prb_lock owner    B prb_lock    C prb_lock    D prb_lock

  A calls call_console_drivers, unlocks prb_lock
  B grabs prb_lock
  B calls call_console_drivers
  A calculates new checksum mismatch
  A calls printk and spins on prb_lock, behind D

  So now we have:

  B prb_lock owner    C prb_lock    D prb_lock    A prb_lock

  And so on

  B C D A  ->  C D A B  ->  D A B C  ->  A B C D  ->  ...

  After M rounds of error reporting (M > N), each CPU, had have to busy
  wait M times (N - 1). Sounds quadratic.

40) goto 10

So I have some doubts regarding some of assumptions behind new printk
design. And the problem is not in prb_lock() unfairness. Current printk
design does look to me SMP-friendly; yes, it has unbound printing loop;
that can be addressed. But it doesn't turn SMP system into UP.

	-ss

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel