Re: [PATCH RFC net-next v3 0/8] netconsole: Add support for CPU population

Breno Leitao <leitao@xxxxxxxxxx> · Mon, 27 Jan 2025 01:52:35 -0800

Hello Andrew,

On Fri, Jan 24, 2025 at 05:02:26PM +0100, Andrew Lunn wrote:
> On Fri, Jan 24, 2025 at 07:16:39AM -0800, Breno Leitao wrote:
> > The current implementation of netconsole sends all log messages in
> > parallel, which can lead to an intermixed and interleaved output on the
> > receiving side. This makes it challenging to demultiplex the messages
> > and attribute them to their originating CPUs.
> > 
> > As a result, users and developers often struggle to effectively analyze
> > and debug the parallel log output received through netconsole.
> 
> I know very little about consoles and netconsle, so this is probably a
> silly question:
> 
> Why is this a netconsole problem, and not a generic console problem?

This issue isn't inherent to netconsole. To provide more context and
clarity, let me take a step back and revisit the history of this
discussion, where the idea of adding an enriched format originated.

Initially, Calvin proposed adding similar information, such as the
kernel release version, to messages via printk, but this approach was
deemed inappropriate. The discussion can be found at the following
link:

https://lore.kernel.org/all/51047c0f6e86abcb9ee13f60653b6946f8fcfc99.1463172791.git.calvinowens@xxxxxx/

Later, we shifted to implementing such enriched messages in netconsole,
which proved to be a less intrusive solution. I implemented the release
append in netconsole, effectively addressing Calvin's original concern.

https://lore.kernel.org/all/20230714111330.3069605-1-leitao@xxxxxxxxxx/

The release append proved to be very useful, and the concept evolved
further during discussions at Linux Plumbers Conference, where we
developed the userdata feature, which lets userspace attach arbitrary
data/text that is sent together with each message.

https://www.youtube.com/watch?v=ILTqn1EYIXQ

This functionality has become *extremely* valuable for hyperscale
environments, leading to current efforts to expand its capabilities
- specifically by adding CPU information and, in future updates, the
current task name.

For instance, at Meta, we append the name of the service that is running
when "something happens" (warning, crash, etc) in the kernel. That helps
to narrow down and categorize issues very easily.

> Can other console types also send in parallel? Do they have the same
> issue of intermixing?

Interpreting logs is straightforward when dealing with a single machine.

However, the complexity grows quickly when managing a large number of
servers and processing logs to gather metrics on systems, kernels, and
more.

For instance, let's come back to appending the kernel version. When
working with a single kernel/host, identifying the kernel version for
a host is simple. If a warning message appears, you can easily attribute
it to that specific kernel version.

In contrast, with millions of servers running multiple kernel versions
and releases, the challenge lies in accurately mapping warnings to their
corresponding kernel versions and releases. That is why having the
kernel release together with the message makes the mapping easy.
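
To illustrate the point, here is a rough receiver-side sketch in Python
(not part of this series). It assumes each received line carries the
kernel release as a leading field, "<release>,<message>". That layout,
the sample lines, and the helper name warnings_per_release() are made up
for illustration and are not the exact netconsole wire format.

```python
from collections import Counter

# Hypothetical receiver-side lines: "<kernel release>,<message>".
# The layout and the sample data are assumptions for illustration,
# not the exact netconsole wire format.
SAMPLE_LINES = [
    "6.4.0-a,WARNING: CPU: 3 PID: 42 at foo.c:10",
    "6.4.0-a,link is up",
    "6.9.1-b,WARNING: CPU: 0 PID: 7 at bar.c:99",
    "6.4.0-a,WARNING: CPU: 1 PID: 9 at foo.c:10",
]


def warnings_per_release(lines):
    """Count WARNING messages per kernel release."""
    counts = Counter()
    for line in lines:
        # Split only on the first comma: release on the left,
        # the rest of the line is the message body.
        release, _, message = line.partition(",")
        if message.startswith("WARNING"):
            counts[release] += 1
    return counts


if __name__ == "__main__":
    for release, count in sorted(warnings_per_release(SAMPLE_LINES).items()):
        print(release, count)
```

With the release attached to every message, the mapping becomes a plain
group-by on the receiver, with no need to look up which kernel a given
host was running at the time the warning fired.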

Thanks for your time reading this and for the discussion,
--breno