On 1/9/25 15:35, Marko Hoyer wrote:
Am 09.01.25 um 22:10 schrieb Rob Landley:
Buffering or not in the char device is a driver choice. If your serial
hardware has a small FIFO buffer and the driver doesn't do its own
layer of output buffering (with a tasklet or something to copy the
data to the hardware), then the write() syscall will block waiting for
the data to go out. (Writes to filesystems stopped doing this back
around 2.0 or something, when they rewrote the vfs to be based on the
page cache and dentry cache, meaning ALL filesystem writes go through
that now unless you say O_DIRECT to _ask_ for it to block, which isn't
even always honored. But for some reason the TTY layer drives people
insane, and char devices have been given a wide berth...)
Yeah looks like this is the case for RPi Zero W. I guess there is
probably no buffer at all in the RPi serial driver / hw since every log
line from systemd delays systemd for ~10ms (~80ms in baud9600 case).
Well there's gotta be a LITTLE fifo for input or you drop characters all
over the place.
(That's the reason Linus started writing Linux in the first place,
because minix's microkernel design couldn't keep up with serial input,
the overhead of the task switch to the userspace serial receive driver
process took too long and characters got dropped. So he wrote a terminal
program that booted from a floppy, and then taught it to read from and
write to the minix filesystem on his hard drive so he could download
stuff from usenet, then taught it to run "bash" so he didn't have to
reboot to mkdir/mv/rm, and that turned out to be 90% of the way to
getting it to run gcc...)
And serial hardware tends to be symmetrical about that: if it's got 16
chars of input buffer, it'll usually have 16 chars of output buffer. But
that's less than 1/50th of a second at 9600 baud...
(Fun detail: the input fifo often has a programmable watermark so you
can say "fill up this much before generating an interrupt, or if X timer
ticks pass by with no more input" so you don't get an interrupt every
character (and spend all your time entering and exiting the interrupt
code) BUT still have some leeway between the interrupt being generated
and the buffer filling up until it drops characters. The OUTPUT fifo can
do something similar, only from the other end (fill it all the way up,
then generate an interrupt when it drains to the watermark so you can
refill it before it empties and produces a gap in the output).
Programming serial devices can get slightly complicated...
Btw: I can confirm the same for the RPi3 w/ four cores. The difference is
that something seems to go on in the kernel in parallel with the log
writes to serial, but at a certain point the kernel is again waiting,
probably for many seconds, for the serial device to finish transmission.
Systemd's delay is pretty much the same as in the single-core case.
Yeah, the point of a bottleneck is that's the part you're waiting for,
so speeding up the rest of it doesn't help so much.
Optimization is a whole thing. Spinlocks vs semaphores infuriate some
people ("you're intentionally spinning, wasting time?") so sometimes you
need to explain with analogies to get them to stop "helping".
You're standing at a train crossing, and a train is going past; it'll be
through in 10 minutes. If you walk towards the end of the train you'll
reach the end faster and can cross in only 7 minutes, but if you need to
come BACK HERE to where your road is, you'll wind up walking 7 minutes,
crossing, walking 7 minutes back, and resuming from here 14 minutes from
now instead of only 10. Being busy doing the wrong thing and then just
_undoing_ it again, instead of waiting here ready to go, is actually
_slower_ than the waiting.
As said in another mail: I do not know of a valid (production) use case
in which kernel logs need to be dumped to a serial console. I regard
this mechanism as useful only for development purposes (in which fast
boot is probably not so relevant). Please correct me if I'm wrong; I
would be happy to learn about such use cases.
Based on that I think option 3) is the best option for most cases.
You can adjust the loglevel so they still go into dmesg but don't go
out to the console, which theoretically shouldn't be THAT slow? (At
least cpu limited rather than wait-for-hardware.)
With "quiet", logs go into dmesg as well.
Which _used_ to be almost free back when it was just a ring buffer doing
a strlen() and two memcpy() at the wrap. But these days: dunno, haven't
benched it.
But as said, I do not really see use cases for dumping these logs to a
serial console in a boot-time-critical system on each production boot.
Reading dmesg or systemd's journal after the time-critical things are
done should be OK in most cases.
The switch from printk(blah) to pr_loglevel(blah) was IN THEORY so you
could kconfig a minimum loglevel to retain, and all the macros below
that level would drop out of the kernel at compile time, reducing the
kernel image size significantly AND doing nice things with cache
locality and so on. (String processing is expensive, you traverse a lot
of data that goes through the memory bus and evicts cache lines from L1
and L2.)
Last I checked the kernel devs had broken it for some reason, but it
might be working again? (Or was a patch still out of tree...?) Anyway,
if you run out of ideas that's a thing to look for.
Data going across the memory bus is another one of those bottleneck
things, where it doesn't matter how fast your processor is clocked if
you're waiting for memory. An order of magnitude down from where we're
currently looking, but still a thing that comes up a lot once the real
low hanging fruit is dealt with...
Of course there's all sorts of "Loop unrolling! No, smaller L1 cache
footprint! Prefetch! No, spectre/meltdown!" pendulum nonsense I usually
treat roughly the same way as the man trying to cross the street in The
Pink Panther:
https://www.youtube.com/watch?v=nistdsACs3E
I once watched using a lookup table instead of calculating the value be
an optimization, then a pessimization, then an optimization, then a
pessimization, without even recompiling the binary (just upgrading the
hardware). Doing the simple thing is always at least excusable. (And
less to reverse engineer to understand WHY, and a good general argument
against the endless "this is not helping". Basically Chesterton's fence
in software: understanding why it's there lets you throw it out.)
Rob