On 1/9/25 07:02, Marko Hoyer wrote:
Am 09.01.25 um 00:26 schrieb Rob Landley:
CONFIG_EARLY_PRINTK output is emitted before interrupts are enabled
(last I checked they didn't kick in until RIGHT before PID 1 gets
forked off), so the early output drivers spin waiting for the next
character to go into the buffer (the memory mapped register ones look
something like "while (MASK&*status); *output = *data++;" in a for
loop) and the printk() call can't return until all the data has been
queued to the serial hardware, so you spend a lot of time blocked in
printk().
Hi Rob,
thx for the explanation, helps further!
* This implementation would explain the observed behavior.
* What I'm not understanding yet: logs from systemd delay systemd the
same way as seen in the kernel. Looks like the issue is not solved even
when PID 1 is started. As said, It can be something specific to single
core SoCs or even just to RPI Zero W. I'll check ...
Buffering or not in the char device is a driver choice. If your serial
hardware has a small FIFO buffer and the driver doesn't do its own layer
of output buffering (with a tasklet or something to copy the data to the
hardware), then the write() syscall will block waiting for the data to
go out. (Writes to filesystems stopped doing this back around 2.0 or
something, when they rewrote the vfs to be based on the page cache and
deentry cache, meaning ALL filesystem writes go through that now unless
you say O_DIRECT to _ask_ for it to block, which isn't even always
honored. But for some reason the TTY layer drives people insane, and
char devices have been given a wide berth...)
There's a similar issue with some xterms where "make -j16 build"
spamming lots of output to a display terminal can run significantly
slower than "make -j16 build | cat" because the linux pipe
infrastructure inserts a pipe buffer (ulimit -p says 8 but I think
that's _pages_ so 32k? Except in 2.6.11 it was 64k? Eh, not looking it
up...) so the writes from each cc instance go into the pipe buffer and
return immediately when it's not full, whereas writes to a terminal
device block until the terminal has finished updating (which includes
scrolling the screen).
If I recall (many years ago), the kde terminal implementation included a
buffer of its own (immediately returned before updating), and the gnome
one didn't (blocked until x11 display update completed), so foreground
builds were faster under kde.
And the gnome guys' answer was to spray everything down with 3D
acceleration so the GPU was scrolling the screen for you, because of
course it was.
Anyway, serialized latency has _always_ killed throughput, because it's
a cost you can't get BACK. The kernel guys used to know this:
https://yarchive.net/comp/linux/raid0.html
Hence the old high school math problem: if you have 2 hours to go 100
miles and you travel the first 40 miles at 20 miles per hour, how fast
do you have to go the rest to make it on time? Answer: you'd have to
instantaneously teleport because you spent 2 hours going 40 miles and
your time is up with 60 miles left to go. Optimizing the wrong part
DOESN'T HELP.
As said in another mail: I do not know a valid (production) use case in
which kernel logs need to be dumped to a serial console. I regard this
mechanism only as useful for development purposes (in which fast boot is
probably not so relevant). Please correct me if I'm wrong, would be
happy to learn about such use cases.
Based on that I think option 3) is the best options for most cases.
You can adjust the loglevel so they still go into dmesg but don't go out
to the console, which theoretically shouldn't be THAT slow? (At least
cpu limited rather than wait-for-hardware.)
But keep in mind that a lot of kernel devs seem actively trying to
sabotage embedded development, and the dmesg metadata is now multiple
times the size of the actual payload.
https://lkml.iu.edu/hypermail/linux/kernel/2412.0/00045.html
So trimming the dmesg buffer size is probably ALSO a good idea on modern
embedded systems, because WHAT THE FSCK? Also, I dunno how much cpu time
all that metadata fapping takes. (Before systemd, this was just text
going into a ring buffer. But systemd couldn't cope with that because
https://www.theregister.com/2014/04/05/torvalds_sievers_dust_up/ and
https://en.wikipedia.org/wiki/Systemd#History and of course
https://www.linux.com/news/fun-photo-greg-kroah-hartman-crowned-systemd-hackfest/
thus the data structures in the kernel had to become far more complex
after about 30 years of NOT being like that, because systemd.)
Rob