Re: [boot-time]

Marko Hoyer <mhoyer.oss-devel@xxxxxxxxxx> · Thu, 9 Jan 2025 22:35:42 +0100

Am 09.01.25 um 22:10 schrieb Rob Landley:
On 1/9/25 07:02, Marko Hoyer wrote:
Am 09.01.25 um 00:26 schrieb Rob Landley:
CONFIG_EARLY_PRINTK output is emitted before interrupts are enabled 
(last I checked they didn't kick in until RIGHT before PID 1 gets 
forked off), so the early output drivers spin waiting for the next 
character to go into the buffer (the memory mapped register ones 
look something like "while (MASK&*status); *output = *data++;" in a 
for loop) and the printk() call can't return until all the data has 
been queued to the serial hardware, so you spend a lot of time 
blocked in printk().

Hi Rob,

thx for the explanation, helps further!

* This implementation would explain the observed behavior.

* What I'm not understanding yet: logs from systemd delay systemd the 
same way as seen in the kernel. Looks like the issue is not solved 
even when PID 1 is started.  As said, It can be something specific to 
single core SoCs or even just to RPI Zero W. I'll check ...

Buffering or not in the char device is a driver choice. If your serial 
hardware has a small FIFO buffer and the driver doesn't do its own 
layer of output buffering (with a tasklet or something to copy the 
data to the hardware), then the write() syscall will block waiting for 
the data to go out. (Writes to filesystems stopped doing this back 
around 2.0 or something, when they rewrote the vfs to be based on the 
page cache and deentry cache, meaning ALL filesystem writes go through 
that now unless you say O_DIRECT to _ask_ for it to block, which isn't 
even always honored. But for some reason the TTY layer drives people 
insane, and char devices have been given a wide berth...)

Yeah looks like this is the case for RPi Zero W. I guess there is 
probably no buffer at all in the RPi serial driver / hw since every log 
line from systemd delays systemd for ~10ms (~80ms in baud9600 case).

Btw: I can confirm the same for RPi3 w/ four cores. Difference is that 
something seems to go on in kernel in parallel to logs writing to serial 
but at a certain point the kernel is waiting again for lot of seconds 
probably for the serial device to finish transmission. Systemds delay is 
pretty much similar to the single core case.

There's a similar issue with some xterms where "make -j16 build" 
spamming lots of output to a display terminal can run significantly 
slower than "make -j16 build | cat" because the linux pipe 
infrastructure inserts a pipe buffer (ulimit -p says 8 but I think 
that's _pages_ so 32k? Except in 2.6.11 it was 64k? Eh, not looking it 
up...) so the writes from each cc instance go into the pipe buffer and 
return immediately when it's not full, whereas writes to a terminal 
device block until the terminal has finished updating (which includes 
scrolling the screen).

If I recall (many years ago), the kde terminal implementation included 
a buffer of its own (immediately returned before updating), and the 
gnome one didn't (blocked until x11 display update completed), so 
foreground builds were faster under kde.

And the gnome guys' answer was to spray everything down with 3D 
acceleration so the GPU was scrolling the screen for you, because of 
course it was.

Anyway, serialized latency has _always_ killed throughput, because 
it's a cost you can't get BACK. The kernel guys used to know this:

https://yarchive.net/comp/linux/raid0.html

Hence the old high school math problem: if you have 2 hours to go 100 
miles and you travel the first 40 miles at 20 miles per hour, how fast 
do you have to go the rest to make it on time? Answer: you'd have to 
instantaneously teleport because you spent 2 hours going 40 miles and 
your time is up with 60 miles left to go. Optimizing the wrong part 
DOESN'T HELP.

Absolutely correct.
As said in another mail: I do not know a valid (production) use case 
in which kernel logs need to be dumped to a serial console. I regard 
this mechanism only as useful for development purposes (in which fast 
boot is probably not so relevant). Please correct me if I'm wrong, 
would be happy to learn about such use cases.

Based on that I think option 3) is the best options for most cases.

You can adjust the loglevel so they still go into dmesg but don't go 
out to the console, which theoretically shouldn't be THAT slow? (At 
least cpu limited rather than wait-for-hardware.)

With quiet logs go into dmesg as well.

But as said, i do not really see use cases to dump out these logs to a 
serial console in a boot time critical system on each production boot. 
Reading dmesg or systemd's journal after time critical things are done 
should be ok in most case.

So my recommendation to people who seek for fastboot potential:

* switch of kernel to console log during startup using quiet cmdline 
option if you don't need these logs over serial essentially

* if you need them, your (Rob) options could be applied reducing the 
effect as best as possible

But keep in mind that a lot of kernel devs seem actively trying to 
sabotage embedded development, and the dmesg metadata is now multiple 
times the size of the actual payload.

https://lkml.iu.edu/hypermail/linux/kernel/2412.0/00045.html

So trimming the dmesg buffer size is probably ALSO a good idea on 
modern embedded systems, because WHAT THE FSCK? Also, I dunno how much 
cpu time all that metadata fapping takes. (Before systemd, this was 
just text going into a ring buffer. But systemd couldn't cope with 
that because 
https://www.theregister.com/2014/04/05/torvalds_sievers_dust_up/ and 
https://en.wikipedia.org/wiki/Systemd#History and of course 
https://www.linux.com/news/fun-photo-greg-kroah-hartman-crowned-systemd-hackfest/ 
thus the data structures in the kernel had to become far more complex 
after about 30 years of NOT being like that, because systemd.)

Rob
Marko