On 1/9/25 15:35, Marko Hoyer wrote:
Am 09.01.25 um 22:10 schrieb Rob Landley:
Buffering or not in the char device is a driver choice. If your serial
hardware has a small FIFO buffer and the driver doesn't do its own
layer of output buffering (with a tasklet or something to copy the
data to the hardware), then the write() syscall will block waiting for
the data to go out. (Writes to filesystems stopped doing this back
around 2.0 or something, when they rewrote the vfs to be based on the
page cache and dentry cache, meaning ALL filesystem writes go through
that now unless you say O_DIRECT to _ask_ for it to block, which isn't
even always honored. But for some reason the TTY layer drives people
insane, and char devices have been given a wide berth...)
Yeah looks like this is the case for RPi Zero W. I guess there is
probably no buffer at all in the RPi serial driver / hw since every log
line from systemd delays systemd for ~10ms (~80ms in baud9600 case).
Well there's gotta be a LITTLE fifo for input or you drop characters all
over the place.
(That's the reason Linus started writing Linux in the first place,
because minix's microkernel design couldn't keep up with serial input,
the overhead of the task switch to the userspace serial receive driver
process took too long and characters got dropped. So he wrote a terminal
program that booted from a floppy, and then taught it to read from and
write to the minix filesystem on his hard drive so he could download
stuff from usenet, then taught it to run "bash" so he didn't have to
reboot to mkdir/mv/rm, and that turned out to be 90% of the way to
getting it to run gcc...)
And serial hardware tends to be symmetrical about that: if it's got 16
chars of input buffer, it'll usually have 16 chars of output buffer. But
that's less than 1/50th of a second at 9600 baud...
(Fun detail: the input fifo often has a programmable watermark so you
can say "fill up this much before generating an interrupt, or if X timer
ticks pass by with no more input" so you don't get an interrupt every
character (and spend all your time entering and exiting the interrupt
code) BUT still have some leeway between the interrupt being generated
and the buffer filling up until it drops characters. The OUTPUT fifo can
do something similar, only from the other end (fill it all the way up,
then generate an interrupt when it drains to the watermark so you can
refill it before it empties and produces a gap in the output).
Programming serial devices can get slightly complicated...
Btw: I can confirm the same for the RPi3 w/ four cores. The difference is
that something seems to go on in the kernel in parallel with the log
writes to serial, but at a certain point the kernel is again waiting,
probably for many seconds, for the serial device to finish transmission.
Systemd's delay is pretty much the same as in the single-core case.
Yeah, the point of a bottleneck is that's the part you're waiting for,
so speeding up the rest of it doesn't help so much.
Optimization is a whole thing. Spinlocks vs semaphores infuriate some
people ("you're intentionally spinning, wasting time?") so sometimes you
need to explain with analogies to get them to stop "helping".
You're standing at a train crossing, and a train is going past; it'll be
through in 10 minutes. If you walk towards the end of the train you'll
reach the end faster and can cross in only 7 minutes, but if you need to
come BACK HERE to where your road is, you'll wind up walking 7 minutes,
crossing, walking 7 minutes back, and resuming from here 14 minutes from
now instead of only 10. Being busy doing the wrong thing and then just
_undoing_ it again, instead of waiting here ready to go, is actually
_slower_ than the waiting.
As said in another mail: I do not know of a valid (production) use case
in which kernel logs need to be dumped to a serial console. I regard
this mechanism as useful only for development purposes (in which fast
boot is probably not so relevant). Please correct me if I'm wrong; I
would be happy to learn about such use cases.
Based on that I think option 3) is the best option for most cases.
You can adjust the loglevel so they still go into dmesg but don't go
out to the console, which theoretically shouldn't be THAT slow? (At
least cpu limited rather than wait-for-hardware.)
With "quiet", logs go into dmesg as well.
Which _used_ to be almost free back when it was just a ring buffer doing
a strlen() and two memcpy() at the wrap. But these days: dunno, haven't
benched it.
But as said, I do not really see use cases for dumping these logs to a
serial console in a boot-time-critical system on each production boot.
Reading dmesg or systemd's journal after the time-critical things are
done should be OK in most cases.
The switch from printk(blah) to pr_loglevel(blah) was IN THEORY so you
could kconfig a minimum loglevel to retain, and all the macros below
that level would drop out of the kernel at compile time, reducing the
kernel image size significantly AND doing nice things with cache
locality and so on. (String processing is expensive, you traverse a lot
of data that goes through the memory bus and evicts cache lines from L1
and L2.)
Last I checked the kernel devs had broken it for some reason, but it
might be working again? (Or was a patch still out of tree...?) Anyway,
if you run out of ideas that's a thing to look for.
Data going across the memory bus is another one of those bottleneck
things, where it doesn't matter how fast your processor is clocked if
you're waiting for memory. An order of magnitude down from where we're
currently looking, but still a thing that comes up a lot once the real
low hanging fruit is dealt with...
Of course there's all sorts of "Loop unrolling! No, smaller L1 cache
footprint! Prefetch! No, spectre/meltdown!" pendulum nonsense I usually
treat roughly the same way as the man trying to cross the street in The
Pink Panther:
https://www.youtube.com/watch?v=nistdsACs3E
I once watched using a lookup table instead of calculating the value be
an optimization, then a pessimization, then an optimization, then a
pessimization, without even recompiling the binary (just upgrading the
hardware). Doing the simple thing is always at least excusable. (And
less to reverse engineer to understand WHY, and a good general argument
against the endless "this is not helping". Basically Chesterton's fence
in software: understanding why it's there lets you throw it out.)
Rob