Re: [boot-time]

On 1/9/25 15:35, Marko Hoyer wrote:
On 09.01.25 at 22:10, Rob Landley wrote:
Buffering or not in the char device is a driver choice. If your serial hardware has a small FIFO buffer and the driver doesn't do its own layer of output buffering (with a tasklet or something to copy the data to the hardware), then the write() syscall will block waiting for the data to go out. (Writes to filesystems stopped doing this back around 2.0 or something, when they rewrote the vfs to be based on the page cache and dentry cache, meaning ALL filesystem writes go through that now unless you say O_DIRECT to _ask_ for it to block, which isn't even always honored. But for some reason the TTY layer drives people insane, and char devices have been given a wide berth...)

Yeah, looks like this is the case for the RPi Zero W. I guess there is probably no buffer at all in the RPi serial driver / hw, since every log line from systemd delays systemd for ~10ms (~80ms in the 9600 baud case).

Well there's gotta be a LITTLE fifo for input or you drop characters all over the place.

(That's the reason Linus started writing Linux in the first place, because minix's microkernel design couldn't keep up with serial input, the overhead of the task switch to the userspace serial receive driver process took too long and characters got dropped. So he wrote a terminal program that booted from a floppy, and then taught it to read from and write to the minix filesystem on his hard drive so he could download stuff from usenet, then taught it to run "bash" so he didn't have to reboot to mkdir/mv/rm, and that turned out to be 90% of the way to getting it to run gcc...)

And serial hardware tends to be symmetrical about that: if it's got 16 chars of input buffer, it'll usually have 16 chars of output buffer. But that's less than 1/50th of a second at 9600 baud...

Fun detail: the input fifo often has a programmable watermark so you can say "fill up this much before generating an interrupt, or if X timer ticks pass by with no more input", so you don't get an interrupt every character (and spend all your time entering and exiting the interrupt code) BUT still have some leeway between the interrupt being generated and the buffer filling up to the point it drops characters. The OUTPUT fifo can do something similar, only from the other end (fill it all the way up, then generate an interrupt when it drains to the watermark so you can refill it before it empties and produces a gap in the output).

Programming serial devices can get slightly complicated...
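
For the curious, here's a minimal sketch of what programming that receive watermark looks like on a classic 16550-style UART, poking the FIFO Control Register from userspace via x86 port I/O. It assumes the legacy COM1 base address (0x3f8) and root for ioperm(); a real driver lives in the kernel and gets the base address from the device tree or ACPI:

    /* Sketch: program a 16550-style UART's RX FIFO trigger level. */
    #include <sys/io.h>   /* outb(), ioperm() -- x86/glibc specific */

    #define COM1_BASE 0x3f8
    #define UART_FCR  (COM1_BASE + 2)  /* FIFO Control Register (write-only) */

    /* FCR bits, per the 16550 datasheet: */
    #define FCR_ENABLE_FIFO 0x01   /* bit 0: enable both FIFOs */
    #define FCR_CLEAR_RX    0x02   /* bit 1: reset the receive FIFO */
    #define FCR_CLEAR_TX    0x04   /* bit 2: reset the transmit FIFO */
    #define FCR_TRIGGER_14  0xc0   /* bits 6-7: interrupt at 14 of 16 bytes */

    int main(void)
    {
        if (ioperm(COM1_BASE, 8, 1))  /* request access to the I/O ports */
            return 1;

        /* Enable and flush the FIFOs, then ask for a receive interrupt
         * only once 14 of the 16 bytes are full (or the character
         * timeout fires), instead of one interrupt per character. */
        outb(FCR_ENABLE_FIFO | FCR_CLEAR_RX | FCR_CLEAR_TX | FCR_TRIGGER_14,
             UART_FCR);
        return 0;
    }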

Btw: I can confirm the same for the RPi3 with four cores. The difference is that something seems to go on in the kernel in parallel with the log writes to serial, but at a certain point the kernel is again waiting for many seconds, probably for the serial device to finish transmission. Systemd's delay is pretty much the same as in the single-core case.

Yeah, the point of a bottleneck is that's the part you're waiting for, so speeding up the rest of it doesn't help so much.

Optimization is a whole thing. Spinlocks vs. semaphores infuriate some people (you're intentionally spinning, wasting time?), so sometimes you need to explain with analogies to get them to stop "helping".

You're standing at a train crossing and a train is going past; it'll be through in 10 minutes. If you walk towards the end of the train you'll reach the end faster and can cross in only 7 minutes, but if you need to come BACK HERE to where your road is, you'll wind up walking 7 minutes, crossing, walking 7 minutes back, and resuming from here 14 minutes from now instead of only 10. Being busy doing the wrong thing and then just _undoing_ it again, instead of waiting here ready to go, is actually _slower_ than the waiting.
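
(For reference, the thing being defended looks something like this toy sketch using C11 atomics. Illustrative only: real kernel spinlocks also disable preemption, cope with interrupts, and so on.)

    #include <stdatomic.h>

    typedef struct { atomic_flag locked; } spinlock_t;
    #define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

    static void spin_lock(spinlock_t *l)
    {
        /* "Wait here, ready to go": retry until the holder releases.
         * Burns CPU, but for short critical sections that's cheaper
         * than the walk down the train and back (sleep + wakeup). */
        while (atomic_flag_test_and_set_explicit(&l->locked,
                                                 memory_order_acquire))
            ;  /* spin */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }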

As I said in another mail: I do not know of a valid (production) use case in which kernel logs need to be dumped to a serial console. I regard this mechanism as useful only for development purposes (where fast boot is probably not so relevant). Please correct me if I'm wrong; I would be happy to learn about such use cases.

Based on that, I think option 3) is the best option for most cases.

You can adjust the loglevel so they still go into dmesg but don't go out to the console, which theoretically shouldn't be THAT slow? (At least CPU-limited rather than wait-for-hardware.)
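
If you'd rather do that from userspace at runtime than on the kernel command line, the syslog(2) interface (klogctl() in glibc) can lower the console loglevel while leaving the ring buffer alone. A minimal sketch:

    #include <sys/klog.h>

    /* Action 8 (SYSLOG_ACTION_CONSOLE_LEVEL) sets console_loglevel.
     * Level 1 lets only emergency (panic) messages reach the console;
     * everything else still lands in the dmesg ring buffer. Needs
     * CAP_SYSLOG (or CAP_SYS_ADMIN). Same effect as "dmesg -n 1" or
     * booting with loglevel=1. */
    int quiet_console(void)
    {
        return klogctl(8, (char *)0, 1);
    }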

With "quiet", logs go into dmesg as well.

Which _used_ to be almost free back when it was just a ring buffer doing a strlen() and two memcpy() at the wrap. But these days: dunno, haven't benched it.
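
(The old scheme was roughly this sketch: a power-of-two buffer where a message that wraps around the end costs one strlen() and at most two memcpy() calls. Illustrative only, not the modern printk code, which grew per-record metadata.)

    #include <string.h>

    #define LOG_BUF_LEN 4096              /* must be a power of two */
    static char log_buf[LOG_BUF_LEN];
    static unsigned long log_end;         /* total bytes ever logged */

    static void log_append(const char *msg)  /* assumes len <= LOG_BUF_LEN */
    {
        size_t len = strlen(msg);
        size_t off = log_end & (LOG_BUF_LEN - 1);
        size_t room = LOG_BUF_LEN - off;  /* bytes before the wrap point */

        if (len <= room) {
            memcpy(log_buf + off, msg, len);           /* one copy */
        } else {
            memcpy(log_buf + off, msg, room);          /* up to the end... */
            memcpy(log_buf, msg + room, len - room);   /* ...then the rest */
        }
        log_end += len;
    }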

But as I said, I do not really see use cases for dumping these logs to a serial console in a boot-time-critical system on every production boot. Reading dmesg or systemd's journal after the time-critical things are done should be OK in most cases.

The switch from printk(blah) to pr_loglevel(blah) was IN THEORY so you could kconfig a minimum loglevel to retain, and all the macros below that level would drop out of the kernel at compile time, reducing the kernel image size significantly AND doing nice things with cache locality and so on. (String processing is expensive, you traverse a lot of data that goes through the memory bus and evicts cache lines from L1 and L2.)

Last I checked the kernel devs had broken it for some reason, but it might be working again? (Or was a patch still out of tree...?) Anyway, if you run out of ideas that's a thing to look for.
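
The mechanism in miniature, with made-up macro names standing in for the kernel's pr_*()/kconfig plumbing: when the threshold is a compile-time constant, the compiler constant-folds the test and below-threshold messages vanish from the binary, format string and all.

    #include <stdio.h>

    #define LOG_ERR   3
    #define LOG_INFO  6
    #define LOG_DEBUG 7

    #define MIN_LOGLEVEL LOG_INFO  /* imagine this coming from kconfig */

    /* The ## varargs comma-swallowing is a GNU extension, which the
     * kernel itself relies on. */
    #define pr(level, fmt, ...) do { \
            if ((level) <= MIN_LOGLEVEL) \
                printf(fmt, ##__VA_ARGS__); \
        } while (0)

    int main(void)
    {
        pr(LOG_ERR,   "kept: compiled in\n");
        pr(LOG_DEBUG, "dropped at compile time, string and all\n");
        return 0;
    }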

Data going across the memory bus is another one of those bottleneck things, where it doesn't matter how fast your processor is clocked if you're waiting for memory. An order of magnitude down from where we're currently looking, but still a thing that comes up a lot once the real low hanging fruit is dealt with...

Of course there's all sorts of "Loop unrolling! No, smaller L1 cache footprint! Prefetch! No, spectre/meltdown!" pendulum nonsense I usually treat roughly the same way as the man trying to cross the street in The Pink Panther:

https://www.youtube.com/watch?v=nistdsACs3E

I once watched a lookup table (instead of calculating the value) be an optimization, then a pessimization, then an optimization, then a pessimization again, without even recompiling the binary (just upgrading the hardware). Doing the simple thing is always at least excusable. (And there's less to reverse engineer to understand WHY, which is a good general argument against the endless "this is not helping". Basically Chesterton's fence in software: understanding why it's there lets you throw it out.)
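
(A concrete example of that kind of flip-flopping choice, not the actual case I watched, just an illustration: counting set bits via a 256-entry table versus pure computation. The table costs memory traffic and possible cache misses, the computation costs ALU instructions, and which wins depends on the hardware of the week.)

    #include <stdint.h>

    /* The classic bit-counting table, built by the preprocessor. */
    static const uint8_t popcnt8[256] = {
    #define B2(n) (n), (n)+1, (n)+1, (n)+2
    #define B4(n) B2(n), B2((n)+1), B2((n)+1), B2((n)+2)
    #define B6(n) B4(n), B4((n)+1), B4((n)+1), B4((n)+2)
        B6(0), B6(1), B6(1), B6(2)
    };

    static int popcount_table(uint32_t x)  /* lookup: four loads */
    {
        return popcnt8[x & 0xff] + popcnt8[(x >> 8) & 0xff]
             + popcnt8[(x >> 16) & 0xff] + popcnt8[x >> 24];
    }

    static int popcount_calc(uint32_t x)   /* compute: pure ALU, no memory */
    {
        x = x - ((x >> 1) & 0x55555555);
        x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
        return (((x + (x >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24;
    }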

Rob
