Hi linux-serial and Greg, Yocto Linux distro maintainer Richard Purdie is seeing a regression or behavior change after updating kernel from 6.4 to 6.5. Yocto runs a lot of automated tests with qemu where a python test framework configures and spawns qemu (version 8.1) with two serial ports and boots a newly built kernel and rootfs there. The python test framework tries to detect getty login prompt from ttyS1 while boot time logging from qemu, kernel, userspace etc happens on serial console ttyS0. This has worked reliably for years and across multiple kernel versions. Now with 6.5 kernel there are suddenly frequent failures where ttyS1 doesn't show the login prompt in the python code which listens on the qemu socket from target system ttyS1. The target system ttyS0 is working and boot time logs and getty prompt are visible there. The ttyS1 logs from test framework show that only two characters CR and LF make it through. This same happens on x86_64 host running qemu machine for x86_64 and arm64 targets. The kernel boot logs from target system with 6.5.5 don't show any errors and getty processes are started correctly and they don't see any errors either. Data from target system userspace is being written, but seems to sometimes hang inside kernel or qemu serial port buffers, and only on the ttyS1 while ttyS0 works reliably all the time. Example target logs with 6.5.5 via ttyS0: https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853 or https://pastebin.com/raw/jRRa2CwW ttyS1 logs from the same run: https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2 Kernel config: https://pastebin.com/raw/Lv9Hfeuh Do you have any ideas or hints how to debug or possibly fix this? tty layer has seen quite a few major changes between 6.4.14 and 6.5.5 and we're not able to pinpoint or bisect the issue, since it so annoyingly rare. Some more details from Richard below. Cheers, -Mikko On Sun, Oct 08, 2023 at 09:23:57AM +0100, Richard Purdie wrote: > On Sat, 2023-10-07 at 23:05 +0100, Richard Purdie via > lists.openembedded.org wrote: > > I thought I'd summarise where things are at with the 6.5 kernel. > > > > We've fixed: > > * the ARM LTP OOM lockup (kernel patch) > > * the locale ARM selftest failure which was OOM due to silly buffer > > allocations in 6.5 (kernel commandline) > > * the ARM jitterentropy errors (kernel patch) > > * the cryptodev build failures (recipe updated) > > > > We've also: > > * disabled the strace tests that fail with 6.5. > > * made sure the serial ports and getty counts match > > * added ttyrun which wraps serial consoles and avoids hacks > > * made the qemurunner logging save all the port logs > > * made the qemurunner write the binary data it is sent verbatim > > * made sure to use nodelay on qemu's tcpserial > > > > This leaves an annoying serial console problem where ttyS1 never has > > the getty login prompt appear. > > > > What we know: > > > > * We've only seen this on x86 more recently (yesterday/today) but have > > seen it on ARM in the days before that. > > > > * It affects both sysvinit and systemd images. > > > > * Systemd does print that it started a getty on ttyS0 and ttyS1 when > > the failure occurs > > > > * There is a getty running according to "ps" when the failure occurs > > > > * There are only ever one or three characters received to ttyS1 in the > > failure case (0x0d and 0x0a chars, i.e. CR and LF) > > > > * It can't be any kind of utf-8 conversion issue since the login prompt > > isn't visible in the binary log > > > > * the kernel boot logs do show the serial port created with the same > > ioport and irq on x86. > > > > Previously we did see some logs with timing issues on the ttyS0 port > > but the nodelay parameter may have helped with that. > > > > There are debug patches in master-next against qemurunner which try and > > poke around to gather more debug when things fail using ttyS0. > > > > The best failure log we have is now this one: > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5874/steps/14/logs/stdio > > > > where I've saved the logs: > > > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853 > > and > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2 > > > > You can see ttyS1 times out after 1000 seconds and the port only has a > > single byte (in the .2 file). The other log shows ps output showing the > > getty running for ttyS1. > > > > Ideas welcome on where from here. > > > > I've tweaked master-next to keep reading the ttyS1 port after we poke > > it from ttyS0 to see if that reveals anything next time it fails (build > > running). > > Testing overnight with the new debug yielded: > > https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5895/steps/14/logs/stdio > > The interesting bit being: > > """ > WARNING: core-image-full-cmdline-1.0-r0 do_testimage: Extra read data: > Poky (Yocto Project Reference Distro) 4.2+snapshot- > 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1 > > qemux86-64 login: helloA > > Poky (Yocto Project Reference Distro) 4.2+snapshot- > 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1 > qemux86-64 login: > > """ > > i.e. the getty didn't appear in 1000s but sometime in shutdown the > original prompt, the "helloA" and the new getty prompt did. > > So the data *is* there but stuck in a buffer somehow. Kernel or qemu > side, I don't know. > > Cheers, > > Richard > > > >