On Mon, 2023-10-09 at 13:05 +0300, Mikko Rapeli wrote: > Yocto Linux distro maintainer Richard Purdie is seeing a regression or behavior > change after updating kernel from 6.4 to 6.5. Yocto runs a lot of automated tests with qemu > where a python test framework configures and spawns qemu (version 8.1) with two > serial ports and boots a newly built kernel and rootfs there. The python test framework > tries to detect getty login prompt from ttyS1 while boot time logging > from qemu, kernel, userspace etc happens on serial console ttyS0. This has worked > reliably for years and across multiple kernel versions. Now with 6.5 kernel there are > suddenly frequent failures where ttyS1 doesn't show the login prompt in the python code > which listens on the qemu socket from target system ttyS1. The target system ttyS0 is working > and boot time logs and getty prompt are visible there. The ttyS1 logs from test framework show > that only two characters CR and LF make it through. This same happens on x86_64 host > running qemu machine for x86_64 and arm64 targets. > > The kernel boot logs from target system with 6.5.5 don't show any errors and getty > processes are started correctly and they don't see any errors either. Data from > target system userspace is being written, but seems to sometimes hang inside kernel > or qemu serial port buffers, and only on the ttyS1 while ttyS0 works reliably all > the time. > > Example target logs with 6.5.5 via ttyS0: > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853 > or https://pastebin.com/raw/jRRa2CwW > > ttyS1 logs from the same run: > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2 > > Kernel config: > https://pastebin.com/raw/Lv9Hfeuh > > Do you have any ideas or hints how to debug or possibly fix this? > > tty layer has seen quite a few major changes between 6.4.14 and 6.5.5 and we're not able > to pinpoint or bisect the issue, since it so annoyingly rare. > > Some more details from Richard below. Brief summary: We're seeing an issue on x86_64 with 6.5.X where data appears to be left in the transmission buffer and not sent to the port on the second serial port (ttyS1) until we trigger it with intervention. Paul Gortmaker did some painful bisection over a few days down to: serial: core: Start managing serial controllers to enable runtime PM https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118c11@xxxxxxxxx/T/#t More details: We run very simple plain images under qemu with two serial ports configured. These are x86_64 images with two 16550A ports with the standard x86 port addresses which appear as ttyS0 and ttyS1, nothing special. We run a console and getty on ttyS0 and a getty on ttyS1. After boot, we wait for a getty to appear on ttyS1, login to it and run commands. With 6.5.5 (and latest 6.5.X head as of yesterday), we see an issue say 1% of the time where the getty never appears on ttyS1 even after our timeout of 1000s. When this happens we've added code to login to the ttyS0 getty and run debug commands. We've been able to confirm the getty is running and the init system doesn't matter (happens with sysvinit and systemd). The most interesting debug I've seen is this: root@qemux86-64:~# cat /proc/tty/driver/serial serinfo:1.0 driver revision: 0: uart:16550A port:000003F8 irq:4 tx:418 rx:43 RTS|CTS|DTR|DSR|CD 1: uart:16550A port:000002F8 irq:3 tx:249 rx:0 RTS|CTS|DTR|DSR|CD 2: uart:unknown port:000003E8 irq:4 3: uart:unknown port:000002E8 irq:3 root@qemux86-64:~# echo helloA > /dev/ttyS1 root@qemux86-64:~# echo helloB > /dev/ttyS0 helloB root@qemux86-64:~# cat /proc/tty/driver/serial serinfo:1.0 driver revision: 0: uart:16550A port:000003F8 irq:4 tx:803 rx:121 RTS|CTS|DTR|DSR|CD 1: uart:16550A port:000002F8 irq:3 tx:281 rx:0 RTS|CTS|DTR|DSR|CD 2: uart:unknown port:000003E8 irq:4 3: uart:unknown port:000002E8 irq:3 This is being run after the getty didn't appear for 60s on ttyS1 so we've logged into ttyS0 and run these commands. We've seen that if it doesn't appear after 60s, it won't appear after 1000s either. The tx:249 is interesting as it should be tx:273, 273 being the number of bytes our successful serial getty prompt has. Once we echo something to the port (8 bytes), tx: jumps to 281, so it suddenly found our missing login prompt. This is confirmed with the data appearing on the port after the echo. I did try disabling the autosuspend code in the commit above but it made no difference. Reproducing this is a pain. I thought I had a local setup doing it fairly frequently, then spent an hour this morning without it happening once. I'm hoping someone might recognise what is going on here? I'm happy to try and add debug or find further information if that would help narrow it down. Cheers, Richard