Hi, Thanks a lot Paul! Updating linux-serial that bisection shows 84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM caused this regression in ttyS1 functionality with qemu. Lots of more details below. Cheers, -Mikko On Fri, Oct 13, 2023 at 11:15:35AM -0400, Paul Gortmaker wrote: > So, here is an update for those not following on IRC: > > [Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary)] On 11/10/2023 (Wed 00:50) Paul Gortmaker wrote: > > [snip details of getting RP's chunk of AB python working as a 1% reproducer] > > > Next steps: > > > > Each run takes approximately 20s, and with a 1.5% success factor, a > > minimum of 150 clean runs is needed before one can say "OK, *maybe* this > > kernel currently under test is OK. That translates to over an hour to > > get a data point you have some faith in as being "good". [Still a heck > > of a lot better than doing 150 *builds* though!] > > > > On the other hand, just one instance of missing login on ttyS1 is > > concretely "bad" and hence the test run can be stopped as soon as one > > instance of that is seen. > > > > This "good is really maybe perhaps good, but bad is for sure bad" type > > of bisects that I love to hate. > > > > Since the data points are "expensive", I probably won't just attempt a > > mindless brute force bisect. I'd probably do something more like... > > > > --ensure modules (or lack of) don't impact the results, so I don't have > > to rebundle rootfs images. > > This turned out to be true - didn't need modules. > > > --maybe see if I can build with host gcc vs. using devshell/sysroot gcc > > and still see the issue. Although really, kernel build time optimization > > won't be the bottleneck in this particular case, so not worthwhile? > > I didn't bother with the above, since build time was not significant here. > > > --check vanilla v6.5 (i.e. v6.5.0 -- no linux-stable content) so I know > > whether to dig around in v6.5.0 --> v6.5.5, or dig around in the > > much much larger v6.4 --> v6.5 additions. I expect it to be the > > latter, i.e. digging through mainline, but who knows. > > As expected, it followed v6.5.0 and linux-stable isn't to blame here. > > > --assuming mainline introduced it, instead of a brute force bisect, I'd > > look for merges by Linus from gregKH pull requests for the tty and > > serial subsystems in the merge window (v6.4 ---> v6.5-rc1) and test > > both sides of that/those merge(s). > > This also largely went according to plan. I started with the tty merge: > > v6.5-rc1~74: > 868a9fd94807 Merge tag 'tty-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty > > and did burrow down into the 60-odd commits within. One complication > was that there were interim mainline merges into the dev stream: > > $ git log --oneline --merges v6.5-rc1~74^2 ^v6.5-rc1~74^1 > 2076b2a4a6b7 Merge 6.4-rc5 into tty-next > 0e4daea31d83 Merge 6.4-rc3 into tty-next > > ..which usually indicate a dependency on work from another subsystem. > Fortunately the test pass/fail boundary didn't land on one of those. > > > --if things haven't completely fallen apart by that point, I'd be left > > with a relatively small set of subsystem specific commits from a single > > pull request which then could be bisected. > > We did somehow manage to get here with just 60 commits to consider. > > > > > Sometimes in the past, when I've got down to that small set of commits, > > the culprit almost jumps out at you when scanning over the shortlogs. > > While I didn't immediately spot the commit via the shortlog, extensive > test iterations did give us confidence in pinpointing a single commit: > > 84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM ******* <---- FAIL (33/1436) > ae62c49c0cef serial: 8250: omap: convert to modern PM ops <---- PASS (0/1593) > > > Anyway, that is where my thoughts are at. But of course we don't even > > yet know if this "test" is even accurately reflecting the OEQA/AB issue! > > So nobody should get their hopes up too early. > > I think now, we can say with a solid degree of confidence that the PM > commit gets to own the responsibility of causing the ttyS1 getty not > flushing the login banner and prompt. Hence why I did over 1000 runs on > both the PASS and FAIL commit boundary. > > Unfortunately, there is a lot in that commit ; also a lot in the > discussion and the twelve iterations before acceptance that I've not > even looked at yet: > > https://lore.kernel.org/linux-serial/1431f5b4-fb39-483b-9314-ed2b7c118c11@xxxxxxxxx/T/#t > > My "next steps" at this point would be to try and attempt to break down > that relatively large commit into multiple "sub-commits" in order to use > the AB test harness portion to better get an idea of which change > component is the root cause of the ttyS1 getty stall/seizure. > > I won't get to that until Saturday at the earliest - going AFK after > sending this. However, Mikko already opened a dialog with the > linux-serial folks a few days ago: > > https://lore.kernel.org/all/ZSPQY6UYg21Z0PnN@nuoska/ > > Mikko -- if you want to take our pseudo bisect information (below) and > continue that thread; sharing what we've learned since, and also loop in > Tony (author of v6.4-rc3-31-g84a9582fd203) and others on the CC there, > that might be a good next step to do now, while I am AFK. > > A quick technical note on testing - all my kernels have been built in > the same devshell instance, so same path, same compiler, same ".config", > same "test harness" from RP, same qemu, same underlying machine. Both > Richcard and I have also found that loading on the host machine running > the qemu binary (ubu 20.04 default in my case) influences the rate of > incidence - an otherwise idle machine is less likely to show the issue > vs a machine that has other things going on -- indicating a qemu > scheduling component influences the incident rate. > > Paul. > -- > > Note asterik "*****" marking suspect commit. > > v6.5.7 <---- FAIL (2/188) > v6.5.5 <---- FAIL (13/1039) > v6.5.0 <---- FAIL (6/422) > v6.5-rc1 <---- FAIL (2/230) > > v6.5-rc1~74: > 868a9fd94807 Merge tag 'tty-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty > > $git log --oneline v6.5-rc1~74^2 ^v6.5-rc1~74^1 > e534756c9412 tty_audit: make data of tty_audit_log() const <---- FAIL (3/266) > e64ed44bce43 tty_audit: make tty pointers in exposed functions const > 3e540a7bad85 tty_audit: make icanon a bool > 94f94810efc4 tty_audit: invert the condition in tty_audit_log() > e1488513abee tty_audit: use kzalloc() in tty_audit_buf_alloc() > 777e456852f7 tty_audit: use TASK_COMM_LEN for task comm > a82d62f70854 Revert "8250: add support for ASIX devices with a FIFO bug" > 27a826837ec9 serial: atmel: don't enable IRQs prematurely > 930cbf92db01 tty: serial: Add Nuvoton ma35d1 serial driver support > e0edfdc15863 tty: serial: fsl_lpuart: add earlycon for imx8ulp platform > 639949a7031e tty: serial: imx: fix rs485 rx after tx > e8cc334847db selftests: tty: add selftest for tty timestamp updates > 360c11e2258c tty: tty_io: update timestamps on all device nodes > 4903fde8047a tty: fix hang on tty device with no_room set > cef09673c35b serial: core: fix -EPROBE_DEFER handling in init > 20a41a62618d serial: 8250_omap: Use force_suspend and resume for system suspend > f3710f5e9e1a tty: serial: samsung_tty: Use abs() to simplify some code > 832e231cff47 tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() when iterating clk > a9c09546e903 tty: serial: samsung_tty: Fix a memory leak in s3c24xx_serial_getclk() in case of error > d9f59caf94a9 serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE > 1eea99c04555 powerpc/legacy_serial: Handle SERIAL_8250_FSL=n build failures > 52861a3be8d6 serial: core: don't kfree device managed data > b6c7ff2693dd serial: 8250_mtk: Simplify clock sequencing and runtime PM > 643662d12bc2 serial: st-asc: fix typo in property name > 66eff0ef528b powerpc/legacy_serial: Warn about 8250 devices operated without active FSL workarounds > 2076b2a4a6b7 Merge 6.4-rc5 into tty-next <---- FAIL (2/145) > 49c80922ff81 serial: core: Fix error handling for serial_core_ctrl_device_add() <---- FAIL (6/992) > 539914240a01 serial: core: Fix probing serial_base_bus devices <---- FAIL (3/145) > d0a396083e91 serial: core: Don't drop port_mutex in serial_core_remove_one_port > 84a9582fd203 serial: core: Start managing serial controllers to enable runtime PM ******* <---- FAIL (33/1436) > ae62c49c0cef serial: 8250: omap: convert to modern PM ops <---- PASS (0/1593) > c53aab207622 serial: 8250: omap: Move uart_write() inside PM section > d2d4bd217ccd serial: 8250-fsl: Expand description of the MPC83xx UART's misbehaviour > 1b997aef4f54 serial: Indicate fintek option may also be required for RS232 support > 20ec397d694b tty: serial: fsl_lpuart: Check the return value of dmaengine_tx_status <---- PASS (0/316) > 0d07703be74f serial: Switch i2c drivers back to use .probe() > d0b309a5d3f4 serial: 8250: synchronize and annotate UART_IER access > 25614735a647 serial: 8250: lock port for UART_IER access in omap8250_irq() > 8b45503776b6 serial: 8250: lock port for omap8250_restore_regs() > 87660fb4041f serial: 8250: lock port for rx_dma() callback > 51e45fba14bf serial: core: lock port for start_rx() in uart_resume_port() > ca73a892c5be serial: 8250: lock port for stop_rx() in omap8250_irq() > abcb0cf1f5b2 serial: core: lock port for stop_rx() in uart_suspend_port() > b1207d86169d serial: 8250: lock port in startup() callbacks > 0e4daea31d83 Merge 6.4-rc3 into tty-next > d5b3d02d0b10 serial: Make uart_remove_one_port() return void > 6bd6cd29c924 serial: stm32: Ignore return value of uart_remove_one_port() in .remove() > 72fc578b220c serial: 8250_rt288x: Remove unnecessary UART_REG_UNMAPPED > 33e3b0eb7550 serial: 8250_rt288x: Name non-standard divisor latch reg > b334214ea08d serial: 8250: RT288x/Au1xxx code away from core > 30c61f53fdf2 serial: 8250: Add dl_read/write, bugs and mapsize into plat_serial8250_port > 98658ae8f392 serial: 8250: Document uart_8250_port's ->dl_read/write() > b245aa0cc583 serial: 8250: Change dl_read/write to handle value as u32 > 9d86719f8769 serial: 8250: Allow using ports higher than SERIAL_8250_RUNTIME_UARTS > db86bb6ed497 serial: 8250: omap: Shut down on remove for console uart > fef4f600319e serial: 8250: omap: Fix life cycle issues for interrupt handlers > 398cecc24846 serial: 8250: omap: Fix imprecise external abort for omap_8250_pm() > b9ab22c2bc86 serial: 8250: omap: Fix freeing of resources on failed register > 6d8c1fca0a6e serial: pl011: set UART011_CR_RXE in pl011_set_termios after port shutdown > cf9aa72d2f91 tty: serial: fsl_lpuart: optimize the timer based EOP logic > > Baseline of tty queue for v6.5 was: > $git show --oneline --stat cf9aa72d2f91~ > ac9a78681b92 (tag: v6.4-rc1) Linux 6.4-rc1 > Makefile | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > Have also 1000+ runs on Yocto's v6.4.16 based kernel with no fails. On Mon, Oct 09, 2023 at 01:05:23PM +0300, Mikko Rapeli via lists.openembedded.org wrote: > Hi linux-serial and Greg, > > Yocto Linux distro maintainer Richard Purdie is seeing a regression or behavior > change after updating kernel from 6.4 to 6.5. Yocto runs a lot of automated tests with qemu > where a python test framework configures and spawns qemu (version 8.1) with two > serial ports and boots a newly built kernel and rootfs there. The python test framework > tries to detect getty login prompt from ttyS1 while boot time logging > from qemu, kernel, userspace etc happens on serial console ttyS0. This has worked > reliably for years and across multiple kernel versions. Now with 6.5 kernel there are > suddenly frequent failures where ttyS1 doesn't show the login prompt in the python code > which listens on the qemu socket from target system ttyS1. The target system ttyS0 is working > and boot time logs and getty prompt are visible there. The ttyS1 logs from test framework show > that only two characters CR and LF make it through. This same happens on x86_64 host > running qemu machine for x86_64 and arm64 targets. > > The kernel boot logs from target system with 6.5.5 don't show any errors and getty > processes are started correctly and they don't see any errors either. Data from > target system userspace is being written, but seems to sometimes hang inside kernel > or qemu serial port buffers, and only on the ttyS1 while ttyS0 works reliably all > the time. > > Example target logs with 6.5.5 via ttyS0: > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853 > or https://pastebin.com/raw/jRRa2CwW > > ttyS1 logs from the same run: > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2 > > Kernel config: > https://pastebin.com/raw/Lv9Hfeuh > > Do you have any ideas or hints how to debug or possibly fix this? > > tty layer has seen quite a few major changes between 6.4.14 and 6.5.5 and we're not able > to pinpoint or bisect the issue, since it so annoyingly rare. > > Some more details from Richard below. > > Cheers, > > -Mikko > > On Sun, Oct 08, 2023 at 09:23:57AM +0100, Richard Purdie wrote: > > On Sat, 2023-10-07 at 23:05 +0100, Richard Purdie via > > lists.openembedded.org wrote: > > > I thought I'd summarise where things are at with the 6.5 kernel. > > > > > > We've fixed: > > > * the ARM LTP OOM lockup (kernel patch) > > > * the locale ARM selftest failure which was OOM due to silly buffer > > > allocations in 6.5 (kernel commandline) > > > * the ARM jitterentropy errors (kernel patch) > > > * the cryptodev build failures (recipe updated) > > > > > > We've also: > > > * disabled the strace tests that fail with 6.5. > > > * made sure the serial ports and getty counts match > > > * added ttyrun which wraps serial consoles and avoids hacks > > > * made the qemurunner logging save all the port logs > > > * made the qemurunner write the binary data it is sent verbatim > > > * made sure to use nodelay on qemu's tcpserial > > > > > > This leaves an annoying serial console problem where ttyS1 never has > > > the getty login prompt appear. > > > > > > What we know: > > > > > > * We've only seen this on x86 more recently (yesterday/today) but have > > > seen it on ARM in the days before that. > > > > > > * It affects both sysvinit and systemd images. > > > > > > * Systemd does print that it started a getty on ttyS0 and ttyS1 when > > > the failure occurs > > > > > > * There is a getty running according to "ps" when the failure occurs > > > > > > * There are only ever one or three characters received to ttyS1 in the > > > failure case (0x0d and 0x0a chars, i.e. CR and LF) > > > > > > * It can't be any kind of utf-8 conversion issue since the login prompt > > > isn't visible in the binary log > > > > > > * the kernel boot logs do show the serial port created with the same > > > ioport and irq on x86. > > > > > > Previously we did see some logs with timing issues on the ttyS0 port > > > but the nodelay parameter may have helped with that. > > > > > > There are debug patches in master-next against qemurunner which try and > > > poke around to gather more debug when things fail using ttyS0. > > > > > > The best failure log we have is now this one: > > > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/79/builds/5874/steps/14/logs/stdio > > > > > > where I've saved the logs: > > > > > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853 > > > and > > > https://autobuilder.yocto.io/pub/failed-builds-data/6.5%20kernel/j/qemu_boot_log.20231007084853.2 > > > > > > You can see ttyS1 times out after 1000 seconds and the port only has a > > > single byte (in the .2 file). The other log shows ps output showing the > > > getty running for ttyS1. > > > > > > Ideas welcome on where from here. > > > > > > I've tweaked master-next to keep reading the ttyS1 port after we poke > > > it from ttyS0 to see if that reveals anything next time it fails (build > > > running). > > > > Testing overnight with the new debug yielded: > > > > https://autobuilder.yoctoproject.org/typhoon/#/builders/87/builds/5895/steps/14/logs/stdio > > > > The interesting bit being: > > > > """ > > WARNING: core-image-full-cmdline-1.0-r0 do_testimage: Extra read data: > > Poky (Yocto Project Reference Distro) 4.2+snapshot- > > 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1 > > > > qemux86-64 login: helloA > > > > Poky (Yocto Project Reference Distro) 4.2+snapshot- > > 7cb4ffbd8380b0509d7fac9191095379af321686 qemux86-64 ttyS1 > > qemux86-64 login: > > > > """ > > > > i.e. the getty didn't appear in 1000s but sometime in shutdown the > > original prompt, the "helloA" and the new getty prompt did. > > > > So the data *is* there but stuck in a buffer somehow. Kernel or qemu > > side, I don't know. > > > > Cheers, > > > > Richard > > > > > > > > > > -=-=-=-=-=-=-=-=-=-=-=- > Links: You receive all messages sent to this group. > View/Reply Online (#188831): https://lists.openembedded.org/g/openembedded-core/message/188831 > Mute This Topic: https://lists.openembedded.org/mt/101849211/7159507 > Group Owner: openembedded-core+owner@xxxxxxxxxxxxxxxxxxxxxx > Unsubscribe: https://lists.openembedded.org/g/openembedded-core/unsub [mikko.rapeli@xxxxxxxxxx] > -=-=-=-=-=-=-=-=-=-=-=- >