Re: Kernel 6.5 ttyS1 hang with qemu (was Re: [OE-core] Summary of the remaining 6.5 kernel serial issue (and 6.5 summary))

Richard Purdie <richard.purdie@xxxxxxxxxxxxxxxxxxx> · Mon, 16 Oct 2023 09:10:04 +0100

On Mon, 2023-10-16 at 10:23 +0300, Tony Lindgren wrote:
> * Mikko Rapeli <mikko.rapeli@xxxxxxxxxx> [231016 07:16]:
> > Hi,
> > 
> > On Mon, Oct 16, 2023 at 09:35:01AM +0300, Tony Lindgren wrote:
> > > * Richard Purdie <richard.purdie@xxxxxxxxxxxxxxxxxxx> [231015 21:30]:
> > > > On Sun, 2023-10-15 at 17:31 +0200, Greg Kroah-Hartman wrote:
> > > > > Can you try the patch below?  I just sent it to Linus and it's from Tony
> > > > > to resolve some other pm issues with the serial port code.
> > > > 
> > > > Thanks for the pointer to this. I've put it through some testing and
> > > > had one failure so far so I suspect this isn't enough unfortunately.
> > > > 
> > > > FWIW I was looping the testing on the complete removal of the
> > > > conditions and didn't see any failures with that.
> > > 
> > > Care to clarify what's the failing test now?

The failure is that not all of the data makes it through ttyS1, so the
login prompt never appears. In our CI it looks like this:

https://autobuilder.yoctoproject.org/typhoon/#/builders/145/builds/711/steps/12/logs/stdio

Click the magnifying glass to make the log searchable, then search for
"Target didn't reach login banner in 1000 seconds". You'll then see the
echo helloB to /dev/ttyS1, followed by the "Extra log data read:"
section containing the output of the ttyS1 getty, which woke up.

> > > 
> > > Is the issue still the second port not always coming up after boot or
> > > something else?
> > 
> > Yes, data from the ttyS1 getty is not coming through from kernel and qemu to
> > the test framework looking for login prompt after qemu machine boot.
> > Workarounds like sending "\n\n" from the test framework through qemu to ttyS1
> > or "echo helloB > /dev/ttyS1" via working ttyS0 don't seem to help and wake
> > it up.
> 
> OK so for trying to reproduce this with qemu, is this with the default uarts
> or with some -device pci-serial-2x type options?

The port sometimes doesn't come up properly at boot.

To be clear, the "\n\n" from the qemu side into the port doesn't seem
to help. The "echo helloB > /dev/ttyS1" inside the image does seem to
wake it up. 

The qemu command we're using is:

qemu-system-x86_64 -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:02 \
    -serial tcp:127.0.0.1:50421 \
    -serial tcp:127.0.0.1:46457 \
    -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
    -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 \
    -drive file=./core-image-minimal-qemux86-64.rootfs.ext4.19673,if=virtio,format=raw -usb -device usb-tablet -usb -device usb-kbd \
    -cpu IvyBridge -machine q35,i8042=off -smp 4 -enable-kvm -m 256 \
    -pidfile /media/build/poky/aa/pidfile_19670 -S -qmp unix:./.rbmorp7r,server,wait -qmp unix:./.kf6y7yqg,server,nowait -nographic \
    -kernel /media/build/poky/build/tmp/deploy/images/qemux86-64/bzImage \
    -append 'root=/dev/vda rw  ip=192.168.7.2::192.168.7.1:255.255.255.0::eth0:off:8.8.8.8 net.ifnames=0 console=ttyS0 console=ttyS1 oprofile.timer=1 tsc=reliable no_timer_check rcupdate.rcu_expedited=1 swiotlb=0  printk.time=1'

This is with qemu 8.1.0. The image we're testing with usually doesn't
have an ssh server, so we use the serial ports for control/testing
rather than networking, even when networking is configured.

We use the serial ports over the tcp connections to handle the multiple
ports and have python code for that. I did extract that code into a
more standalone form, https://www.rpsys.net/wp/rp/simpleqemu.tgz where
"./commands.py" will then run a boot and wait for the login prompt.
You'd need to set a path in commands.py to point at an images directory
of a qemux86-64 core-image-minimal OE image build.
Setting runqemuparams = 'nographic' gets rid of the graphics need.
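
As a rough illustration of the approach (this is NOT the code from
simpleqemu.tgz), the sketch below shows the harness side of one serial
port. It relies on qemu's "-serial tcp:HOST:PORT" connecting out as a
client, so the harness listens and then reads until the login banner
shows up. The function names, the b"login:" banner and the timeout
values are assumptions made up for this example.

```python
import select
import socket
import time


def listen_for_qemu(host="127.0.0.1", port=0):
    """Open a listening socket for qemu's '-serial tcp:HOST:PORT' to connect to.

    port=0 lets the OS pick a free port; the chosen port is returned so it
    can be put on the qemu command line. (Hypothetical helper.)
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server, server.getsockname()[1]


def wait_for_banner(conn, banner=b"login:", timeout=10.0):
    """Read from conn until banner is seen, the peer closes, or timeout expires.

    Returns (found, data), where data is everything read so far, so a
    truncated log can still be inspected on failure.
    """
    deadline = time.monotonic() + timeout
    data = b""
    while banner not in data:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False, data
        readable, _, _ = select.select([conn], [], [], remaining)
        if not readable:
            return False, data  # timed out waiting for more output
        chunk = conn.recv(4096)
        if not chunk:
            return False, data  # peer closed the connection
        data += chunk
    return True, data
```

In use, one such listener would be created per "-serial" option before
qemu starts, and server.accept() would hand back the connection passed
to wait_for_banner().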

If you don't have a build, I shared a prebuilt image and hacked config
which you could point it at:
https://www.rpsys.net/wp/rp/simpleqemu-images.tgz

I did have to remove some of the qemu cpu options (the q35 machine) to
make it work with the older qemu versions often found on distros, which
don't support them. It will dump log files into the current directory
and there will be a log for each serial port.

I hacked the script to append sys.argv[1] to the log filename, then
experimented with a command like:

for i in `seq 1 88`; do ./commands.py $i & done

which launches 88 qemus in parallel. Sometimes some of them "hang",
which you can see from the size of the ".2" logfile:

$ ls *.2 -la | cut -d ' ' -f 5 | sort | uniq -c
      1 134
      1 249
      1 251
      2 254
      1 255
     51 273

273 is the correct size; the smaller ones are truncated.
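
The same check can be done without depending on the column layout of
"ls -la". The following is a minimal sketch, assuming the logs match
"*.2" in the current directory and that 273 bytes is the complete size
as stated above; the function names are made up for this example.

```python
import collections
import glob
import os


def size_histogram(pattern="*.2"):
    """Map file size in bytes -> number of files matching pattern."""
    return collections.Counter(os.path.getsize(p) for p in glob.glob(pattern))


def truncated_logs(pattern="*.2", expected_size=273):
    """Return the sorted paths whose size is below expected_size."""
    return sorted(
        p for p in glob.glob(pattern) if os.path.getsize(p) < expected_size
    )


if __name__ == "__main__":
    # Print "count size" pairs, smallest size first, like 'sort | uniq -c'.
    for size, count in sorted(size_histogram().items()):
        print(f"{count:7d} {size}")
    for path in truncated_logs():
        print("truncated:", path)
```

truncated_logs() also names the affected runs, which makes it easier to
look at the matching ttyS0 log for the same qemu instance.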

Sadly, in some cases this test appears to pass even though the issue is
still present, with the issue only showing up intermittently in our CI.
It does seem to be timing dependent.

Let me know if I can provide any other info.

Cheers,

Richard