On Wed, Sep 14, 2016 at 7:43 PM, Martin Townsend <mtownsend1973@xxxxxxxxx> wrote:
> Hi Greg,
>
>
> On Wed, Sep 14, 2016 at 6:14 PM, Greg KH <greg@xxxxxxxxx> wrote:
>> On Wed, Sep 14, 2016 at 04:15:11PM +0100, Martin Townsend wrote:
>>> Hi,
>>>
>>> We are seeing a very strange problem here with our embedded AM4378
>>> (ARM Cortex A9) board. We have 2 UARTS configured as RS485 ports and
>>> very occasionally we are seeing a byte missing from a received message
>>> and it's always the 49th byte. By occasionally it could take around
>>> 8000 or more received messages before we see it.
>>>
>>> I can reproduce the problem by configuring both TTY ports the same
>>> (raw mode) using
>>> stty -F /dev/ttyS3 intr undef quit undef erase undef kill undef eof
>>> undef eol undef eol2 undef swtch undef start undef stop undef susp
>>> undef rprnt undef werase undef lnext undef discard undef \
>>> parenb -parodd -cmspar cs8 -hupcl -cstopb cread clocal -crtscts \
>>> -ignbrk -brkint -ignpar -parmrk inpck -istrip -inlcr -igncr -icrnl
>>> -ixon -ixoff -iuclc -ixany -imaxbel -iutf8 \
>>> -opost -olcuc -ocrnl -onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0
>>> bs0 vt0 ff0 \
>>> -isig -icanon -iexten -echo -echoe -echok -echonl -noflsh -xcase
>>> -tostop -echoprt -echoctl -echoke -flusho -extproc 115200
>>>
>>> and then loopback the 2 UARTs on the board and use cat and echo.
>>>
>>> Here's an example of the missing byte.
>>> ....
>>> 123456789012345678901234567890123456789012345678x0
>>> 123456789012345678901234567890123456789012345678x0
>>> 1234567890123456789012345678901234567890123456780
>>> 123456789012345678901234567890123456789012345678x0
>>> 123456789012345678901234567890123456789012345678x0
>>> ....
>>>
>>> By using the following stress-ng command
>>>
>>> stress-ng --cpu-load 50 --io 4 --vm 2 --vm-bytes 128M --fork 4
>>>
>>> It will happen more often but not straightaway you may have to leave
>>> it for a while and then it starts happening more frequently for a
>>> period and then it's fine again. Increasing the cpu-load didn't seem
>>> to make it happen more frequently but that is just from observation.
>>>
>>> The serial driver is 8250, as it's a TI CPU the driver is the OMAP version.
>>> The Linux kernel is 4.1 LTSI.
>>
>> Does this happen in 4.7? 4.8-rc?
>>
>> 4.1 is really old, loads of things have been fixed since then...
>
> I could build a newer Kernel but we have committed to 4.1 LTSI and I
> have seen that a lot has changed in the newer kernels (including RS485
> half duplex support) which looks like it won't back port too easily.
>>
>>> I'm at a loss on how to debug this, I tried putting printk's in the
>>> tty_buffer code to see if the actual message being passed in from the
>>> serial driver contained the full message to try and work out if the
>>> driver was the problem but I didn't get any output in dmesg.
>>>
>>> One thing I also noticed that maybe completely unrelated is that I
>>> accidentally turned on the echo on the receiving end
>>> stty -F /dev/ttyS2 echo
>>> and then transmitted a long string
>>> echo '123456789012345678901234567890123456789012345678x012345678' > /dev/ttyS3
>>> With this I always get
>>> 123456789012345678901234567890123456789012345678x
>>> The x is the 49th byte.
>>
>> echo and cat are not good things to use for serial ports, they do not
>> handle error handling properly or flow control. So you can't really
>> know much about this.
>
> A lot of people would agree with you on this :) cat and echo probably
> aren't the best tools but they were easiest to hand that could
> reproduce the failure. The original failure was detected using a
> Modbus RTU stack but it took ages to fire.
> I have managed to get
> printk working and have tracked the problem down to the driver so I
> should be able to find the problem now. If it's anything that I
> haven't done with my half duplex changes I'll report back but I
> suspect this is where the problem lies.
>
>>
>> thanks,
>>
>> greg k-h
>
> Cheers,
> Martin.

The logic analyser traces showed that the byte wasn't lost on
transmission, so it must be lost on reception. Looking through the OMAP
8250 driver I saw nothing obvious, but I did see that DMA is supported
(CONFIG_SERIAL_8250_DMA), so I enabled it and it has been running fine
under CPU load for a while now.

In case this is useful to anyone else who hits the same problem: for the
AM4378 device we are using, I had to force the OMAP_DMA_TX_KICK habit in
the probe function by commenting out the machine check:

    // if (of_machine_is_compatible("ti,am33xx"))
            priv->habit |= OMAP_DMA_TX_KICK;

You also have to add the relevant DMA properties to the device tree.
For me it was:

    &uart2 {
            ...
            dmas = <&edma 30 &edma 31>;
            dma-names = "tx", "rx";
    };

Cheers,
Martin.
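
Since cat and echo were flagged earlier in the thread as weak tools for this kind of test, here is a minimal sketch of a checker that could be run over captured lines to pinpoint a dropped byte. It is not code from the driver or from this thread: the function names first_mismatch and is_single_byte_drop are hypothetical, and it assumes each received line is already in memory as a NUL-terminated string alongside the line that was sent.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the 0-based index of the first byte at which `got` differs
 * from `expected`, or -1 if the two strings are identical.
 * A result of 48 therefore means "the 49th byte".
 */
long first_mismatch(const char *expected, const char *got)
{
    size_t i = 0;

    /* Walk both strings while they agree and `expected` has bytes left. */
    while (expected[i] != '\0' && expected[i] == got[i])
        i++;

    if (expected[i] == '\0' && got[i] == '\0')
        return -1;              /* same length, same content */

    return (long)i;
}

/*
 * Return 1 if `got` is `expected` with exactly one byte removed
 * (the pattern reported in this thread), otherwise 0.
 */
int is_single_byte_drop(const char *expected, const char *got)
{
    long idx = first_mismatch(expected, got);

    if (idx < 0)
        return 0;               /* nothing is missing */
    if (expected[idx] == '\0')
        return 0;               /* `got` is longer, not shorter */

    /* Skip the suspect byte in `expected`; the tails must then match. */
    return strcmp(expected + idx + 1, got + idx) == 0;
}
```

Fed the good and bad lines from the example in the original report, first_mismatch returns 48 (the 49th byte) and is_single_byte_drop returns 1. Logging that index for every failing message would show whether the loss really is always at the same offset.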