Re: Ingenic X SoC cache problems

Brief update on the current situation.

I have been working with Professor Zhou (zhouyanjie@xxxxxxxxxxxxxx) to diagnose this problem over the last two days. After multiple tests, we finally pinpointed the cause: the SLOB allocator.

Using an unpatched 5.18 kernel under stress test, all DMA operations worked flawlessly with SLAB or SLUB, but not with SLOB.

The stress tests were conducted on the CU1000-Neo board. The DMA-enabled components were SPI and MSC (using PDMA), and dwc2 and SFC (which are bus masters). In addition, all of the memory and kernel data structure debugging options were enabled in order to catch silent memory corruption. The test consists of running these operations concurrently:
1. Continuously reading the eMMC storage via MSC0 (while :; do dd if=/dev/mmcblk0 of=/dev/null bs=1M 2> /dev/null; sleep 1; done&)
2. Continuously reading the SPI NOR storage via SFC (while :; do dd if=/dev/mtdblock0 of=/dev/null bs=1M 2> /dev/null; sleep 1; done&)
3. Continuously refreshing a ST7789V SPI LCD using the fb_tft driver (while :; do cat /dev/urandom > /dev/fb0 2> /dev/null; sleep 0.2; done&)
4. Enable the USB CDC ACM gadget and continuously transfer a large amount of data (PC side: cat /dev/urandom > /dev/ttyACM0) (X1000 side: cat /dev/ttyGS0 > /dev/null)
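
For convenience, the X1000-side loads above could be bundled into one script. This is only a sketch: the function names (read_once, read_loop, lcd_loop, start_stress) and the "run" argument are made up for illustration, while the device paths and commands are the ones from the list. The PC side of step 4 (cat /dev/urandom > /dev/ttyACM0) still has to be started separately on the host.

```shell
#!/bin/sh
# Sketch only: bundles the X1000-side loads from steps 1-4.
# Function names are hypothetical; device paths come from the list above.

# Read SRC end to end once, discarding the data and any dd diagnostics.
read_once() {
    dd if="$1" of=/dev/null bs=1M 2>/dev/null
}

# Repeat read_once on SRC forever, pausing 1 s between passes.
read_loop() {
    while :; do
        read_once "$1"
        sleep 1
    done
}

# Fill the framebuffer with random data forever, pausing 0.2 s between frames.
lcd_loop() {
    while :; do
        cat /dev/urandom > "$1" 2>/dev/null
        sleep 0.2
    done
}

# Start the three background loads, then drain the gadget serial port.
start_stress() {
    read_loop /dev/mmcblk0 &      # step 1: eMMC via MSC0
    read_loop /dev/mtdblock0 &    # step 2: SPI NOR via SFC
    lcd_loop /dev/fb0 &           # step 3: ST7789V via fb_tft
    cat /dev/ttyGS0 > /dev/null   # step 4: X1000 side of the CDC ACM transfer
}

# Nothing is started unless the script is invoked with "run" as its first argument.
if [ "${1:-}" = "run" ]; then
    start_stress
fi
```

Invoked as "sh stress.sh run" (the file name is arbitrary), it keeps all four loads active until interrupted.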

With SLAB or SLUB, the X1000 survived these tests for more than 30 minutes. No silent corruptions were reported by the kernel.

With SLOB, it dies immediately during boot (before init). Sometimes it's a linked list corruption, sometimes it's a NULL pointer dereference, and sometimes it simply goes silent.
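
For anyone trying to reproduce the SLOB case: the allocator is a build-time choice, so a config fragment along the following lines should select it on 5.18. The two debug options are shown only as examples of list and scatterlist checking; the exact set of debugging options enabled in these tests is not listed in this mail, so treat them as an assumption.

```
# Allocator under test: exactly one of SLAB/SLUB/SLOB is selected.
# SLOB is only offered when CONFIG_EXPERT is enabled.
CONFIG_EXPERT=y
# CONFIG_SLAB is not set
# CONFIG_SLUB is not set
CONFIG_SLOB=y

# Example data structure debugging options (assumed, not confirmed above)
CONFIG_DEBUG_LIST=y
CONFIG_DEBUG_SG=y
```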

I have always used SLOB on devices with little RAM, assuming it would be beneficial. I never expected it to be a problem.

Should this be forwarded to the linux-mm mailing list?

Thanks and best regards!

On 5/27/22 19:03, Yunian Yang wrote:
> Hello all.
> 
> Over the past month, I have been struggling with random memory corruption and crashes on the Ingenic X1000. After some detailed testing, I need to point out that the current cache management routines seem to be incorrect for the X1000, and maybe for all X series SoCs. It mainly affects DMA operations. Every form of peripheral-to-RAM transfer will corrupt the RAM, and this includes the DMA of dwc2 and SFC as well as the PDMA controller. If all DMA is disabled (e.g. by hard-coding dma_capable = false in dwc2), the system runs CPU and I/O benchmarks for a week without problems. If you have the hardware, you can enable the kernel data structure and memory debugging options and see for yourself.
> 
> So I went back and looked at Ingenic's old 4.4 and 3.10 kernel sources. They used a separate file (sc-xburst.c) for the cache routines, which is based on a very old sc-mips.c. There are two important macros, MIPS_CACHE_SYNC_WAR and MIPS_BRIDGE_SYNC_WAR, and both are set to 1. However, these macros were removed from the kernel a long time ago. The line `mips_sc_ops.bc_wback_inv = mips_bridge_sync_war;' seems to be the key point.
> 
> Do you have any recommendations of what could be done to fix this problem?
> 
> Thanks and best regards!



