On Thu, Mar 17, 2022 at 9:17 AM Robin Murphy <robin.murphy@xxxxxxx> wrote:
>
> On 2022-03-17 12:26, Peter Geis wrote:
> > On Thu, Mar 17, 2022 at 6:37 AM Robin Murphy <robin.murphy@xxxxxxx> wrote:
> >>
> >> On 2022-03-17 00:14, Peter Geis wrote:
> >>> Good Evening, I've added the Designware driver maintainers, since the
> >>> Rockchip host driver uses the dwc driver.
> >>>
> >>> I apologize for raising this email chain from the dead, but there have
> >>> been some developments that have introduced even more questions.
> >>> I've looped the Rockchip mailing list into this too, as this affects
> >>> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
> >>>
> >>> TLDR for those not familiar: It seems the rk356x series (and possibly
> >>> the rk3588) were built without any outer coherent cache.
> >>> This means (unless Rockchip wants to clarify here) devices such as the
> >>> ITS and PCIe cannot utilize cache snooping.
> >>> This is based on the results of the email chain [2].
> >>>
> >>> The new circumstances are as follows:
> >>> The RPi CM4 Adventure Team as I've taken to calling them has been
> >>> attempting to get a dGPU working with the very broken Broadcom
> >>> controller in the RPi CM4.
> >>> Recently they acquired a SoQuartz rk3566 module which is pin
> >>> compatible with the CM4, and have taken to trying it out as well.
> >>>
> >>> This is how I got involved.
> >>> It seems they found a trivial way to force the Radeon R600 driver to
> >>> use Non-Cached memory for everything.
> >>> This single line change, combined with using memset_io instead of
> >>> memset, allows the ring tests to pass and the card probes successfully
> >>> (minus the DMA limitations of the rk356x due to the 32 bit
> >>> interconnect).
> >>> I discovered using this method that we start having unaligned io
> >>> memory access faults (bus errors) when running glmark2-drm (running
> >>> glmark2 directly was impossible, as both X and Wayland crashed too
> >>> early).
> >>> I traced this to using what I thought at the time was an unsafe memcpy
> >>> in the mesa stack.
> >>> Rewriting this function to force aligned writes solved the problem and
> >>> allows glmark2-drm to run to completion.
> >>> With some extensive debugging, I found about half a dozen memcpy
> >>> functions in mesa that if forced to be aligned would allow Wayland to
> >>> start, but with hilarious display corruption (see [3], [4]).
> >>> The CM4 team is convinced this is an issue with memcpy in glibc, but
> >>> I'm not convinced it's that simple.
> >>>
> >>> On my two hour drive in to work this morning, I got to thinking.
> >>> If this was a memcpy fault, this would be universally broken on arm64
> >>> which is obviously not the case.
> >>> So I started thinking, what is different here than with systems known to work:
> >>> 1. No IOMMU for the PCIe controller.
> >>> 2. The Outer Cache Issue.
> >>>
> >>> Robin:
> >>> My questions for you, since you're the smartest person I know about
> >>> arm64 memory management:
> >>> Could cache snooping permit unaligned accesses to IO to be safe?
> >>
> >> No.
> >>
> >>> Or
> >>> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> >>
> >> No.
> >>
> >>> Or
> >>> Am I insane here?
> >>
> >> No. (probably)
> >>
> >> CPU access to PCIe has nothing to do with PCIe's access to memory. From
> >> what you've described, my guess is that a GPU BAR gets put in a
> >> non-prefetchable window, such that it ends up mapped as Device memory
> >> (whereas if it were prefetchable it would be Normal Non-Cacheable).
> >
> > Okay, this is perfect and I think you just put me on the right track
> > for identifying the exact issue. Thanks!
> >
> > I've sliced up the non-prefetchable window and given it a prefetchable window.
> > The 256MB BAR now resides in that window.
> > However I'm still getting bus errors, so it seems the prefetch isn't
> > actually happening.
>
> Note that "prefetchable" really just means "no side-effects on reads",
> i.e. we can map it with a Normal memory type that technically *allows*
> the CPU to make speculative accesses because they will not be harmful,
> but that's not to say the CPU will do so. Just that if it did, you
> wouldn't notice anyway.
>
> It's entirely possible that the PCIe IP itself doesn't like unaligned
> accesses, so changing the memory type just moves you from an alignment
> fault to an external abort.

Okay, I've tried setting up PL_COHERENCY_CONTROL_3_OFF, where AxCACHE
can be forced from auto to predefined for reads and writes.
As I understand it, the cache bit should permit characteristic mismatch
to be accepted and prefetch to be enabled, when combined with the
read/write bits.
It doesn't seem to make a difference however.
I got the idea to look for this from the Armada8K and Tegra drivers.

It would be nice to know if dGPUs work at all on *any* DWC based PCIe
controllers.
We could use those as a starting point to find out what's broken here.

>
> > The difference is now the GPU realizes that an error has happened and
> > initiates recovery, vice before where it seemed to be clueless.
> > If I understand everything correctly, that's because before the bus
> > error was raised by the CPU due to the memory flag, vice now where
> > it's actually the bus raising the alarm.
> >
> > My next question, is this something the driver should set and isn't,
> > or is it just because of the broken cache coherency?
>
> The general rule for userspace mmap()ing PCIe-attached memory and
> handing it off to glibc or anyone else who might assume it's regular
> system RAM is "don't do that". If it's not access size or alignment that
> falls over, it could be atomic operations, MTE tags, or any other
> new-fangled memory innovation.
> For the ultimate dream of just plugging
> in a card full of RAM, you either need to look back to ISA or forward to
> CXL ;)

So either go back to the really old way of doing things, find and fix
the underlying problem, or wait for the IP to catch up?

>
> Robin.

Thanks!
Peter
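
For anyone following along who wants to see what "forcing aligned writes"
means in practice, here is a minimal sketch in C. It is not the actual mesa
change; copy_to_io_aligned is a hypothetical helper name, and the choice of
32-bit stores is an assumption made for illustration. It only addresses
alignment of the stores to the destination, so per Robin's point it does not
make a userspace BAR mapping safe against the other failure modes (access
size limits, atomics, and so on).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from mesa, glibc, or the kernel): copy len bytes
 * from ordinary RAM to a destination where every store must be naturally
 * aligned. Assumes the destination tolerates byte and 32-bit stores.
 */
static void copy_to_io_aligned(volatile void *dst, const void *src, size_t len)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;

    /* Byte stores until the destination pointer is 4-byte aligned. */
    while (len > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        len--;
    }

    /*
     * Aligned 32-bit stores to the destination. The source may still be
     * unaligned, so each word is loaded via memcpy from normal memory.
     */
    while (len >= 4) {
        uint32_t word;
        memcpy(&word, s, sizeof(word));
        *(volatile uint32_t *)d = word;
        d += 4;
        s += 4;
        len -= 4;
    }

    /* Trailing bytes that do not fill a whole word. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
}
```

The volatile qualifier keeps the compiler from merging or widening the
stores, which is the property a plain memcpy does not give you. Inside the
kernel the equivalent job is done by memcpy_toio on an ioremap'd region.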