On Wed, Mar 16, 2022 at 8:14 PM Peter Geis <pgwipeout@xxxxxxxxx> wrote:
>
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) was built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team, as I've taken to calling them, has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module, which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card to probe successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> Using this method, I discovered that we start having unaligned io
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that, if forced to be aligned, would allow Wayland to
> start, but with hilarious display corruption (see [3], [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.

Another similar datapoint for reference:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/3274

Alex

>
> On my two hour drive in to work this morning, I got to thinking.
> If this was a memcpy fault, this would be universally broken on arm64,
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?
> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> Or
> Am I insane here?
>
> Rockchip:
> Please update on the status for the Outer Cache errata for ITS services.
> Please provide an answer to the errata of the PCIe controller, in
> regard to cache snooping and buffering, for both the rk356x and the
> upcoming rk3588.
>
> [1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
> [2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@xxxxxxxxxx/T/
> [3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
> [4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png
>
> Thank you everyone for your time.
>
> Very Respectfully,
> Peter Geis
>
> On Wed, May 26, 2021 at 7:21 AM Christian König
> <christian.koenig@xxxxxxx> wrote:
> >
> > Hi Robin,
> >
> > On 26.05.21 at 12:59, Robin Murphy wrote:
> > > On 2021-05-26 10:42, Christian König wrote:
> > >> Hi Robin,
> > >>
> > >> On 25.05.21 at 22:09, Robin Murphy wrote:
> > >>> On 2021-05-25 14:05, Alex Deucher wrote:
> > >>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout@xxxxxxxxx>
> > >>>> wrote:
> > >>>>>
> > >>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
> > >>>>> <alexdeucher@xxxxxxxxx> wrote:
> > >>>>>>
> > >>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout@xxxxxxxxx>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> Good Evening,
> > >>>>>>>
> > >>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
> > >>>>>>> prototype SBC.
> > >>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
> > >>>>>>> controller, which makes a dGPU theoretically possible.
> > >>>>>>> While attempting to light off a HD7570 card I manage to get a
> > >>>>>>> modeset console, but ring0 test fails and disables acceleration.
> > >>>>>>>
> > >>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
> > >>>>>>> kernel.
> > >>>>>>> Any insight you can provide would be much appreciated.
> > >>>>>>
> > >>>>>> Does your platform support PCIe cache coherency with the CPU? I.e.,
> > >>>>>> does the CPU allow cache snoops from PCIe devices? That is required
> > >>>>>> for the driver to operate.
> > >>>>>
> > >>>>> Ah, most likely not.
> > >>>>> This issue has come up already as the GIC isn't permitted to snoop on
> > >>>>> the CPUs, so I doubt the PCIe controller can either.
> > >>>>>
> > >>>>> Is there no way to work around this or is it dead in the water?
> > >>>>
> > >>>> It's required by the pcie spec. You could potentially work around it
> > >>>> if you can allocate uncached memory for DMA, but I don't think that is
> > >>>> possible currently. Ideally we'd figure out some way to detect if a
> > >>>> particular platform supports cache snooping or not as well.
> > >>>
> > >>> There's device_get_dma_attr(), although I don't think it will work
> > >>> currently for PCI devices without an OF or ACPI node - we could
> > >>> perhaps do with a PCI-specific wrapper which can walk up and defer
> > >>> to the host bridge's firmware description as necessary.
> > >>>
> > >>> The common DMA ops *do* correctly keep track of per-device coherency
> > >>> internally, but drivers aren't supposed to be poking at that
> > >>> information directly.
> > >>
> > >> That sounds like you underestimate the problem. ARM has unfortunately
> > >> made the coherency for PCI an optional IP.
> > >
> > > Sorry to be that guy, but I'm involved a lot internally with our
> > > system IP and interconnect, and I probably understand the situation
> > > better than 99% of the community ;)
> >
> > I need to apologize, I didn't realize who was answering :)
> >
> > It just sounded to me that you wanted to suggest to the end user that
> > this is fixable in software and I really wanted to avoid even more
> > customers coming around asking how to do this.
> >
> > > For the record, the SBSA specification (the closest thing we have to a
> > > "system architecture") does require that PCIe is integrated in an
> > > I/O-coherent manner, but we don't have any control over what people do
> > > in embedded applications (note that we don't make PCIe IP at all, and
> > > there is plenty of 3rd-party interconnect IP).
> >
> > So basically it is not the fault of the ARM IP-core, but people are just
> > stitching together PCIe interconnect IP with a core it is not
> > supposed to be used with.
> >
> > Do I get that correctly? That's an interesting puzzle piece in the picture.
> >
> > >> So we are talking about a hardware limitation which potentially can't
> > >> be fixed without replacing the hardware.
> > >
> > > You expressed interest in "some way to detect if a particular platform
> > > supports cache snooping or not", by which I assumed you meant a
> > > software method for the amdgpu/radeon drivers to call, rather than,
> > > say, a website that driver maintainers can look up SoC names on. I'm
> > > saying that that API already exists (just may need a bit more work).
> > > Note that it is emphatically not a platform-level thing since
> > > coherency can and does vary per device within a system.
> >
> > Well, I think this is not something an individual driver should mess
> > with. What the driver should do is just express that it needs coherent
> > access to all of system memory and if that is not possible fail to load
> > with a warning why it is not possible.
> >
> > >
> > > I wasn't suggesting that Linux could somehow make coherency magically
> > > work when the signals don't physically exist in the interconnect - I
> > > was assuming you'd merely want to do something like throw a big
> > > warning and taint the kernel to help triage bug reports. Some drivers
> > > like ahci_qoriq and panfrost simply need to know so they can program
> > > their device to emit the appropriate memory attributes either way, and
> > > rely on the DMA API to hide the rest of the difference, but if you
> > > want to treat non-coherent use as unsupported because it would require
> > > too invasive changes that's fine by me.
> >
> > Yes exactly that please. I mean not sure how panfrost is doing it, but
> > at least the Vulkan userspace API specification requires devices to have
> > coherent access to system memory.
> >
> > So even if I would want to do this it is simply not possible because the
> > application doesn't tell the driver which memory is accessed by the
> > device and which by the CPU.
> >
> > Christian.
> >
> > >
> > > Robin.
> >
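
For readers following the "force aligned writes" workaround described at the top of this thread, here is a minimal userspace sketch of what such a copy routine might look like. It is not the actual mesa or glibc change, which the thread does not show. The names copy_aligned_stores() and copy_aligned_stores_demo() are hypothetical, and the sketch assumes the faults come from unaligned stores into a device-type mapping, so it only ever issues naturally aligned 32-bit stores plus single-byte stores at the edges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from mesa, glibc, or the kernel): copy n bytes
 * to dst using only byte stores and naturally aligned 32-bit stores.
 * Assumption: dst may point at memory where unaligned stores fault.
 */
void copy_aligned_stores(volatile void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;

    /* Byte stores until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Aligned 32-bit stores. The source may still be unaligned, so the
     * word is assembled with memcpy from ordinary memory first. */
    while (n >= 4) {
        uint32_t word;
        memcpy(&word, s, sizeof(word));
        *(volatile uint32_t *)d = word;
        d += 4;
        s += 4;
        n -= 4;
    }

    /* Trailing bytes that do not fill a whole word. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}

/* Usage sketch: copy to an odd offset inside a word-aligned buffer that
 * stands in for a mapped PCIe BAR. Returns 0 on success. */
int copy_aligned_stores_demo(void)
{
    uint32_t backing[4] = {0};
    uint8_t *buf = (uint8_t *)backing;
    const uint8_t src[7] = {1, 2, 3, 4, 5, 6, 7};

    copy_aligned_stores(buf + 1, src, sizeof(src));
    assert(memcmp(buf + 1, src, sizeof(src)) == 0);
    return 0;
}
```

A routine like this only avoids misaligned stores. It does nothing about the missing cache snooping discussed above, which is consistent with Wayland starting but still showing display corruption.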