On Thu, Jun 08, 2023 at 04:17:52PM +0800, Sui Jingfeng wrote:
> Hi,
> 
> On 2023/6/8 15:39, Maxime Ripard wrote:
> > On Thu, Jun 08, 2023 at 01:18:38AM +0800, Sui Jingfeng wrote:
> > > Hi,
> > > 
> > > On 2023/6/8 00:12, Paul Cercueil wrote:
> > > > Hi Sui,
> > > > 
> > > > Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit :
> > > > > Hi, welcome to the discussion.
> > > > > 
> > > > > My English is limited, so in the short time available I may not
> > > > > express exactly what I mean, and some of my wording may not be
> > > > > as accurate as yours. Please don't misunderstand me; I am not
> > > > > being rude to you.
> > > > 
> > > > No problem.
> > > > 
> > > > > I will explain it in more detail. See below:
> > > > > 
> > > > > On 2023/6/7 20:09, Paul Cercueil wrote:
> > > > > > Hi Sui,
> > > > > > 
> > > > > > Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > On 2023/6/7 17:36, Paul Cercueil wrote:
> > > > > > > > Hi Sui,
> > > > > > > > 
> > > > > > > > Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
> > > > > > > > > The single map_noncoherent member of struct
> > > > > > > > > drm_gem_dma_object may not be sufficient to describe the
> > > > > > > > > backing memory of a GEM buffer object.
> > > > > > > > > 
> > > > > > > > > In particular, on dma-coherent systems the backing
> > > > > > > > > memory is both cache-coherent for multi-core CPUs and
> > > > > > > > > dma-coherent for peripheral devices; architectures such
> > > > > > > > > as X86-64, LoongArch64 and Loongson Mips64 behave this
> > > > > > > > > way.
> > > > > > > > > 
> > > > > > > > > Whether a peripheral device is dma-coherent or not can
> > > > > > > > > be implementation-dependent, so the single
> > > > > > > > > map_noncoherent option is no longer enough to reflect
> > > > > > > > > real hardware. For example, on the Loongson LS3A4000 CPU
> > > > > > > > > and the LS2K2000/LS2K1000 SoCs, peripheral devices
> > > > > > > > > always snoop the CPU's cache, so allocating with
> > > > > > > > > dma_alloc_coherent() is preferred. The returned buffer
> > > > > > > > > is cached and should not use the default write-combine
> > > > > > > > > mapping, yet with the current implementation there is no
> > > > > > > > > way to tell the DRM core about this.
> > > > > > > > > 
> > > > > > > > > This patch adds cached and coherent members to struct
> > > > > > > > > drm_gem_dma_object, which allow driver implementations
> > > > > > > > > to inform the core, introducing new mappings while
> > > > > > > > > keeping the original default behavior unchanged.
> > > > > > > > 
> > > > > > > > Did you try to simply set the "dma-coherent" property in
> > > > > > > > the device's node?
> > > > > > > 
> > > > > > > But this approach can only be applied to device drivers
> > > > > > > with DT support.
> > > > > > > 
> > > > > > > X86-64, Loongson LS3A4000 mips64 and Loongson LS3A5000 CPUs
> > > > > > > typically do not have DT support; they use ACPI to pass
> > > > > > > parameters from the firmware to the Linux kernel.
> > > > > > > 
> > > > > > > Your approach will lose its effectiveness in such a case.
> > > > > > 
> > > > > > Well, I don't really know how ACPI handles it - but it should
> > > > > > just be a matter of setting dev->dma_coherent. That's
> > > > > > basically what the DT code does.
> > > > > > 
> > > > > > Some MIPS boards set it in their setup code for instance.
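[Editor's illustration, not part of the original mail: a minimal sketch
of what such platform setup code can look like, loosely modelled on the
MIPS boards mentioned above. It assumes the architecture provides the
dma_default_coherent flag; the function name is only an example.]

    #include <linux/dma-map-ops.h>
    #include <linux/init.h>

    /*
     * Platform boot code declaring that every device on this board
     * snoops the CPU caches, so DMA mappings default to coherent and
     * no per-node DT "dma-coherent" property is needed.
     */
    void __init plat_mem_setup(void)
    {
            dma_default_coherent = true;
    }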
> > > > > This is a *strategy*, not a *mechanism*.
> > > > > 
> > > > > In this case, DT is only used to describe the hardware. (It is
> > > > > really a hardware-description language, and its granularity is
> > > > > coarse.) It does not change the state of the hardware; it is
> > > > > your platform firmware or kernel setup code that actually does
> > > > > such things.
> > > > > 
> > > > > Just because it works on *one* platform does not guarantee it
> > > > > will work on others.
> > > > 
> > > > If you add the "dma-coherent" property to a device node in DT,
> > > > you effectively specify that the device is DMA-coherent; so you
> > > > describe the hardware, which is what DT is for, and you are not
> > > > changing the state of the hardware.
> > > > 
> > > > Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
> > > > default to DMA-coherent mapping; I believe you could do something
> > > > similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.
> > > 
> > > The problem is that a device driver can have various demands.
> > > 
> > > It may want to create different kinds of buffers for different
> > > things simultaneously: say, one allocated with dma_alloc_coherent()
> > > for a command buffer or DMA descriptors, another allocated with
> > > dma_alloc_wc() for uploading shaders, and a third allocated with
> > > dma_alloc_noncoherent() for something else.
> > 
> > And it will work just fine.
> > 
> > struct device's dma_coherent, or DT's dma-coherent property, defines
> > that the device doesn't need any kind of cache maintenance, ever. If
> > it's missing, we need to perform cache maintenance to keep coherency.
> > 
> > dma_alloc_* functions provide guarantees to the driver. With
> > dma_alloc_wc and dma_alloc_coherent, the buffer is coherent, and
> > thus you don't need to perform cache maintenance operations by hand
> > in the driver.
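[Editor's illustration, not part of the original mail: a hedged sketch
of the three allocation flavours discussed above and the contract each
one gives the driver. The device pointer, sizes and variable names are
hypothetical, and error handling is elided.]

    #include <linux/dma-mapping.h>
    #include <linux/sizes.h>

    static void example_alloc_buffers(struct device *dev)
    {
            dma_addr_t ring_dma, shader_dma, scratch_dma;
            void *ring, *shader, *scratch;

            /* Coherent: the driver never performs cache maintenance. */
            ring = dma_alloc_coherent(dev, SZ_4K, &ring_dma, GFP_KERNEL);

            /* Write-combined: still coherent as far as the driver is
             * concerned, just mapped write-combined on the CPU side. */
            shader = dma_alloc_wc(dev, SZ_64K, &shader_dma, GFP_KERNEL);

            /* Non-coherent: the driver owns the cache maintenance and
             * must bracket accesses with the dma_sync_*() helpers. */
            scratch = dma_alloc_noncoherent(dev, SZ_4K, &scratch_dma,
                                            DMA_BIDIRECTIONAL,
                                            GFP_KERNEL);
    }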
> BO returned by dma_alloc_wc() doesn't work on some platforms.

What do you mean by "it doesn't work"? Because then, it's definitely
the platform that's broken if it doesn't provide a coherent buffer.

> This may only be guaranteed on the CPU side; there is no guarantee on
> the GPU side.
> 
> For example, the GPU always snoops the CPU's cache and fetches data
> from the CPU's cache on a hit; on a miss, the GPU fetches the data
> from system RAM.

Right, that's still a coherent buffer.

> But when dma_alloc_wc() is called, the BO is mapped write-combined on
> the CPU side, and the CPU's write buffer gathers the CPU-side writes.
> That is to say, some data may still reside (stall) in the write
> buffer while the GPU fetches from system RAM or the CPU's cache, so
> the GPU will fetch wrong data.

If that's the case, your buffer isn't coherent, and your platform
probably breaks the expectations of the DMA API.

> This is the situation on our hardware; I don't know how the ARM
> platform guarantees coherency in this case.
> 
> If it relies on software to guarantee it, then it is still not
> hardware-maintained coherency. When it relies on software, I call
> that implementation-dependent; some architectures lack such an
> implementation, or don't know how to implement it.
> 
> If the hardware can't even snoop the CPU's cache, I don't believe it
> can snoop the CPU's write buffer. I'm not sure the DMA API can
> provide that guarantee on every architecture.
> 
> > With dma_alloc_noncoherent, the buffer is non-coherent and the
> > driver needs to perform cache maintenance operations when relevant.
> > 
> > How those buffers are created is platform specific, but the
> > guarantees provided *to the driver* are always there.
> > 
> > A buffer allocated with dma_alloc_coherent might be provided by
> > different means (at the hardware level with a coherency unit, or by
> > mapping it non-cacheable), but as far as the driver is concerned
> > it's always going to be coherent.
> > 
> > Similarly, a driver using dma_alloc_noncoherent will always require
> > cache maintenance operations to use the API properly, even if the
> > hardware provides coherency (in which case, those operations will
> > be nops).
> > 
> > So, yeah, like I was saying in the other mail, it looks like you're
> > confusing a bunch of things. dma_alloc_* functions are about the
> > driver expectations and guarantees. DT's dma-coherent property is
> > about how we can implement them on a given platform.
> 
> That is the ideal situation. You haven't seen the actual bugs.

Probably not :)

> Yeah, I am a bit confused about the DMA API. Maybe you and Paul can
> continue to work on this.

As far as I'm concerned, there's nothing to fix but your platform
support. I won't work on that.

> But DT's dma-coherent property is definitely not a system-level
> solution. drm/amdgpu, drm/radeon and drm/i915 don't support DT.
> 
> If ARM is dma-noncoherent, I suspect drm/amdgpu and drm/radeon will
> not work on ARM: there are no calls to dma_sync_single_for_device(),
> dma_sync_single_for_cpu(), etc. These drivers assume dma-coherent
> hardware.

No, these drivers assume they have a dma-coherent *buffer*. Which on
devices that don't get hardware coherency means that they probably get
a non-cacheable buffer.

Maxime
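[Editor's illustration, not part of the original mail: a minimal sketch
of the maintenance a driver owes a buffer from dma_alloc_noncoherent(),
per the contract described above; on hardware-coherent platforms these
syncs degrade to no-ops. The function names and fill pattern are
hypothetical.]

    #include <linux/dma-mapping.h>
    #include <linux/string.h>

    /* CPU fills the buffer, then hands it to the device: flush first. */
    static void cpu_fill_then_start_dma(struct device *dev, void *cpu_ptr,
                                        dma_addr_t dma_handle, size_t len)
    {
            memset(cpu_ptr, 0xa5, len);     /* cached CPU-side writes */
            dma_sync_single_for_device(dev, dma_handle, len,
                                       DMA_TO_DEVICE);
            /* ... kick off the DMA transfer here ... */
    }

    /* The device wrote into the buffer: invalidate before CPU reads. */
    static void read_back_results(struct device *dev, void *cpu_ptr,
                                  dma_addr_t dma_handle, size_t len)
    {
            dma_sync_single_for_cpu(dev, dma_handle, len,
                                    DMA_FROM_DEVICE);
            /* ... the CPU may now read cpu_ptr safely ... */
    }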