On Thu, Jun 08, 2023 at 04:17:52PM +0800, Sui Jingfeng wrote:
> Hi,
> 
> On 2023/6/8 15:39, Maxime Ripard wrote:
> > On Thu, Jun 08, 2023 at 01:18:38AM +0800, Sui Jingfeng wrote:
> > > Hi,
> > > 
> > > On 2023/6/8 00:12, Paul Cercueil wrote:
> > > > Hi Sui,
> > > > 
> > > > Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit :
> > > > > Hi, welcome to the discussion.
> > > > > 
> > > > > My English is limited, so in the short time available I may not
> > > > > express exactly what I mean, and some of my wording may not be
> > > > > as accurate as yours. Please don't misunderstand me; I am not
> > > > > being rude to you.
> > > > 
> > > > No problem.
> > > > 
> > > > > I will explain it in more detail. See below:
> > > > > 
> > > > > On 2023/6/7 20:09, Paul Cercueil wrote:
> > > > > > Hi Sui,
> > > > > > 
> > > > > > Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > On 2023/6/7 17:36, Paul Cercueil wrote:
> > > > > > > > Hi Sui,
> > > > > > > > 
> > > > > > > > Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit :
> > > > > > > > > The single map_noncoherent member of struct
> > > > > > > > > drm_gem_dma_object may not be sufficient to describe the
> > > > > > > > > backing memory of a GEM buffer object.
> > > > > > > > > 
> > > > > > > > > In particular, on dma-coherent systems the backing
> > > > > > > > > memory is both cache-coherent for multi-core CPUs and
> > > > > > > > > dma-coherent for peripheral devices; architectures such
> > > > > > > > > as X86-64, LoongArch64 and Loongson Mips64 behave this
> > > > > > > > > way.
> > > > > > > > > 
> > > > > > > > > Whether a peripheral device is dma-coherent or not can
> > > > > > > > > be implementation-dependent, so the single
> > > > > > > > > map_noncoherent option is no longer enough to reflect
> > > > > > > > > real hardware. For example, on the Loongson LS3A4000 CPU
> > > > > > > > > and the LS2K2000/LS2K1000 SoCs, peripheral devices
> > > > > > > > > always snoop the CPU's cache, so allocating with
> > > > > > > > > dma_alloc_coherent() is preferred. The returned buffer
> > > > > > > > > is cached and should not use the default write-combine
> > > > > > > > > mapping, yet with the current implementation there is no
> > > > > > > > > way to tell the DRM core about this.
> > > > > > > > > 
> > > > > > > > > This patch adds cached and coherent members to struct
> > > > > > > > > drm_gem_dma_object, which allow driver implementations
> > > > > > > > > to inform the core, introducing new mappings while
> > > > > > > > > keeping the original default behavior unchanged.
> > > > > > > > 
> > > > > > > > Did you try to simply set the "dma-coherent" property in
> > > > > > > > the device's node?
> > > > > > > 
> > > > > > > But this approach can only be applied to device drivers
> > > > > > > with DT support.
> > > > > > > 
> > > > > > > X86-64, Loongson LS3A4000 mips64 and Loongson LS3A5000 CPUs
> > > > > > > typically do not have DT support; they use ACPI to pass
> > > > > > > parameters from the firmware to the Linux kernel.
> > > > > > > 
> > > > > > > Your approach will lose its effectiveness in such a case.
> > > > > > 
> > > > > > Well, I don't really know how ACPI handles it - but it should
> > > > > > just be a matter of setting dev->dma_coherent. That's
> > > > > > basically what the DT code does.
> > > > > > 
> > > > > > Some MIPS boards set it in their setup code for instance.
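[Editor's illustration, not part of the original mail: a minimal sketch
of what such platform setup code can look like, loosely modelled on the
MIPS boards mentioned above. It assumes the architecture provides the
dma_default_coherent flag; the function name is only an example.]

    #include <linux/dma-map-ops.h>
    #include <linux/init.h>

    /*
     * Platform boot code declaring that every device on this board
     * snoops the CPU caches, so DMA mappings default to coherent and
     * no per-node DT "dma-coherent" property is needed.
     */
    void __init plat_mem_setup(void)
    {
            dma_default_coherent = true;
    }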
> > > > > This is a *strategy*, not a *mechanism*.
> > > > > 
> > > > > In this case, DT is only used to describe the hardware. (It is
> > > > > really a hardware-description language, and its granularity is
> > > > > coarse.) It does not change the state of the hardware; it is
> > > > > your platform firmware or kernel setup code that actually does
> > > > > such things.
> > > > > 
> > > > > Just because it works on *one* platform does not guarantee it
> > > > > will work on others.
> > > > 
> > > > If you add the "dma-coherent" property to a device node in DT,
> > > > you effectively specify that the device is DMA-coherent; so you
> > > > describe the hardware, which is what DT is for, and you are not
> > > > changing the state of the hardware.
> > > > 
> > > > Note that some MIPS platforms (arch/mips/alchemy/common/setup.c)
> > > > default to DMA-coherent mapping; I believe you could do something
> > > > similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC.
> > > 
> > > The problem is that a device driver can have various demands.
> > > 
> > > It may want to create different kinds of buffers for different
> > > things simultaneously: say, one allocated with dma_alloc_coherent()
> > > for a command buffer or DMA descriptors, another allocated with
> > > dma_alloc_wc() for uploading shaders, and a third allocated with
> > > dma_alloc_noncoherent() for something else.
> > 
> > And it will work just fine.
> > 
> > struct device's dma_coherent, or DT's dma-coherent property, defines
> > that the device doesn't need any kind of cache maintenance, ever. If
> > it's missing, we need to perform cache maintenance to keep coherency.
> > 
> > dma_alloc_* functions provide guarantees to the driver. With
> > dma_alloc_wc and dma_alloc_coherent, the buffer is coherent, and
> > thus you don't need to perform cache maintenance operations by hand
> > in the driver.
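[Editor's illustration, not part of the original mail: a hedged sketch
of the three allocation flavours discussed above and the contract each
one gives the driver. The device pointer, sizes and variable names are
hypothetical, and error handling is elided.]

    #include <linux/dma-mapping.h>
    #include <linux/sizes.h>

    static void example_alloc_buffers(struct device *dev)
    {
            dma_addr_t ring_dma, shader_dma, scratch_dma;
            void *ring, *shader, *scratch;

            /* Coherent: the driver never performs cache maintenance. */
            ring = dma_alloc_coherent(dev, SZ_4K, &ring_dma, GFP_KERNEL);

            /* Write-combined: still coherent as far as the driver is
             * concerned, just mapped write-combined on the CPU side. */
            shader = dma_alloc_wc(dev, SZ_64K, &shader_dma, GFP_KERNEL);

            /* Non-coherent: the driver owns the cache maintenance and
             * must bracket accesses with the dma_sync_*() helpers. */
            scratch = dma_alloc_noncoherent(dev, SZ_4K, &scratch_dma,
                                            DMA_BIDIRECTIONAL,
                                            GFP_KERNEL);
    }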
> BO returned by dma_alloc_wc() doesn't work on some platforms.

What do you mean by "it doesn't work"? Because then, it's definitely
the platform that's broken if it doesn't provide a coherent buffer.

> This may only be guaranteed on the CPU side; there is no guarantee on
> the GPU side.
> 
> For example, the GPU always snoops the CPU's cache and fetches data
> from the CPU's cache on a hit; on a miss, the GPU fetches the data
> from system RAM.

Right, that's still a coherent buffer.

> But when dma_alloc_wc() is called, the BO is mapped write-combined on
> the CPU side, and the CPU's write buffer gathers the CPU-side writes.
> That is to say, some data may still reside (stall) in the write
> buffer while the GPU fetches from system RAM or the CPU's cache, so
> the GPU will fetch wrong data.

If that's the case, your buffer isn't coherent, and your platform
probably breaks the expectations of the DMA API.

> This is the situation on our hardware; I don't know how the ARM
> platform guarantees coherency in this case.
> 
> If it relies on software to guarantee it, then it is still not
> hardware-maintained coherency. When it relies on software, I call
> that implementation-dependent; some architectures lack such an
> implementation, or don't know how to implement it.
> 
> If the hardware can't even snoop the CPU's cache, I don't believe it
> can snoop the CPU's write buffer. I'm not sure the DMA API can
> provide that guarantee on every architecture.
> 
> > With dma_alloc_noncoherent, the buffer is non-coherent and the
> > driver needs to perform cache maintenance operations when relevant.
> > 
> > How those buffers are created is platform specific, but the
> > guarantees provided *to the driver* are always there.
> > 
> > A buffer allocated with dma_alloc_coherent might be provided by
> > different means (at the hardware level with a coherency unit, or by
> > mapping it non-cacheable), but as far as the driver is concerned
> > it's always going to be coherent.
> > 
> > Similarly, a driver using dma_alloc_noncoherent will always require
> > cache maintenance operations to use the API properly, even if the
> > hardware provides coherency (in which case, those operations will
> > be nops).
> > 
> > So, yeah, like I was saying in the other mail, it looks like you're
> > confusing a bunch of things. dma_alloc_* functions are about the
> > driver expectations and guarantees. DT's dma-coherent property is
> > about how we can implement them on a given platform.
> 
> That is the ideal situation. You haven't seen the actual bugs.

Probably not :)

> Yeah, I am a bit confused about the DMA API. Maybe you and Paul can
> continue to work on this.

As far as I'm concerned, there's nothing to fix but your platform
support. I won't work on that.

> But DT's dma-coherent property is definitely not a system-level
> solution. drm/amdgpu, drm/radeon and drm/i915 don't support DT.
> 
> If ARM is dma-noncoherent, I suspect drm/amdgpu and drm/radeon will
> not work on ARM: there are no calls to dma_sync_single_for_device(),
> dma_sync_single_for_cpu(), etc. These drivers assume dma-coherent
> hardware.

No, these drivers assume they have a dma-coherent *buffer*. Which on
devices that don't get hardware coherency means that they probably get
a non-cacheable buffer.

Maxime
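[Editor's illustration, not part of the original mail: a minimal sketch
of the maintenance a driver owes a buffer from dma_alloc_noncoherent(),
per the contract described above; on hardware-coherent platforms these
syncs degrade to no-ops. The function names and fill pattern are
hypothetical.]

    #include <linux/dma-mapping.h>
    #include <linux/string.h>

    /* CPU fills the buffer, then hands it to the device: flush first. */
    static void cpu_fill_then_start_dma(struct device *dev, void *cpu_ptr,
                                        dma_addr_t dma_handle, size_t len)
    {
            memset(cpu_ptr, 0xa5, len);     /* cached CPU-side writes */
            dma_sync_single_for_device(dev, dma_handle, len,
                                       DMA_TO_DEVICE);
            /* ... kick off the DMA transfer here ... */
    }

    /* The device wrote into the buffer: invalidate before CPU reads. */
    static void read_back_results(struct device *dev, void *cpu_ptr,
                                  dma_addr_t dma_handle, size_t len)
    {
            dma_sync_single_for_cpu(dev, dma_handle, len,
                                    DMA_FROM_DEVICE);
            /* ... the CPU may now read cpu_ptr safely ... */
    }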