Hi Sui, Le mercredi 07 juin 2023 à 22:38 +0800, Sui Jingfeng a écrit : > Hi, welcome to discussion. > > > I have limited skills in manipulating English. > > It may not express what I'm really means in the short time. > > Part of word in the sentence may not as accurate as your. > > Well, please don't misunderstand, I'm not doing the rude to you. No problem. > > I will explain it with more details. > > See below: > > > On 2023/6/7 20:09, Paul Cercueil wrote: > > Hi Sui, > > > > Le mercredi 07 juin 2023 à 18:30 +0800, Sui Jingfeng a écrit : > > > Hi, > > > > > > > > > On 2023/6/7 17:36, Paul Cercueil wrote: > > > > Hi Sui, > > > > > > > > Le mercredi 07 juin 2023 à 13:30 +0800, Sui Jingfeng a écrit : > > > > > The single map_noncoherent member of struct > > > > > drm_gem_dma_object > > > > > may > > > > > not > > > > > sufficient for describing the backing memory of the GEM > > > > > buffer > > > > > object. > > > > > > > > > > Especially on dma-coherent systems, the backing memory is > > > > > both > > > > > cached > > > > > coherent for multi-core CPUs and dma-coherent for peripheral > > > > > device. > > > > > Say architectures like X86-64, LoongArch64, Loongson Mips64, > > > > > etc. > > > > > > > > > > Whether a peripheral device is dma-coherent or not can be > > > > > implementation-dependent. The single map_noncoherent option > > > > > is > > > > > not > > > > > enough > > > > > to reflect real hardware anymore. For example, the Loongson > > > > > LS3A4000 > > > > > CPU > > > > > and LS2K2000/LS2K1000 SoC, peripheral device of such hardware > > > > > platform > > > > > allways snoop CPU's cache. Doing the allocation with > > > > > dma_alloc_coherent > > > > > function is preferred. The return buffer is cached, it should > > > > > not > > > > > using > > > > > the default write-combine mapping. While with the current > > > > > implement, > > > > > there > > > > > no way to tell the drm core to reflect this. > > > > > > > > > > This patch adds cached and coherent members to struct > > > > > drm_gem_dma_object. > > > > > which allow driver implements to inform the core. Introducing > > > > > new > > > > > mappings > > > > > while keeping the original default behavior unchanged. > > > > Did you try to simply set the "dma-coherent" property to the > > > > device's > > > > node? > > > But this approach can only be applied for the device driver with > > > DT > > > support. > > > > > > X86-64, Loongson ls3a4000 mips64, Loongson ls3a5000 CPU typically > > > do > > > not > > > have DT support. > > > > > > They using ACPI to pass parameter from the firmware to Linux > > > kernel. > > > > > > You approach will lost the effectiveness on such a case. > > Well, I don't really know how ACPI handles it - but it should just > > be a > > matter of setting dev->dma_coherent. That's basically what the DT > > code > > does. > > > > Some MIPS boards set it in their setup code for instance. > > > This is a *strategy*, not a *mechanism*. > > In this case, DT is just used to describing the hardware. > > (It is actually a hardware feature describing language, the > granularity > is large) > > It does not changing the state of the hardware. > > It's your platform firmware or kernel setting up code who actually do > such a things. > > > It's just that it works on *one* platform, it does not guarantee it > will > works on others. If you add the "dma-coherent" property in a device node in DT, you effectively specify that the device is DMA-coherent; so you describe the hardware, which is what DT is for, and you are not changing the state of the hardware. Note that some MIPS platforms (arch/mips/alchemy/common/setup.c) default to DMA-coherent mapping; I believe you could do something similar with your Loongson LS3A4000 CPU and LS2K2000/LS2K1000 SoC. > While my patch is trying to create a *mechanism* which could probably > > works on all platform. > > > It is based the patch you have already commuted. > > Thanks for your excellent contribution. > > > > > > From what I understand if you add that property then Linux > > > > will > > > > use DMA > > > > coherent memory even though you use dma_alloc_noncoherent() and > > > > the > > > > sync_single_for_cpu() / sync_single_for_device() are then NOPs. > > > Please do not mitigate the problems with confusing method. > > > > > > > > > This approach not only tend to generate confusion but also > > > implement-dependent > > > > > > and arch-dependent. It's definitely problematic. > > > > > > > > > How does the dma_alloc_coherent/dma_alloc_noncoherent is a ARCH > > > specific > > > thing. > > > > > > Dependent on how does the arch_dma_ops is implemented. > > > > > > > > > The definition of the coherent on different ARCH has different > > > meanings. > > > > > > The definition of the wirte-combine on different ARCH has > > > different > > > meanings. > > > > > > > > > The wirte-combine(uncache acceleration) on mips is non dma- > > > coherent. > > It is dma-coherent on Ingenic SoCs. > > > > > It is dma-coherent ? How does it achieve it? > > > As far as I know, there is a write buffer within the mips cpu. > > typically 64 byte, but it is not cache. It will gather the CPU write > access, > > When a peripheral device do the DMA, how does you platform guarantee > > the data in the CPU write buffer has been already arrived at (or > flushed > out to) > > the system RAM? > > > Does the peripheral device snoop the CPU's write buffer, > > or it need manually flush the write buffer with SYNC instruction? I believe the DMA flushes the write buffer? I don't actually know the details, it would be something to ask to Ingenic. > > > > But on arm, It seem that wirte-combine is coherent. (guaranteed > > > by > > > arch > > > implement). > > > > > > > > > I also heard using dma_alloc_coherent to allocation the buffer > > > for > > > the > > > non-coherent doesn't hurt, but the reverse is not true. > > > > > > > > > But please do not create confusion. > > > > > > software composite is faster because better cacheusing rate and > > > > > > cache is faster to read. > > > > > > It is faster because it is cached, not because it is non- > > > coherent. > > > > > > non-coherent is arch thing and/or driver-side thing, > > > > > > it is a side effect of using the cached mapping. > > Yes, I know that. > > > > > > > > It should left to driver to handle such a side effect. The device > > > driver > > > > > > know their device, so its the device driver's responsibility to > > > maintain > > > the coherency. On loongson platform, we don't need to call > > > drm_fb_dma_sync_non_coherent() function, Its already guaranteed > > > by > > > hardware. > > I understand. What I'm saying, is that you should be able to set > > dma_obj->map_noncoherent (which would arguably be better named > > "map_cached", > > My point is that the word *cached* reflect the nature, > > dma-coherent or dma-noncoherent is secondary. > > We are all on the way to pursue the performance. > > In the end, it is the cache give us the speed. > > > Why not we credit the cache hardware inside of the CPU? dma_alloc_noncoherent() gives you *cached* memory. Therefore, if you want *cached* memory, you should set gem->map_noncoherent. I understand your confusion; it would be easier to understand if this function was called dma_alloc_cached(). Then, if the memory is actually DMA-coherent for the device (dev- >dma_coherent == true), the drm_fb_dma_sync_non_coherent() function is a no-op. But in both cases (DMA-coherent device, non DMA-coherent device), if you want cached buffers, you should call dma_alloc_noncoherent(). > > but that's a different problem). Then the GEM code would > > end up calling dma_alloc_noncoherent(), which will give you > > *cached* > > memory. Then as long as dev->dma_coherent = true, > > drm_fb_dma_sync_non_coherent() should be a NOP - so you wouldn't > > pointlessly sync/invalidate the caches. > > > > And I disagree with you, the driver shouldn't handle such things. > > You already handle the side effect of such things, See below: > > > ``` > > if (ingenic_drm_map_noncoherent(ipu->master)) > drm_fb_dma_sync_non_coherent(ipu->drm, oldstate, newstate); > > ``` > > By the way, Ingenic is the only driver in the drivers/gpu/drm/ that > handle such things > > so far. Yes; and now I think that this was a bad idea (for the reasons Maxime listed in his email). > > > The > > fact that it is better to use cached memory or uncached with write- > > combine really is platform-specific and not something that the > > driver > > should be aware of. > > But the fact is that, It is drm/ingenic tell the drm core, some SoC > is > prefer cached, > > but unable to enforce the coherent. So that it need flush the cache > manually. > > What do you meant by saying that the driver should not be aware of ? Ideally, the driver should just call a function "dma_alloc_buffer", which would return cached memory when it makes sense, otherwise a uncached buffer with the write-combine attribute. Then the arch code (or DT) can decide what's the best setting, and not the driver. In the meantime, you should use gem->dma_noncoherent like the ingenic- drm driver does - until somebody (probably me) refactor things. Cheers, -Paul > > > Cheers, > > -Paul > > > > > > > > > Cheers, > > > > -Paul > > > > > > > > > Signed-off-by: Sui Jingfeng <suijingfeng@xxxxxxxxxxx> > > > > > --- > > > > > drivers/gpu/drm/drm_fb_dma_helper.c | 11 +++++------ > > > > > drivers/gpu/drm/drm_fbdev_dma.c | 2 +- > > > > > drivers/gpu/drm/drm_gem_dma_helper.c | 20 > > > > > ++++++++++++++++---- > > > > > drivers/gpu/drm/ingenic/ingenic-drm-drv.c | 5 ++++- > > > > > drivers/gpu/drm/rcar-du/Kconfig | 2 -- > > > > > drivers/gpu/drm/rcar-du/rcar_du_kms.c | 4 +++- > > > > > include/drm/drm_gem_dma_helper.h | 7 +++++-- > > > > > 7 files changed, 34 insertions(+), 17 deletions(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/drm_fb_dma_helper.c > > > > > b/drivers/gpu/drm/drm_fb_dma_helper.c > > > > > index 3b535ad1b07c..93ff05041192 100644 > > > > > --- a/drivers/gpu/drm/drm_fb_dma_helper.c > > > > > +++ b/drivers/gpu/drm/drm_fb_dma_helper.c > > > > > @@ -106,16 +106,15 @@ dma_addr_t > > > > > drm_fb_dma_get_gem_addr(struct > > > > > drm_framebuffer *fb, > > > > > EXPORT_SYMBOL_GPL(drm_fb_dma_get_gem_addr); > > > > > > > > > > /** > > > > > - * drm_fb_dma_sync_non_coherent - Sync GEM object to non- > > > > > coherent > > > > > backing > > > > > - * memory > > > > > + * drm_fb_dma_sync_non_coherent - Sync GEM object to cached > > > > > backing > > > > > memory > > > > > * @drm: DRM device > > > > > * @old_state: Old plane state > > > > > * @state: New plane state > > > > > * > > > > > * This function can be used by drivers that use damage > > > > > clips > > > > > and > > > > > have > > > > > - * DMA GEM objects backed by non-coherent memory. Calling > > > > > this > > > > > function > > > > > - * in a plane's .atomic_update ensures that all the data in > > > > > the > > > > > backing > > > > > - * memory have been written to RAM. > > > > > + * DMA GEM objects backed by cached memory. Calling this > > > > > function in > > > > > a > > > > > + * plane's .atomic_update ensures that all the data in the > > > > > backing > > > > > memory > > > > > + * have been written to RAM. > > > > > */ > > > > > void drm_fb_dma_sync_non_coherent(struct drm_device *drm, > > > > > struct drm_plane_state > > > > > *old_state, > > > > > @@ -131,7 +130,7 @@ void drm_fb_dma_sync_non_coherent(struct > > > > > drm_device *drm, > > > > > > > > > > for (i = 0; i < finfo->num_planes; i++) { > > > > > dma_obj = drm_fb_dma_get_gem_obj(state->fb, > > > > > i); > > > > > - if (!dma_obj->map_noncoherent) > > > > > + if (dma_obj->cached && dma_obj->coherent) > > > > > continue; > > > > > > > > > > daddr = drm_fb_dma_get_gem_addr(state->fb, > > > > > state, i); > > > > > diff --git a/drivers/gpu/drm/drm_fbdev_dma.c > > > > > b/drivers/gpu/drm/drm_fbdev_dma.c > > > > > index d86773fa8ab0..49fe9b284cc8 100644 > > > > > --- a/drivers/gpu/drm/drm_fbdev_dma.c > > > > > +++ b/drivers/gpu/drm/drm_fbdev_dma.c > > > > > @@ -131,7 +131,7 @@ static int > > > > > drm_fbdev_dma_helper_fb_probe(struct > > > > > drm_fb_helper *fb_helper, > > > > > > > > > > /* screen */ > > > > > info->flags |= FBINFO_VIRTFB; /* system memory */ > > > > > - if (dma_obj->map_noncoherent) > > > > > + if (dma_obj->cached) > > > > > info->flags |= FBINFO_READS_FAST; /* signal > > > > > caching > > > > > */ > > > > > info->screen_size = sizes->surface_height * fb- > > > > > > pitches[0]; > > > > > info->screen_buffer = map.vaddr; > > > > > diff --git a/drivers/gpu/drm/drm_gem_dma_helper.c > > > > > b/drivers/gpu/drm/drm_gem_dma_helper.c > > > > > index 870b90b78bc4..dec1d512bdf1 100644 > > > > > --- a/drivers/gpu/drm/drm_gem_dma_helper.c > > > > > +++ b/drivers/gpu/drm/drm_gem_dma_helper.c > > > > > @@ -93,7 +93,11 @@ __drm_gem_dma_create(struct drm_device > > > > > *drm, > > > > > size_t size, bool private) > > > > > drm_gem_private_object_init(drm, gem_obj, > > > > > size); > > > > > > > > > > /* Always use writecombine for dma-buf > > > > > mappings > > > > > */ > > > > > - dma_obj->map_noncoherent = false; > > > > > + /* FIXME: This is not always true, on some > > > > > dma > > > > > coherent system, > > > > > + * cached mappings should be preferred over > > > > > writecombine > > > > > + */ > > > > > + dma_obj->cached = false; > > > > > + dma_obj->coherent = false; > > > > > } else { > > > > > ret = drm_gem_object_init(drm, gem_obj, > > > > > size); > > > > > } > > > > > @@ -143,7 +147,11 @@ struct drm_gem_dma_object > > > > > *drm_gem_dma_create(struct drm_device *drm, > > > > > if (IS_ERR(dma_obj)) > > > > > return dma_obj; > > > > > > > > > > - if (dma_obj->map_noncoherent) { > > > > > + if (dma_obj->cached && dma_obj->coherent) { > > > > > + dma_obj->vaddr = dma_alloc_coherent(drm->dev, > > > > > size, > > > > > + &dma_obj- > > > > > > dma_addr, > > > > > + > > > > > GFP_KERNEL | > > > > > __GFP_NOWARN); > > > > > + } else if (dma_obj->cached && !dma_obj->coherent) { > > > > > dma_obj->vaddr = dma_alloc_noncoherent(drm- > > > > > >dev, > > > > > size, > > > > > > > > > > &dma_obj- > > > > > > dma_addr, > > > > > > > > > > DMA_TO_DEVICE, > > > > > @@ -153,6 +161,7 @@ struct drm_gem_dma_object > > > > > *drm_gem_dma_create(struct drm_device *drm, > > > > > &dma_obj- > > > > > > dma_addr, > > > > > GFP_KERNEL | > > > > > __GFP_NOWARN); > > > > > } > > > > > + > > > > > if (!dma_obj->vaddr) { > > > > > drm_dbg(drm, "failed to allocate buffer > > > > > with > > > > > size > > > > > %zu\n", > > > > > size); > > > > > @@ -233,7 +242,10 @@ void drm_gem_dma_free(struct > > > > > drm_gem_dma_object > > > > > *dma_obj) > > > > > dma_buf_vunmap_unlocked(gem_obj- > > > > > > import_attach->dmabuf, &map); > > > > > drm_prime_gem_destroy(gem_obj, dma_obj- > > > > > >sgt); > > > > > } else if (dma_obj->vaddr) { > > > > > - if (dma_obj->map_noncoherent) > > > > > + if (dma_obj->cached && dma_obj->coherent) > > > > > + dma_free_coherent(gem_obj->dev->dev, > > > > > dma_obj- > > > > > > base.size, > > > > > + dma_obj->vaddr, > > > > > dma_obj- > > > > > > dma_addr); > > > > > + else if (dma_obj->cached && !dma_obj- > > > > > >coherent) > > > > > dma_free_noncoherent(gem_obj->dev- > > > > > >dev, > > > > > dma_obj->base.size, > > > > > dma_obj- > > > > > >vaddr, > > > > > dma_obj- > > > > > > dma_addr, > > > > > > > > > > DMA_TO_DEVICE); > > > > > @@ -532,7 +544,7 @@ int drm_gem_dma_mmap(struct > > > > > drm_gem_dma_object > > > > > *dma_obj, struct vm_area_struct * > > > > > vma->vm_pgoff -= drm_vma_node_start(&obj- > > > > > >vma_node); > > > > > vm_flags_mod(vma, VM_DONTEXPAND, VM_PFNMAP); > > > > > > > > > > - if (dma_obj->map_noncoherent) { > > > > > + if (dma_obj->cached) { > > > > > vma->vm_page_prot = vm_get_page_prot(vma- > > > > > > vm_flags); > > > > > > > > > > ret = dma_mmap_pages(dma_obj->base.dev- > > > > > >dev, > > > > > diff --git a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c > > > > > b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c > > > > > index 5ec75e9ba499..a3df2f99a757 100644 > > > > > --- a/drivers/gpu/drm/ingenic/ingenic-drm-drv.c > > > > > +++ b/drivers/gpu/drm/ingenic/ingenic-drm-drv.c > > > > > @@ -919,7 +919,10 @@ ingenic_drm_gem_create_object(struct > > > > > drm_device > > > > > *drm, size_t size) > > > > > if (!obj) > > > > > return ERR_PTR(-ENOMEM); > > > > > > > > > > - obj->map_noncoherent = priv->soc_info- > > > > > >map_noncoherent; > > > > > + if (priv->soc_info->map_noncoherent) { > > > > > + obj->cached = true; > > > > > + obj->coherent = false; > > > > > + } > > > > > > > > > > return &obj->base; > > > > > } > > > > > diff --git a/drivers/gpu/drm/rcar-du/Kconfig > > > > > b/drivers/gpu/drm/rcar- > > > > > du/Kconfig > > > > > index 53c356aed5d5..dddc70c08bdc 100644 > > > > > --- a/drivers/gpu/drm/rcar-du/Kconfig > > > > > +++ b/drivers/gpu/drm/rcar-du/Kconfig > > > > > @@ -2,8 +2,6 @@ > > > > > config DRM_RCAR_DU > > > > > tristate "DRM Support for R-Car Display Unit" > > > > > depends on DRM && OF > > > > > - depends on ARM || ARM64 > > > > > - depends on ARCH_RENESAS || COMPILE_TEST > > > > > select DRM_KMS_HELPER > > > > > select DRM_GEM_DMA_HELPER > > > > > select VIDEOMODE_HELPERS > > > > > diff --git a/drivers/gpu/drm/rcar-du/rcar_du_kms.c > > > > > b/drivers/gpu/drm/rcar-du/rcar_du_kms.c > > > > > index adfb36b0e815..1142d51473e6 100644 > > > > > --- a/drivers/gpu/drm/rcar-du/rcar_du_kms.c > > > > > +++ b/drivers/gpu/drm/rcar-du/rcar_du_kms.c > > > > > @@ -386,7 +386,9 @@ struct drm_gem_object > > > > > *rcar_du_gem_prime_import_sg_table(struct drm_device *dev, > > > > > gem_obj->funcs = &rcar_du_gem_funcs; > > > > > > > > > > drm_gem_private_object_init(dev, gem_obj, attach- > > > > > > dmabuf- > > > > > > size); > > > > > - dma_obj->map_noncoherent = false; > > > > > + > > > > > + dma_obj->cached = false; > > > > > + dma_obj->coherent = false; > > > > > > > > > > ret = drm_gem_create_mmap_offset(gem_obj); > > > > > if (ret) { > > > > > diff --git a/include/drm/drm_gem_dma_helper.h > > > > > b/include/drm/drm_gem_dma_helper.h > > > > > index 8a043235dad8..585ce3d4d1eb 100644 > > > > > --- a/include/drm/drm_gem_dma_helper.h > > > > > +++ b/include/drm/drm_gem_dma_helper.h > > > > > @@ -16,7 +16,9 @@ struct drm_mode_create_dumb; > > > > > * more than one entry but they are guaranteed to > > > > > have > > > > > contiguous > > > > > * DMA addresses. > > > > > * @vaddr: kernel virtual address of the backing memory > > > > > - * @map_noncoherent: if true, the GEM object is backed by > > > > > non- > > > > > coherent memory > > > > > + * @cached: if true, the GEM object is backed by cached > > > > > memory > > > > > + * @coherent: This option only meaningful when a GEM object > > > > > is > > > > > cached. > > > > > + * If true, Sync the GEM object for DMA access is > > > > > not > > > > > required. > > > > > */ > > > > > struct drm_gem_dma_object { > > > > > struct drm_gem_object base; > > > > > @@ -26,7 +28,8 @@ struct drm_gem_dma_object { > > > > > /* For objects with DMA memory allocated by GEM DMA > > > > > */ > > > > > void *vaddr; > > > > > > > > > > - bool map_noncoherent; > > > > > + bool cached; > > > > > + bool coherent; > > > > > }; > > > > > > > > > > #define to_drm_gem_dma_obj(gem_obj) \ >