On Thu, Nov 29, 2018 at 09:24:17AM -0800, Tomasz Figa wrote: > [CC Marek] > > On Thu, Nov 29, 2018 at 9:09 AM Daniel Vetter <daniel@xxxxxxxx> wrote: > > > > On Thu, Nov 29, 2018 at 5:57 PM Christoph Hellwig <hch@xxxxxx> wrote: > > > On Thu, Nov 29, 2018 at 05:28:07PM +0100, Daniel Vetter wrote: > > > > Just spend a bit of time reading through the implementations already > > > > merged. Is the struct device *dev parameter actually needed anywhere? > > > > dma-api definitely needs it, because we need that to pick the right iommu. > > > > But for cache management from what I've seen the target device doesn't > > > > matter, all the target specific stuff will be handled by the iommu. > > > > > > It looks like only mips every uses the dev argument, and even there > > > the function it passes dev to ignores it. So it could probably be removed. > > > > > Whether the cache maintenance operation needs to actually do anything > or not is a function of `dev`. We can have some devices that are > coherent with CPU caches, and some that are not, on the same system. So the thing is that the gpu driver knows this too. It fairly often can even decide at runtime (through surface state bits or gpu pte bits) whether to use coherent or non-coherent transactions. dma-api assuming that a given device is always coherent or non-coherent is one of the fundamental mismatches we have. If you otoh need dev because there's some strange bridge caches you need to flush (I've never seen that, but who knows), that would be a diffeernt thing. All the bridge flushing I've seen is attached to the iommu though, so would be really a surprise if the cache management needs that too. > > > > > > > > Dropping the dev parameter would make this a perfect fit for coherency > > > > management of buffers used by multiple devices. Right now there's all > > > > kinds of nasty tricks for that use cases needed to avoid redundant > > > > flushes. > > > > > > Note that one thing I'd like to avoid is exposing these funtions directly > > > to drivers, as that will get us into all kinds of abuses. > > > > What kind of abuse do you expect? It could very well be that gpu folks > > call that "standard use case" ... At least on x86 with the i915 driver > > we pretty much rely on architectural guarantees for how cache flushes > > work very much. Down to userspace doing the cache flushing for > > mappings the kernel has set up. > > i915 is a very specific case of a fully contained, > architecture-specific hardware subsystem, where you can just hardcode > all integration details inside the driver, because nobody else would > care. Nah, it's not fully contained, it's just with your arm eyewear on you can't see how badly it leaks all over the place. The problem is that GPUs (in many cases at least) want to decide whether and when to do cache maintenance. We need a combo of iommu and cache maintenance apis that allows this, across multiple devices, because the dma-api doesn't cut it. Aside: The cache maintenance api _must_ do the flush when asked to, even if it believes no cache maintenance is necessary. Even if the default mode is coherent for a platform/dev combo, the gpu driver might want to do a non-coherent transaction. > In ARM world, you can have the same IP blocks licensed by multiple SoC > vendors with different integration details and that often includes the > option of coherency. My experience is that for soc gpus, you need to retune up/download code for every soc. Or at least every family of socs. This is painful. I guess thus far the arm soc gpu drivers we have merged aren't the ones that needed widely different tuning on different soc families/ranges. What's worse, it's userspace who will decide whether to use the coherent or non-coherent paths in many cases (that's at least how it worked since decades on the big gpus, small gpus just lag a few years usually). The kerenl only provides the tools for the userspace opengl/vulkan driver to do all the things. Trying to hide coherent vs. non-coherent like it's done for everyone else is probably not going to work for gpus. In fact, hasn't ever worked for gpus thus far, and unlikely to change I think. -Daniel > > > So I'd much prefer if we could have iommu APIs wrapping these that are > > > specific to actual uses cases that we understand well. > > > > > > As for the buffer sharing: at least for the DMA API side I want to > > > move the current buffer sharing users away from dma_alloc_coherent > > > (and coherent dma_alloc_attrs users) and the remapping done in there > > > required for non-coherent architectures. Instead I'd like to allocate > > > plain old pages, and then just dma map them for each device separately, > > > with DMA_ATTR_SKIP_CPU_SYNC passed for all but the first user to map > > > or last user to unmap. On the iommu side it could probably work > > > similar. > > > > I think this is what's often done. Except then there's also the issue > > of how to get at the cma allocator if your device needs something > > contiguous. There's a lot of that still going on in graphics/media. > > I suppose one could just expose CMA with the default pool directly. > It's just an allocator, so not sure why it would need any > device-specific information. > > There is also the use case of using CMA with device-specific pools of > memory reusable by the system when not used by the device and those > would have to somehow get the pool to allocate from, but I wonder if > struct device is the right way to pass such information. I'd see the > pool given explicitly like cma_alloc(struct cma_pool *, size, flags) > and perhaps a wrapper cma_alloc_default(size, flags) that is just a > simple macro calling cma_alloc(&cma_pool_default, size, flags). > > Best regards, > Tomasz -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch