On Fri, 7 Apr 2023 07:55:48 +0200
Christoph Hellwig <hch@xxxxxx> wrote:

> On Tue, Mar 28, 2023 at 09:54:35AM +0200, Petr Tesarik wrote:
> > I tend to agree here. However, it's the DMABUF design itself that
> > causes some trouble. The buffer is allocated by the v3d driver,
> > which does not have the restriction, so the DMA API typically
> > allocates an address somewhere near the 4G boundary. Userspace then
> > exports the buffer, sends it to another process as a file descriptor
> > and imports it into the vc4 driver, which requires DMA below 1G. In
> > the beginning, v3d had no idea that the buffer would be exported to
> > userspace, much less that it would be later imported into vc4.
>
> Then we need to either:
>
>  a) figure out a way to communicate these addressing limitations

AFAICS this would require a complete overhaul of the dma-buf userspace
API so that intended imports are communicated at export time. In other
words, it would be quite intrusive. Not my preference.

>  b) find a way to migrate a buffer into other memory, similar to
>     how page migration works for page cache

Let me express the idea in my own words to make sure I get it right.
When a DMA buffer is imported, but before it is ultimately pinned in
memory, the importing device driver checks whether the buffer meets
its DMA constraints. If not, it calls a function provided by the
exporting device driver to migrate the buffer (I have put a rough
sketch of this flow at the end of this mail). This makes sense, but:

1) The operation must be implemented in the exporting driver; this
   will take some time.

2) In theory, there may be no overlap between the DMA constraints of
   the exporting device and those of the importing device. OTOH I'm
   not aware of any real-world example, so we can probably return a
   suitable error code, and that's it.

Anyway, I have already written in another reply that my original use
case is moot, because a more recent distribution can do the job
without using dma-buf, so it has been fixed in user space, be it in
GNOME, pipewire, or Mesa (I don't really need to know which).

At this point I would go with the assumption that large buffers
allocated by media subsystems will not hit swiotlb. Consequently, I
don't plan to spend more time on this branch of the story.

> > BTW my testing also suggests that the streaming DMA API is quite
> > inefficient, because UAS performance _improved_ with swiotlb=force.
> > Sure, this should probably be addressed in the UAS and/or xHCI
> > driver, but what I mean is that moving away from swiotlb may even
> > cause performance regressions, which is counter-intuitive. At least
> > I would _not_ have expected it.
>
> That is indeed very odd. Are you running with a very slow iommu
> driver there? Or what is the actual use case there in general?

This was on a Raspberry Pi 4, which does not have any IOMMU. IOW it
looks like copying data around can be faster than sending it straight
to the device. When I have some more time, I must investigate what is
really happening there, because it does not make any sense to me.

Petr T
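
PS: To make the idea in (b) concrete, here is a rough, untested sketch
of what the importer side could look like. The constraint check uses
the existing DMA API, but the ->migrate() callback is entirely made
up: dma_buf_ops has no such member today, and every exporter would
have to implement it before any of this could work.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>

/* Check that every DMA address in the mapping is reachable by dev. */
static bool sgt_fits_dma_mask(struct device *dev, struct sg_table *sgt)
{
	struct scatterlist *sg;
	int i;

	for_each_sgtable_dma_sg(sgt, sg, i)
		if (sg_dma_address(sg) + sg_dma_len(sg) - 1 >
		    dma_get_mask(dev))
			return false;
	return true;
}

/*
 * Importer side: map the attachment, and if the buffer is out of
 * reach of the importing device, ask the exporter to migrate it
 * below the importer's DMA mask, then map it again.
 */
static struct sg_table *
import_with_migration(struct dma_buf_attachment *attach)
{
	struct sg_table *sgt;
	int ret;

	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(sgt))
		return sgt;

	if (sgt_fits_dma_mask(attach->dev, sgt))
		return sgt;

	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);

	/* Hypothetical op; does not exist in the current dma_buf_ops. */
	if (!attach->dmabuf->ops->migrate)
		return ERR_PTR(-EOPNOTSUPP);

	ret = attach->dmabuf->ops->migrate(attach->dmabuf,
					   dma_get_mask(attach->dev));
	if (ret)
		return ERR_PTR(ret);

	return dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
}

Note this still does not answer point 2) above: if the importer's mask
does not intersect any memory the exporter can allocate from,
->migrate() would simply fail, and returning the error is all we can
do.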