On Thu, Apr 26, 2018 at 1:26 AM, Russell King - ARM Linux
<linux@xxxxxxxxxxxxxxx> wrote:
> On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
>> On arm that doesn't work. The iommu api seems like a good fit, except
>> the dma-api tends to get in the way a bit (drm/msm apparently has
>> similar problems to tegra), and if you need contiguous memory,
>> dma_alloc_coherent is the only way to get at it. There was a huge
>> discussion years ago about that, and direct cma access was shot down
>> because it would have exposed too much of the caching attribute
>> mangling required (most arm platforms need wc-pages to not be in the
>> kernel's linear map, apparently).
>
> I think you completely misunderstand ARM from what you've written above,
> and this worries me greatly about giving DRM the level of control that
> is being asked for.
>
> Modern ARMs have a PIPT cache or a non-aliasing VIPT cache, and cache
> attributes are stored in the page tables. These caches are inherently
> non-aliasing when there are multiple mappings (which is a great step
> forward compared to the previous aliasing caches.)
>
> As the cache attributes are stored in the page tables, this in theory
> allows different virtual mappings of the same physical memory to have
> different cache attributes. However, there's a problem, and that's
> called speculative prefetching.
>
> Let's say you have one mapping which is cacheable, and another that is
> marked as write combining. If a cache line is speculatively prefetched
> through the cacheable mapping of this memory, and you then read the
> same physical location through the write combining mapping, it is
> possible that you could read cached data.
>
> So, it is generally accepted that all mappings of any particular
> physical bit of memory should have the same cache attributes to avoid
> unpredictable behaviour.
>
> This presents a problem with what is generally called "lowmem", where
> the memory is mapped in kernel virtual space with cacheable
> attributes. It can also happen with highmem if the memory is
> kmapped.
>
> This is why, on ARM, you can't use something like get_free_pages() to
> grab some pages from the system, pass them to the GPU, map them into
> userspace as write-combining, etc. It _might_ work for some CPUs,
> but ARM CPUs vary in how much prefetching they do, and what may work
> for one particular CPU is in no way guaranteed to work for another
> ARM CPU.
>
> The official line from architecture folk is to assume that the caches
> speculate infinitely, are of infinite size, and can write back *dirty*
> data at any moment.
>
> The way to stop things like speculative prefetches of particular
> physical memory is to, quite "simply", not have any cacheable
> mappings of that physical memory anywhere in the system.
>
> Now, cache flushes on ARM tend to be fairly expensive for GPU buffers.
> If you have, say, an 8MB buffer (for a 1080p frame) and you need to
> do a cache operation on that buffer, you'll be iterating over it
> 32 or maybe 64 bytes at a time, "just in case" there's a cache line
> present. As I detailed in my previous email, there is also the
> potential need for _two_ flushes, one before the GPU operation and
> one after, and this becomes _really_ expensive. At that point, you're
> probably way better off using write-combine memory, where you don't
> need to spend CPU cycles performing cache flushing - potentially
> across all CPUs in the system if cache operations aren't broadcast.
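
To put a rough number on the cost described above - a minimal sketch,
with an assumed 64-byte line size and a made-up helper name (this is
not a real kernel API; on ARM the loop body would be a per-line clean
or clean+invalidate instruction):

#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration; ARM cores typically use 32- or 64-byte lines. */
#define CACHE_LINE_SIZE	64

/*
 * Hypothetical helper: walk a buffer one cache line at a time, the way
 * range-based cache maintenance has to.  For an 8 MiB 1080p buffer that
 * is 8 * 1024 * 1024 / 64 = 131072 per-line operations, and double that
 * if you need a clean before the GPU pass and an invalidate after it.
 */
static void flush_range_by_line(uintptr_t start, size_t len)
{
	uintptr_t line = start & ~((uintptr_t)CACHE_LINE_SIZE - 1);
	uintptr_t end = start + len;

	for (; line < end; line += CACHE_LINE_SIZE) {
		/* Placeholder for the per-line cache op, issued "just
		 * in case" the line happens to be in the cache. */
		asm volatile("" : : "r"(line) : "memory");
	}
}

Doing that walk twice per frame is what tips the balance towards
write-combine memory in a lot of cases, which is exactly the tradeoff
described above.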
>
> This isn't a simple matter of "just provide some APIs for cache
> operations" - there's much more that needs to be understood by
> all parties here, especially when we have GPU drivers that can be
> used with quite different CPUs.
>
> It may well be that for some combinations of CPUs and workloads, it's
> better to use write-combine memory without cache flushing, but for
> other CPUs that tradeoff (for the same workload) could well be
> different.
>
> Older ARMs get more interesting, because they have aliasing caches.
> That means the CPU cache aliases across different virtual address
> mappings in some way, which complicates (a) the mapping of memory
> and (b) handling the cache operations on it.
>
> It's too late for me to go into that tonight, and I probably won't
> be reading mail for the next week and a half, sorry.

I didn't know all the details well enough (nor did I have the time to
write a few paragraphs like you did), but the above is what I had in
mind and what I meant. Sorry if my sloppy reply sounded like I was
mixing things up.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
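
For completeness, a rough kernel-style sketch of the two approaches
being weighed above, using the generic DMA-mapping API; the wrapper
functions and the DMA_BIDIRECTIONAL direction are illustrative choices
only, not code from any particular driver:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/*
 * Option A: contiguous write-combine memory.  No CPU cache maintenance
 * is needed around GPU access, but all CPU accesses go around the cache.
 */
static void *alloc_wc_buffer(struct device *dev, size_t size,
			     dma_addr_t *dma)
{
	return dma_alloc_wc(dev, size, dma, GFP_KERNEL);
}

/*
 * Option B: cacheable memory plus explicit maintenance around each GPU
 * pass.  On ARM, both the map and the unmap below can mean walking the
 * whole buffer cache line by cache line.
 */
static int gpu_pass_with_cached_buffer(struct device *dev, void *buf,
				       size_t size)
{
	/* Cleans the CPU cache so the GPU sees the CPU's writes. */
	dma_addr_t dma = dma_map_single(dev, buf, size, DMA_BIDIRECTIONAL);

	if (dma_mapping_error(dev, dma))
		return -ENOMEM;

	/* ... hand "dma" to the GPU here and wait for it to finish ... */

	/* Hands ownership back to the CPU, invalidating stale lines. */
	dma_unmap_single(dev, dma, size, DMA_BIDIRECTIONAL);
	return 0;
}

Which of the two wins depends on the CPU and the workload, which is the
point being made above.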