On Wednesday 21 May 2014 16:32:16 Catalin Marinas wrote:
> On Wed, May 21, 2014 at 03:43:39PM +0100, Arnd Bergmann wrote:
> > On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
> > > On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
> > > > On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
> > > > The point I was trying to make is a different one: for the streaming
> > > > mapping, you can allow any mask as long as the swiotlb is able to
> > > > create a bounce buffer.
> > >
> > > dma_mask is a hardware parameter and is used by the streaming DMA API
> > > implementation (which could be swiotlb) to decide whether to bounce or
> > > not. The streaming DMA API can take any CPU address as input,
> > > independent of the dma_mask, and bounce accordingly. If it doesn't have
> > > a bounce buffer matching the dma_mask, dma_supported() would return
> > > false.
> > >
> > > > This does not work for the coherent mapping, because that by
> > > > definition cannot use bounce buffers.
> > >
> > > Yes, but most of the time these masks have the same value.
> >
> > Let me start again with an example:
> >
> > io_tlb_end = 0x1000000;    // 16 MB
> > dma_mask   = 0x10000000;   // 256 MB
> > ZONE_DMA   = 0x100000000;  // 4 GB
> > max_pfn    = 0x1000000000; // 64 GB
> >
> > The device driver has set dma_mask and coherent_dma_mask to 256 MB
> > because of a limitation in the hardware. This succeeded because
> > io_tlb_end is well within the dma_mask.
> >
> > Since the coherent_dma_mask is smaller than max_pfn, the swiotlb version
> > of dma_alloc_coherent then allocates the buffer using GFP_DMA. However,
> > that likely returns an address above dma_mask/coherent_dma_mask.
>
> swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA). If the
> returned address is not within coherent_dma_mask, it frees the allocated
> pages and tries to allocate from its bounce buffer (io_tlb).

Ok, got it now. Thanks for your patience.
I suppose this was done to maintain the API guarantee that
dma_set_coherent_mask(dev, dma_get_mask(dev)) always succeeds.

> > My point was that to fix this case, you must not compare the
> > coherent_dma_mask requested by the device against io_tlb_end but against
> > ZONE_DMA.
>
> See above, I think swiotlb does the right thing.
>
> We have a problem, however, with CMA, as we assume that the returned
> address is within coherent_dma_mask (we need a contiguous_dma_supported
> function, similar to swiotlb_dma_supported). We also don't limit the CMA
> buffer to ZONE_DMA on arm64.

Right. I guess the answer to that is to teach CMA to honor GFP_DMA. This
was probably never needed on ARM32, but it is an obvious extension.

> > > > As far as I understand it, you are not allowed to create a
> > > > noncacheable mapping for a page that also has a cacheable mapping,
> > > > or did that change between ARMv7 and ARMv8?
> > >
> > > No change between ARMv7 and ARMv8. In fact, ARMv7 has clarified what
> > > "unpredictable" means when you mix different attributes. Basically, we
> > > can mix the same memory type (Normal Cacheable vs Normal NonCacheable),
> > > but if the cacheability is different, we need to do some cache
> > > maintenance before accessing the non-cacheable mapping for the first
> > > time (clean potentially dirty lines that could otherwise be evicted
> > > randomly). Any cache lines speculatively loaded via the cacheable
> > > mapping would not be hit when accessed via the non-cacheable one.
> >
> > Ah, only for the first access? I had remembered this differently, thanks
> > for the clarification.
>
> The first time, but assuming that you no longer write to the cacheable
> alias to dirty it again (basically it's like the streaming DMA: if you
> want to access the cacheable alias, you need cache maintenance).

Ok.

> > > > For CMA, we use the trick of changing the mapping in place, but that
> > > > doesn't work for normal pages, which are mapped using huge TLBs.
> > >
> > > That's true for arm32 and we could do the same for arm64, though I
> > > don't think it's worth it, because we could break very large block
> > > mappings (e.g. 1GB). We have plenty of VA space and we just create a
> > > new non-coherent mapping (this latter part could be optimised, even
> > > generically, to use huge pages where possible).
> >
> > I thought we specifically did this on arm32 to avoid the duplicate
> > mapping, because we had concluded that it would be broken even with the
> > updated ARMv7 definition.
>
> We did it because it was vaguely defined as "unpredictable" in the ARM
> ARM for a long time, until the hw guys realised it's not easy to change
> Linux and decided to clarify what could actually happen (see A3.5.7 in
> the ARMv7 ARM, "Mismatched memory attributes").
>
> But in the past we used to have coherent DMA buffers mapped as Strongly
> Ordered, and this would be a different memory type than Normal
> (NonCacheable). But even in this case, the ARM ARM now states that such a
> mapping is permitted, although the Strongly Ordered properties are not
> guaranteed (it could simply behave like Normal NonCacheable).

Right.

> > > > > Of course, another approach would be to change the cacheability
> > > > > attributes of the io_tlb buffer (or the CMA one), but the current
> > > > > approach has the advantage that the io_tlb buffer can be used for
> > > > > both coherent and non-coherent devices.
> > > >
> > > > That definitely shouldn't be done without performance testing.
> > > > However, I wonder if that was considered in the first place for the
> > > > swiotlb: it's possible that it's cheaper to have the bounce buffer
> > > > for noncoherent devices mapped noncacheable, so that we do a
> > > > cache-bypassing copy into and out of the bounce buffer and save the
> > > > extra flushes.
> > >
> > > It doesn't do this AFAICT, but it's an interesting idea. Well, to be
> > > optimised once we can benchmark it on hardware.
> > >
> > > One issue, though, is when some (or all) devices are coherent.
> > > In this case, even if you use a bounce buffer, it needs to be mapped
> > > coherently at the CPU level (e.g. the device has its own cache that
> > > can be snooped, so CPU accesses need to be cacheable).
> >
> > Do you mean there are cases where an uncached buffer is not coherent
> > with a device cache?
>
> Yes: if, for example, you have a device (e.g. GPU, micro-controller) on a
> coherent interconnect, non-cacheable CPU accesses are not guaranteed to
> (and probably should not) hit the other device's cache (it could be seen
> as a side-effect of allowing cacheable vs non-cacheable aliases).

Right, makes sense. So if we do this trick, it should only be done for
devices that are not coherent to start with.

	Arnd