Re: [PATCH] usb: ehci: Enable support for 64bit EHCI host controllers in arm64

Arnd Bergmann <arnd@xxxxxxxx> · Tue, 20 May 2014 21:37:10 +0200

On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
> On Tue, May 20, 2014 at 12:08:58PM +0100, Arnd Bergmann wrote:
> > On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
> > > On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
> > > > On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
> > > > > On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
> > > > > > On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
> > > > We probably want to default to 32-bit for arm32 in the absence of dma-ranges.
> > > > For arm64, I'd prefer if we could always mandate dma-ranges to be present
> > > > for each bus, just like we mandate ranges to be present.
> > > > I hope it's not too late for that.
> > > > 
> > > > dma_set_mask should definitely look at the dma-ranges properties, and the
> > > > helper that Santosh just introduced should give us all the information
> > > > we need. We just need to decide on the correct behavior.
> > > 
> > > Last time I looked at Santosh's patches I thought the dma-ranges is per
> > > device rather than per bus. We could make it per bus only and let the
> > > device call dma_set_mask() explicitly if it wants to restrict it
> > > further.
> > 
> > Can you check again? I've read the code again yesterday to check this,
> > and I concluded it was correctly doing this per bus.
> 
> You are right, I missed the fact that of_dma_get_range() checks the
> dma-ranges property in the node's parent.
> 
> So what we need is setting the default dma mask based on the size
> information in dma-ranges to something like this:
> 
> 	mask = rounddown_pow_of_two(size) - 1;
> 	dma_set_mask(dev, mask);	/* or dma_set_mask_and_coherent() */

I don't think we should be calling dma_set_mask here, since that would
just go and parse the same property again once we fix the implementation.

Since this is a low-level helper, we can probably just assign the dma mask
directly.

> > > > > > While we currently don't have a set of swiotlb DMA ops on ARM32, we do
> > > > > > have it on ARM64, and I think we should be using them properly. It should
> > > > > > really not be hard to implement a proper dma_set_mask() function for
> > > > > > ARM64 that gets is able to set up the swiotlb based on the dma-ranges
> > > > > > properties and always returns success but leaves the mask unchanged.
> > > > > 
> > > > > The swiotlb bounce buffer needs to be pre-allocated at boot, otherwise
> > > > > we don't have any guarantees. Since we can't honour random masks anyway,
> > > > > we stick to ZONE_DMA which is currently in the 4G limit. But the driver
> > > > > calls dma_set_mask() too late for any further swiotlb setup.
> > > > > 
> > > > > With IOMMU we can be more flexible around dma_set_mask(), can be done at
> > > > > run-time.
> > > > 
> > > > What we can do with swiotlb is to check if the mask is smaller than ZONE_DMA.
> > > > If it ever is, we have to fail dma_set_mask and hope the driver can fall
> > > > back to PIO mode or it will have to fail its probe() function.
> > > 
> > > dma_set_(coherent_)mask check swiotlb_dma_supported() which returns
> > > false if io_tlb_end goes beyond the device mask. So we just need to
> > > ensure that io_tlb is allocated within ZONE_DMA.
> > 
> > Makes sense for dma_set_mask. Why do you do the same thing for
> > coherent_mask? Shouldn't that check against ZONE_DMA instead?
> 
> It depends on the meaning of coherent_dma_mask. As I understand it,
> that's the dma mask use for dma_alloc_coherent() to return a memory
> buffer that the device is able to access. I don't see it much different
> from the dma_mask used by the streaming API. I guess some hardware could
> have different masks here if they have cache coherency only for certain
> memory ranges (and the coherent_dma_mask would be smaller than dma_mask).

The point I was trying to make is a different one: For the streaming mapping,
you can allow any mask as long as the swiotlb is able to create a bounce
buffer. This does not work for the coherent mapping, because that by definition
cannot use bounce buffers.

> > > > For dma_set_coherent_mask(), we also have to fail any call that tries to
> > > > set a mask larger than what the device hardware can do. Unlike that,
> > > > dma_set_mask() can succeed with any mask, we just have to enable swiotlb
> > > > if the mask that the driver wants is larger than what the hardware can
> > > > do.
> > > 
> > > Currently we can't satisfy any arbitrarily small dma mask even with
> > > swiotlb since the bounce buffer is just guaranteed to be in ZONE_DMA.
> > > Swiotlb allows for smaller masks but we need to reserve the io_tlb
> > > buffer early during boot and at smaller addresses. For example,
> > > swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA) and if
> > > the coherent_dma_mask isn't matched, it frees the pages and falls back
> > > to the io_tlb buffer. However, I don't think it's worth going for masks
> > > smaller than 32-bit on arm64.
> > 
> > Is that safe for noncoherent systems? I'd expect the io_tlb buffer
> > to be cached there, which means we can't use it for coherent allocations.
> 
> For non-coherent systems, the current arm64 approach is to take the
> address returned by swiotlb_alloc_coherent() and remap it as Normal
> NonCacheable (see the arm64 __dma_alloc_noncoherent()). The CPU should
> only access the dma buffer via the non-cacheable mapping.

Doesn't that require the allocated page to be mapped using small pages
in the linear mapping? As far as I understand it, you are not allowed
to create a noncacheable mapping for a page that also has a cachable
mapping, or did that change between armv7 and armv8?

For CMA, we use the trick to change the mapping in place, but that
doesn't work for normal pages, which are mapped using huge tlbs.

> Of course, another approach would be to change the cacheability
> attributes of the io_tlb buffer (or the CMA one) but current approach
> has the advantage that the io_tlb buffer can be used for both coherent
> and non-coherent devices.

That definitely shouldn't be done without performance testing. However
I wonder if they were done in the first place for the swiotlb: It's
possible that it's cheaper to have the bounce buffer for noncoherent
devices be mapped noncachable so we do cache bypassing copy into the
bounce buffer and out of it and save the extra flushes.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html