On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
> On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
> > On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
> > > On Tue, May 20, 2014 at 12:08:58PM +0100, Arnd Bergmann wrote:
> > > > On Tuesday 20 May 2014 12:02:46 Catalin Marinas wrote:
> > > > > On Mon, May 19, 2014 at 05:55:56PM +0100, Arnd Bergmann wrote:
> > > > > > On Monday 19 May 2014 16:56:08 Catalin Marinas wrote:
> > > > > > > On Mon, May 19, 2014 at 10:44:51AM +0100, Arnd Bergmann wrote:
> > > > > > > > On Monday 19 May 2014 10:03:40 Catalin Marinas wrote:
> > > > > > We probably want to default to 32-bit for arm32 in the absence
> > > > > > of dma-ranges. For arm64, I'd prefer if we could always mandate
> > > > > > dma-ranges to be present for each bus, just like we mandate
> > > > > > ranges to be present. I hope it's not too late for that.
> > > > > >
> > > > > > dma_set_mask should definitely look at the dma-ranges
> > > > > > properties, and the helper that Santosh just introduced should
> > > > > > give us all the information we need. We just need to decide on
> > > > > > the correct behavior.
> > > > >
> > > > > Last time I looked at Santosh's patches I thought the dma-ranges
> > > > > property was per device rather than per bus. We could make it per
> > > > > bus only and let the device call dma_set_mask() explicitly if it
> > > > > wants to restrict it further.
> > > >
> > > > Can you check again? I read the code again yesterday to check this,
> > > > and I concluded it was correctly doing this per bus.
> > >
> > > You are right, I missed the fact that of_dma_get_range() checks the
> > > dma-ranges property in the node's parent.
> > >
> > > So what we need is to set the default dma mask based on the size
> > > information in dma-ranges, something like this:
> > >
> > > 	mask = rounddown_pow_of_two(size) - 1;
> > > 	dma_set_mask(dev, mask); /* or dma_set_mask_and_coherent() */
> >
> > I don't think we should be calling dma_set_mask here, since that would
> > just go and parse the same property again once we fix the
> > implementation.
> >
> > Since this is a low-level helper, we can probably just assign the dma
> > mask directly.
>
> I was thinking of calling it in of_dma_configure() as that's where we
> already set the default dma_mask and coherent_dma_mask. Default to
> 32-bit if no dma-ranges are present.

Right. Actually it should also be capped to 32-bit, to allow
compatibility with drivers that don't call dma_set_mask and can't do
64-bit DMA. This is the normal behavior for PCI drivers: they need to
set a 64-bit mask explicitly and check the result.
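Roughly what I have in mind for of_dma_configure(), as a sketch only:
of_dma_get_range() is the helper from Santosh's series, everything else
here is made up for illustration, not the actual implementation:

	#include <linux/device.h>
	#include <linux/dma-mapping.h>
	#include <linux/kernel.h>
	#include <linux/log2.h>
	#include <linux/of_address.h>

	/* Sketch: derive the default mask from dma-ranges, capped at 32-bit. */
	static void of_dma_setup_default_mask(struct device *dev)
	{
		u64 dma_addr, paddr, size;
		u64 mask = DMA_BIT_MASK(32);	/* default when dma-ranges is absent */

		if (!of_dma_get_range(dev->of_node, &dma_addr, &paddr, &size))
			mask = rounddown_pow_of_two(size) - 1;

		/*
		 * Cap at 32 bits so drivers that never call dma_set_mask()
		 * keep working; 64-bit capable drivers have to raise the
		 * mask themselves and check the result, as on PCI.
		 */
		mask = min_t(u64, mask, DMA_BIT_MASK(32));

		dev->coherent_dma_mask = mask;
		if (!dev->dma_mask)
			dev->dma_mask = &dev->coherent_dma_mask;
		*dev->dma_mask = mask;
	}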
> > > > > > > > While we currently don't have a set of swiotlb DMA ops on
> > > > > > > > ARM32, we do have it on ARM64, and I think we should be
> > > > > > > > using them properly. It should really not be hard to
> > > > > > > > implement a proper dma_set_mask() function for ARM64 that
> > > > > > > > is able to set up the swiotlb based on the dma-ranges
> > > > > > > > properties and always returns success but leaves the mask
> > > > > > > > unchanged.
> > > > > > >
> > > > > > > The swiotlb bounce buffer needs to be pre-allocated at boot,
> > > > > > > otherwise we don't have any guarantees. Since we can't honour
> > > > > > > random masks anyway, we stick to ZONE_DMA, which is currently
> > > > > > > within the 4G limit. But the driver calls dma_set_mask() too
> > > > > > > late for any further swiotlb setup.
> > > > > > >
> > > > > > > With IOMMU we can be more flexible around dma_set_mask(); it
> > > > > > > can be done at run-time.
> > > > > >
> > > > > > What we can do with swiotlb is to check if the mask is smaller
> > > > > > than ZONE_DMA. If it ever is, we have to fail dma_set_mask and
> > > > > > hope the driver can fall back to PIO mode, or it will have to
> > > > > > fail its probe() function.
> > > > >
> > > > > dma_set_(coherent_)mask() checks swiotlb_dma_supported(), which
> > > > > returns false if io_tlb_end goes beyond the device mask. So we
> > > > > just need to ensure that io_tlb is allocated within ZONE_DMA.
> > > >
> > > > Makes sense for dma_set_mask. Why do you do the same thing for
> > > > coherent_mask? Shouldn't that check against ZONE_DMA instead?
> > >
> > > It depends on the meaning of coherent_dma_mask. As I understand it,
> > > that's the dma mask used by dma_alloc_coherent() to return a memory
> > > buffer that the device is able to access. I don't see it as much
> > > different from the dma_mask used by the streaming API. I guess some
> > > hardware could have different masks here if they have cache
> > > coherency only for certain memory ranges (and the coherent_dma_mask
> > > would be smaller than dma_mask).
> >
> > The point I was trying to make is a different one: for the streaming
> > mapping, you can allow any mask as long as the swiotlb is able to
> > create a bounce buffer.
>
> dma_mask is a hardware parameter and is used by the streaming DMA API
> implementation (which could be swiotlb) to decide whether to bounce or
> not. The streaming DMA API can take any CPU address as input,
> independent of the dma_mask, and bounce accordingly. If it doesn't have
> a bounce buffer matching the dma_mask, dma_supported() would return
> false.
>
> > This does not work for the coherent mapping, because that by
> > definition cannot use bounce buffers.
>
> Yes, but most of the time these masks have the same value.

Let me start again with an example:

	io_tlb_end = 0x1000000;		// 16 MB
	dma_mask = 0x10000000;		// 256 MB
	ZONE_DMA = 0x100000000;		// 4 GB
	max_pfn = 0x1000000000;		// 64 GB

The device driver has set dma_mask and coherent_dma_mask to 0x10000000
(256 MB) because of a limitation in the hardware. This succeeded because
io_tlb_end is well within the dma mask. Since the coherent_dma_mask is
smaller than max_pfn, the swiotlb version of dma_alloc_coherent then
allocates the buffer using GFP_DMA. However, that likely returns an
address above dma_mask/coherent_dma_mask.

My point was that to fix this case, you must not compare the
coherent_dma_mask requested by the device against io_tlb_end but against
the top of ZONE_DMA.
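As a sketch of the distinction (zone_dma_limit is a stand-in for
whatever symbol ends up describing the top of ZONE_DMA, and the function
names are invented for illustration):

	#include <linux/device.h>
	#include <linux/dma-mapping.h>
	#include <linux/errno.h>
	#include <linux/swiotlb.h>

	static u64 zone_dma_limit = DMA_BIT_MASK(32);	/* assuming a 4 GB ZONE_DMA */

	/* streaming mask: anything works as long as we can bounce into io_tlb */
	static int example_set_dma_mask(struct device *dev, u64 mask)
	{
		if (!swiotlb_dma_supported(dev, mask))	/* compares against io_tlb_end */
			return -EIO;
		*dev->dma_mask = mask;
		return 0;
	}

	/*
	 * coherent mask: dma_alloc_coherent() cannot bounce, and GFP_DMA
	 * may return anything below the top of ZONE_DMA, so the mask has
	 * to cover all of ZONE_DMA rather than just the io_tlb buffer.
	 */
	static int example_set_coherent_mask(struct device *dev, u64 mask)
	{
		if (mask < zone_dma_limit)
			return -EIO;
		dev->coherent_dma_mask = mask;
		return 0;
	}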
> > > > > > For dma_set_coherent_mask(), we also have to fail any call
> > > > > > that tries to set a mask larger than what the device hardware
> > > > > > can do. Unlike that, dma_set_mask() can succeed with any mask;
> > > > > > we just have to enable swiotlb if the mask that the driver
> > > > > > wants is larger than what the hardware can do.
> > > > >
> > > > > Currently we can't satisfy any arbitrarily small dma mask even
> > > > > with swiotlb, since the bounce buffer is only guaranteed to be
> > > > > in ZONE_DMA. Swiotlb allows for smaller masks, but we would need
> > > > > to reserve the io_tlb buffer early during boot and at smaller
> > > > > addresses. For example, swiotlb_alloc_coherent() first tries
> > > > > __get_free_pages(GFP_DMA) and, if the coherent_dma_mask isn't
> > > > > matched, it frees the pages and falls back to the io_tlb buffer.
> > > > > However, I don't think it's worth going for masks smaller than
> > > > > 32-bit on arm64.
> > > >
> > > > Is that safe for noncoherent systems? I'd expect the io_tlb buffer
> > > > to be cached there, which means we can't use it for coherent
> > > > allocations.
> > >
> > > For non-coherent systems, the current arm64 approach is to take the
> > > address returned by swiotlb_alloc_coherent() and remap it as Normal
> > > NonCacheable (see the arm64 __dma_alloc_noncoherent()). The CPU
> > > should only access the dma buffer via the non-cacheable mapping.
> >
> > Doesn't that require the allocated page to be mapped using small
> > pages in the linear mapping?
>
> Not in the linear mapping but in the vmalloc space (for arm64).

Ok.

> > As far as I understand it, you are not allowed to create a
> > noncacheable mapping for a page that also has a cacheable mapping, or
> > did that change between armv7 and armv8?
>
> No change between ARMv7 and ARMv8. In fact, ARMv7 has clarified what
> "unpredictable" means when you mix different attributes. Basically, we
> can mix aliases of the same memory type (Normal Cacheable vs Normal
> NonCacheable), but if the cacheability differs, we need to do some
> cache maintenance before accessing the non-cacheable mapping for the
> first time (to clear potentially dirty lines that could be evicted
> randomly). Any cache lines loaded speculatively via the cacheable
> mapping would not be hit when accessed via the non-cacheable one.

Ah, only for the first access? I had remembered this differently, thanks
for the clarification.

> > For CMA, we use the trick of changing the mapping in place, but that
> > doesn't work for normal pages, which are mapped using huge TLB
> > entries.
>
> That's true for arm32 and we could do the same for arm64, though I
> don't think it's worth it because we would have to break very large
> mappings (e.g. 1GB). We have plenty of VA space and we just create a
> new non-coherent mapping (this latter part could be optimised, even
> generically, to use huge pages where possible).

I thought we specifically did this on arm32 to avoid the duplicate
mapping, because we had concluded that it would be broken even with the
updated ARMv7 definition.

> > > Of course, another approach would be to change the cacheability
> > > attributes of the io_tlb buffer (or the CMA one), but the current
> > > approach has the advantage that the io_tlb buffer can be used for
> > > both coherent and non-coherent devices.
> >
> > That definitely shouldn't be done without performance testing.
> > However, I wonder whether any such measurements were done for the
> > swiotlb in the first place: it's possible that it's cheaper to map
> > the bounce buffer for noncoherent devices as noncacheable, so we do a
> > cache-bypassing copy into and out of the bounce buffer and save the
> > extra flushes.
>
> It doesn't do this AFAICT, but it's an interesting idea. Well,
> something to be optimised when we can benchmark it on hardware.
>
> One issue though is when some (or all) devices are coherent. In this
> case, even if you use a bounce buffer, it needs to be mapped coherently
> at the CPU level (e.g. the device has its own cache that can be
> snooped, so CPU accesses need to be cacheable).

Do you mean there are cases where an uncached buffer is not coherent
with a device cache?

	Arnd