On Wednesday 21 May 2014 16:32:16 Catalin Marinas wrote:
> On Wed, May 21, 2014 at 03:43:39PM +0100, Arnd Bergmann wrote:
> > On Wednesday 21 May 2014 14:56:36 Catalin Marinas wrote:
> > > On Tue, May 20, 2014 at 08:37:10PM +0100, Arnd Bergmann wrote:
> > > > On Tuesday 20 May 2014 15:12:49 Catalin Marinas wrote:
> > > > The point I was trying to make is a different one: for the streaming
> > > > mapping, you can allow any mask as long as the swiotlb is able to
> > > > create a bounce buffer.
> > >
> > > dma_mask is a hardware parameter and is used by the streaming DMA API
> > > implementation (which could be swiotlb) to decide whether to bounce or
> > > not. The streaming DMA API can take any CPU address as input,
> > > independent of the dma_mask, and bounce accordingly. If it doesn't have
> > > a bounce buffer matching the dma_mask, dma_supported() would return
> > > false.
> > >
> > > > This does not work for the coherent mapping, because that by
> > > > definition cannot use bounce buffers.
> > >
> > > Yes, but most of the time these masks have the same value.
> >
> > Let me start again with an example:
> >
> > io_tlb_end = 0x1000000;    // 16 MB
> > dma_mask   = 0x10000000;   // 256 MB
> > ZONE_DMA   = 0x100000000;  // 4 GB
> > max_pfn    = 0x1000000000; // 64 GB
> >
> > The device driver has set dma_mask and coherent_dma_mask to 256 MB
> > because of a limitation in the hardware. This succeeded because
> > io_tlb_end is well within the dma_mask.
> >
> > Since the coherent_dma_mask is smaller than max_pfn, the swiotlb version
> > of dma_alloc_coherent then allocates the buffer using GFP_DMA. However,
> > that likely returns an address above dma_mask/coherent_dma_mask.
>
> swiotlb_alloc_coherent() first tries __get_free_pages(GFP_DMA). If the
> returned address is not within coherent_dma_mask, it frees the allocated
> pages and tries to allocate from its bounce buffer (io_tlb).

Ok, got it now. Thanks for your patience.
I suppose this was done to maintain the API guarantee that
dma_set_coherent_mask(dev, dma_get_mask(dev)) always succeeds.

> > My point was that to fix this case, you must not compare the
> > coherent_dma_mask requested by the device against io_tlb_end but against
> > ZONE_DMA.
>
> See above, I think swiotlb does the right thing.
>
> We have a problem, however, with CMA, as we assume that the returned
> address is within coherent_dma_mask (we need a contiguous_dma_supported
> function, similar to swiotlb_dma_supported). We also don't limit the CMA
> buffer to ZONE_DMA on arm64.

Right. I guess the answer to that is to teach CMA to honor GFP_DMA. This
was probably never needed on ARM32, but it is an obvious extension.

> > > > As far as I understand it, you are not allowed to create a
> > > > noncacheable mapping for a page that also has a cacheable mapping,
> > > > or did that change between ARMv7 and ARMv8?
> > >
> > > No change between ARMv7 and ARMv8. In fact, ARMv7 has clarified what
> > > "unpredictable" means when you mix different attributes. Basically, we
> > > can mix the same memory type (Normal Cacheable vs Normal NonCacheable),
> > > but if the cacheability is different, we need to do some cache
> > > maintenance before accessing the non-cacheable mapping for the first
> > > time (clean potentially dirty lines that could otherwise be evicted
> > > randomly). Any cache lines speculatively loaded via the cacheable
> > > mapping would not be hit when accessed via the non-cacheable one.
> >
> > Ah, only for the first access? I had remembered this differently, thanks
> > for the clarification.
>
> The first time, but assuming that you no longer write to the cacheable
> alias to dirty it again (basically it's like the streaming DMA: if you
> want to access the cacheable alias, you need cache maintenance).

Ok.

> > > > For CMA, we use the trick of changing the mapping in place, but that
> > > > doesn't work for normal pages, which are mapped using huge TLBs.
> > >
> > > That's true for arm32 and we could do the same for arm64, though I
> > > don't think it's worth it, because we could break very large block
> > > mappings (e.g. 1GB). We have plenty of VA space and we just create a
> > > new non-coherent mapping (this latter part could be optimised, even
> > > generically, to use huge pages where possible).
> >
> > I thought we specifically did this on arm32 to avoid the duplicate
> > mapping, because we had concluded that it would be broken even with the
> > updated ARMv7 definition.
>
> We did it because it was vaguely defined as "unpredictable" in the ARM
> ARM for a long time, until the hw guys realised it's not easy to change
> Linux and decided to clarify what could actually happen (see A3.5.7 in
> the ARMv7 ARM, "Mismatched memory attributes").
>
> But in the past we used to have coherent DMA buffers mapped as Strongly
> Ordered, and this would be a different memory type than Normal
> (NonCacheable). But even in this case, the ARM ARM now states that such a
> mapping is permitted, although the Strongly Ordered properties are not
> guaranteed (it could simply behave like Normal NonCacheable).

Right.

> > > > > Of course, another approach would be to change the cacheability
> > > > > attributes of the io_tlb buffer (or the CMA one), but the current
> > > > > approach has the advantage that the io_tlb buffer can be used for
> > > > > both coherent and non-coherent devices.
> > > >
> > > > That definitely shouldn't be done without performance testing.
> > > > However, I wonder if that was considered in the first place for the
> > > > swiotlb: it's possible that it's cheaper to have the bounce buffer
> > > > for noncoherent devices mapped noncacheable, so that we do a
> > > > cache-bypassing copy into and out of the bounce buffer and save the
> > > > extra flushes.
> > >
> > > It doesn't do this AFAICT, but it's an interesting idea. Well, to be
> > > optimised once we can benchmark it on hardware.
> > >
> > > One issue, though, is when some (or all) devices are coherent.
> > > In this case, even if you use a bounce buffer, it needs to be mapped
> > > coherently at the CPU level (e.g. the device has its own cache that
> > > can be snooped, so CPU accesses need to be cacheable).
> >
> > Do you mean there are cases where an uncached buffer is not coherent
> > with a device cache?
>
> Yes: if, for example, you have a device (e.g. GPU, micro-controller) on a
> coherent interconnect, non-cacheable CPU accesses are not guaranteed to
> (and probably should not) hit the other device's cache (it could be seen
> as a side-effect of allowing cacheable vs non-cacheable aliases).

Right, makes sense. So if we do this trick, it should only be done for
devices that are not coherent to start with.

	Arnd