Hello Niklas,

thank you for your report. This is some great analysis!

On Fri, 03 Nov 2023 16:13:03 +0100
Niklas Schnelle <schnelle@xxxxxxxxxxxxx> wrote:

> Hi Swiotlb Maintainers,
>
> With s390 slated to use dma-iommu.c I was experimenting with
> CONFIG_SWIOTLB_DYNAMIC but was getting errors all over the place.
> Debugging this I've managed to narrow things down to what I believe
> is a memory corruption issue caused by overrunning the entire
> transient memory pool allocated by swiotlb. In my test this happens
> on the iommu_dma_map_page(), swiotlb_tbl_map_single() path when
> handling untrusted PCI devices.
>
> I've seen this happen only for transient pools when:
> * allocation size >= PAGE_SIZE and,
> * the original address of the mapping is not page aligned.
>
> The overrun can be seen clearly by adding a simple "tlb_addr +
> alloc_size > pool->end" overrun check to swiotlb_tbl_map_single()
> and forcing PCI test devices to be untrusted. With that in place,
> while an NVMe fio workload runs fine, TCP/IP tests on a Mellanox
> ConnectX-4 VF reliably trigger the overrun check and with a debug
> print produce output like the following:
>
> software IO TLB:
> transient:1
> index:1
> dma_get_min_align_mask(dev):0
> orig_addr:ac0caebe
> tlb_addr=a4d0f800
> start:a4d0f000
> end:a4d10000
> pool_size:4096
> alloc_size:4096
> offset:0
>
> The complete code used for this test is available in my
> git.kernel.org[0] repository, but it's basically v6.6 + iommu/next
> (for s390 DMA API) + 2 trivial debug commits.
>
> For further analysis I've worked closely with Halil Pasic.
>
> The reason why we think this is happening seems twofold. Under a
> certain set of circumstances in the function swiotlb_find_slots():
> 1) the allocated transient pool cannot fit the required bounce
>    buffer,

I am aware that this can happen. It's in fact why the index returned
by swiotlb_search_pool_area() is checked in swiotlb_find_slots().

> and
> 2) the invocation of swiotlb_pool_find_slots() finds "a suitable
>    slot" even though it should fail.

This needs fixing.

> The reason for 2), i.e. swiotlb_pool_find_slots() thinking there is
> a suitable bounce buffer in the pool, is that for transient pools
> pool->slots[i].list ends up nonsensical, because
> swiotlb_init_io_tlb_pool(), among other places in swiotlb, assumes
> that the nslabs of the pool is a multiple of IO_TLB_SEGSIZE.

Ah... Yes, the transient pool size must also be a multiple of segment
size, but it is not enforced. This reminds me of my debugging session
that resulted in commit aabd12609f91 ("swiotlb: always set the number
of areas before allocating the pool") and this conversation:

https://lore.kernel.org/linux-iommu/20230629074403.7f8ed9d6@xxxxxxxxxxxxxxxxxxxx/

> The reason for 1) is a bit more convoluted and not entirely
> understood by us. We are certain though that the function
> swiotlb_find_slots() allocates a pool with nr_slots(alloc_size),
> where this alloc_size is the alloc_size from
> swiotlb_tbl_map_single() + swiotlb_align_offset(), but for alignment
> reasons some slots may get "skipped over" in
> swiotlb_area_find_slots(), causing the picked slots to overrun the
> allocation.

Yes, this can happen.

> Not sure how to properly fix this as the different alignment
> requirements get pretty complex quickly. So we would appreciate your
> input.

I don't think it's possible to improve the allocation logic without
modifying the page allocator and/or the DMA atomic pool allocator to
take additional constraints into account.
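That said, the missing segment-size invariant looks like something
that could be enforced right away. Just as an untested sketch (I have
not verified that this is the right place, or that nothing else
depends on the current sizing), the transient pool in
swiotlb_find_slots() could be sized up to a whole number of segments
before it is allocated:

	/*
	 * Untested sketch, not a proper patch: allocate transient
	 * pools in whole IO_TLB_SEGSIZE segments so that the
	 * free-list initialization in swiotlb_init_io_tlb_pool()
	 * stays consistent with the actual pool size.
	 */
	nslabs = ALIGN(nr_slots(alloc_size), IO_TLB_SEGSIZE);

This wastes up to one segment per transient pool, and it does nothing
about the slots skipped for alignment reasons in
swiotlb_area_find_slots(), so a minimally sized pool could still turn
out too small, but at least the free list would no longer describe
slots past the end of the pool.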
I had a wild idea back in March, but it would require some intrusive
changes in the mm subsystem. Among other things, it would make memory
zones obsolete. I mean, people may actually like to get rid of DMA,
DMA32 and NORMAL, but you see how many nasty bugs were introduced even
by a relatively small change in SWIOTLB. Replacing memory zones with a
system based on generic physical allocation constraints would probably
blow up the universe. ;-)

Petr T