On Tue, Aug 16, 2022 at 01:41:53PM +0530, Kashyap Desai wrote:
> This issue frequently hits if AMD IOMMU is enabled.
>
> In case of 1MB data transfer, ib core is supposed to set 256 entries of
> 4K page size in MR page table. Because of the defect in ib_sg_to_pages,
> it breaks just after setting one entry.
> Memory region page table entries may find stale entries (or NULL if
> address is memset). Something like this -
>
> crash> x/32a 0xffff9cd9f7f84000
> <<< -This looks like stale entries. Only first entry is valid ->>>
> 0xffff9cd9f7f84000:     0xfffffffffff00000      0x68d31000
> 0xffff9cd9f7f84010:     0x68d32000      0x68d33000
> 0xffff9cd9f7f84020:     0x68d34000      0x975c5000
> 0xffff9cd9f7f84030:     0x975c6000      0x975c7000
> 0xffff9cd9f7f84040:     0x975c8000      0x975c9000
> 0xffff9cd9f7f84050:     0x975ca000      0x975cb000
> 0xffff9cd9f7f84060:     0x975cc000      0x975cd000
> 0xffff9cd9f7f84070:     0x975ce000      0x975cf000
> 0xffff9cd9f7f84080:     0x0     0x0
> 0xffff9cd9f7f84090:     0x0     0x0
> 0xffff9cd9f7f840a0:     0x0     0x0
> 0xffff9cd9f7f840b0:     0x0     0x0
> 0xffff9cd9f7f840c0:     0x0     0x0
> 0xffff9cd9f7f840d0:     0x0     0x0
> 0xffff9cd9f7f840e0:     0x0     0x0
> 0xffff9cd9f7f840f0:     0x0     0x0
>
> All addresses other than 0xfffffffffff00000 are stale entries.
> Once this kind of incorrect page entries are passed to the RDMA h/w,
> AMD IOMMU module detects the page fault whenever h/w tries to access
> addresses which are not actually populated by the ib stack correctly.
> Below prints are logged whenever this issue hits.

I don't understand this.

AFAIK on AMD platforms you can't create an IOVA mapping at -1 like you
are saying above, so how is 0xfffffffffff00000 a valid DMA address?

Or, if the AMD IOMMU HW can actually do this, then I would say it is a
bug in the IOMMU DMA API to allow the aperture used for DMA mapping to
get to the end of ULONG_MAX; it is just asking for overflow bugs.

And if we have to tolerate these addresses then the code should be
designed to avoid the overflow in the first place, i.e. 'end_dma_addr'
should be changed to 'last_dma_addr = dma_addr + (dma_len - 1)', which
does not overflow, with all the logic carefully organized so that none
of the math overflows.

Jason
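
To make the 'last_dma_addr' point concrete, here is a minimal user-space
sketch of the arithmetic. It is not the ib_sg_to_pages code: the
dma_addr_t typedef is a 64-bit stand-in for the kernel type, and every
function name below is hypothetical, made up for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's dma_addr_t; assumed to be 64 bits wide here. */
typedef uint64_t dma_addr_t;

/* Exclusive end: wraps to 0 when the segment reaches the top of the space. */
static dma_addr_t seg_end_exclusive(dma_addr_t dma_addr, uint64_t dma_len)
{
	return dma_addr + dma_len;
}

/* Inclusive last byte: does not wrap for a valid, non-empty segment. */
static dma_addr_t seg_last(dma_addr_t dma_addr, uint64_t dma_len)
{
	assert(dma_len > 0);
	return dma_addr + (dma_len - 1);
}

/* Range check against the exclusive end; fails once the end has wrapped. */
static bool in_seg_exclusive(dma_addr_t addr, dma_addr_t dma_addr,
			     uint64_t dma_len)
{
	return addr >= dma_addr && addr < seg_end_exclusive(dma_addr, dma_len);
}

/* Same check against the inclusive last byte; no overflow involved. */
static bool in_seg_inclusive(dma_addr_t addr, dma_addr_t dma_addr,
			     uint64_t dma_len)
{
	return addr >= dma_addr && addr <= seg_last(dma_addr, dma_len);
}

/* Pages covered by the segment, from inclusive page indexes (page_shift < 64). */
static uint64_t seg_num_pages(dma_addr_t dma_addr, uint64_t dma_len,
			      unsigned int page_shift)
{
	dma_addr_t first_page = dma_addr >> page_shift;
	dma_addr_t last_page = seg_last(dma_addr, dma_len) >> page_shift;

	return last_page - first_page + 1;
}
```

With dma_addr = 0xfffffffffff00000 and dma_len = 1MB, the exclusive end
wraps to 0, so any 'addr < end' style comparison is false for every page
after the first, while the inclusive form still covers all 256 4K pages.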