> > On Fri, Aug 19, 2022 at 03:13:47PM +0530, Kashyap Desai wrote:
> > > > All addresses other than 0xfffffffffff00000 are stale entries.
> > > > Once this kind of incorrect page entry is passed to the RDMA
> > > > h/w, the AMD IOMMU module detects a page fault whenever the h/w
> > > > tries to access addresses which are not actually populated by
> > > > the ib stack correctly. Below prints are logged whenever this
> > > > issue hits.
> > >
> > > I don't understand this. AFAIK on AMD platforms you can't create an
> > > IOVA mapping at -1 like you are saying above, so how is
> > > 0xfffffffffff00000 a valid DMA address?
> >
> > Hi Jason -
> >
> > Let me simplify - consider a case where one SGE has an 8K dma_len
> > and the starting dma address is <0xffffffffffffe000>.
> > It is expected to produce two page table entries - <0xffffffffffffe000>
> > and <0xfffffffffffff000>.
> > Neither DMA address is mapped at -1. The device exposes
> > dma_mask_bits = 64, so the above two addresses are valid mappings
> > from the IOMMU perspective.
>
> That is not my point.
>
> My point is that 0xFFFFFFFFFFFFFFFF should never be used as a DMA
> address because it invites overflow on any maths, and we are not
> careful about this in the kernel in general.

I am not seeing an address overflow case. It is just that the buffer
ends at "0xffffffffffffffff" and it is a genuine dma buffer.
So the worst-case scenario is DMA address = 0xfffffffffffffffe with
dma_len = 1 byte. This must be handled as a genuine dma request.

> > Since end_dma_addr will be zero (in the current code), which is
> > actually not the end dma address but the potential next dma address,
> > we will only end up calling set_page() one time.
>
> Which is the math overflow.

Let's take this case -

ib_sg_to_pages(struct ib_mr *mr, struct scatterlist *sgl, int sg_nents,
	       unsigned int *sg_offset_p,
	       int (*set_page)(struct ib_mr *, u64))

sg_nents = 1

struct scatterlist {
	unsigned long	page_link;
	unsigned int	offset;		/* = 0 */
	unsigned int	length;		/* = 8192 */
	dma_addr_t	dma_address;	/* = 0xffffffffffffe000 */
	unsigned int	dma_length;	/* = 8192 */
};

The loop below will run only one time:

	for_each_sg(sgl, sg, sg_nents, i) {

Now we enter the inner loop with dma_addr = page_addr =
0xffffffffffffe000, and "end_dma_addr = dma_addr + dma_len" wraps to
ZERO:

	eval 0xffffffffffffe000 + 8192
	hexadecimal: 0

	do {
		ret = set_page(mr, page_addr);	<<< this callback is called
						    only one time, with
						    page_addr = 0xffffffffffffe000
		if (unlikely(ret < 0)) {
			sg_offset = prev_addr - sg_dma_address(sg);
			mr->length += prev_addr - dma_addr;
			if (sg_offset_p)
				*sg_offset_p = sg_offset;
			return i || sg_offset ? i : ret;
		}
		prev_addr = page_addr;
next_page:
		page_addr += mr->page_size;	<<< after one iteration,
						    page_addr = 0xfffffffffffff000
	} while (page_addr < end_dma_addr);	<<< the loop breaks here because
						    (0xfffffffffffff000 < 0) is
						    not true

> > I think this is a valid mapping request and don't see any issue of
> > the IOMMU mapping being incorrect.
>
> It should not create mappings that are so dangerous. There is really
> no reason to use the last page of IOVA space that includes -1.

That is correct, but the APIs which deal with the mapping still need
to handle this kind of request gracefully. Right?

> > > And if we have to tolerate these addresses then the code should be
> > > designed to avoid the overflow in the first place, ie 'end_dma_addr'
> > > should be changed to 'last_dma_addr = dma_addr + (dma_len - 1)',
> > > which does not overflow, and all the logic carefully organized so
> > > none of the math overflows.
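As a quick sanity check of that "inclusive last address" idea, a small
standalone userspace model like the one below (names and structure are
mine, this is only a sketch and not an actual verbs.c patch) visits
both pages of this SGE with no wrap anywhere:

	/* Userspace model of the "inclusive last address" loop. */
	#include <stdio.h>
	#include <stdint.h>

	static void set_page(uint64_t page_addr)
	{
		printf("set_page(0x%016llx)\n",
		       (unsigned long long)page_addr);
	}

	int main(void)
	{
		uint64_t page_size = 4096;
		uint64_t dma_addr  = 0xffffffffffffe000ULL;
		uint64_t dma_len   = 8192;

		/* Inclusive last byte: cannot wrap while dma_len >= 1. */
		uint64_t last_dma_addr = dma_addr + (dma_len - 1);
		uint64_t page_addr = dma_addr & ~(page_size - 1);

		do {
			set_page(page_addr);
			/*
			 * Stop once this page covers the last byte; the
			 * subtraction cannot underflow because page_addr
			 * never exceeds last_dma_addr here.
			 */
			if (last_dma_addr - page_addr < page_size)
				break;
			page_addr += page_size;
		} while (1);

		return 0;
	}

This prints set_page(0xffffffffffffe000) and
set_page(0xfffffffffffff000) and then stops, which is the expected
two-page mapping. The real function would of course still need the
sg_offset/next_page handling around this.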
> > Making 'last_dma_addr = dma_addr + (dma_len - 1)' will have another
> > side effect.
>
> Yes, the patch would have to fix everything about the logic to work
> with a last and avoid overflowing maths

Noted.

> > How about just doing below ? This was my initial thought of fixing,
> > but I am not sure which approach is best.
> >
> > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > index e54b3f1b730e..56d1f3b20e98 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -2709,7 +2709,7 @@ int ib_sg_to_pages(struct ib_mr *mr, struct scatterlist *sgl, int sg_nents,
> >  		prev_addr = page_addr;
> >  next_page:
> >  		page_addr += mr->page_size;
> > -	} while (page_addr < end_dma_addr);
> > +	} while (page_addr < (end_dma_addr - 1));
>
> This is now overflowing twice :(

I thought about a better approach that does not create a regression,
and I found that driving the loop from sg_dma_len avoids such issues
gracefully; see the sketch below. How about the original patch?
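To illustrate the length-driven idea (this is only a userspace model
of the approach, not the original patch itself, and the names here are
mine), the terminating condition then never depends on an address sum
that can wrap:

	#include <stdio.h>
	#include <stdint.h>

	static void set_page(uint64_t page_addr)
	{
		printf("set_page(0x%016llx)\n",
		       (unsigned long long)page_addr);
	}

	int main(void)
	{
		uint64_t page_size = 4096;
		uint64_t dma_addr  = 0xffffffffffffe000ULL;	/* SGE start */
		uint64_t dma_len   = 8192;			/* SGE length */

		uint64_t page_addr = dma_addr & ~(page_size - 1);
		/* Bytes from page_addr through the end of the SGE. */
		uint64_t remaining = (dma_addr - page_addr) + dma_len;

		while (remaining) {
			set_page(page_addr);
			remaining -= remaining < page_size ?
				     remaining : page_size;
			/*
			 * page_addr may wrap to 0 after the last page,
			 * but remaining is already 0 by then and the
			 * loop exits without reading it.
			 */
			page_addr += page_size;
		}
		return 0;
	}

Here too set_page() is called for both 0xffffffffffffe000 and
0xfffffffffffff000, and the same shape also handles the worst case of
a buffer whose last byte is 0xffffffffffffffff.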
> Does this bug even still exist? eg does this revert "fix" it?
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=af3e9579ecfb

The above revert is part of my test. In my setup
"iommu_dma_forcedac = $2 = false". Without the revert, it may be
possible that I would hit the issue even more frequently.
Currently I need heavy IO of 1MB to hit this issue. Almost ~8GB is
attempted in my test: a total of 64 NVMe targets at 128 QD, each
sending 1MB IOs.
It looks like DMA mapping is first attempted from below 4GB, and
whenever that range is exhausted and mapping moves to the > 4GB memory
region, this kind of IOVA mapping occurs.

Kashyap

> It makes me wonder if the use of -1 is why drivers started failing
> with this mode.
>
> Jason