On 2017/11/1 20:26, Robin Murphy wrote:
> On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>>
>> On 2017/10/12 20:59, Robin Murphy wrote:
>>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>>> If the IOMMU is enabled, the length of the sg obtained from
>>>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>>>>> dma address, the IOVA will not be page-contiguous, and the VA
>>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>>> the VA obtained by page_address is a discontinuous VA. Under
>>>>>> these circumstances, the IOVA should be calculated based on the
>>>>>> sg length, and the VA returned from dma_alloc_coherent should be
>>>>>> recorded in the hem struct.
>>>>>>
>>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@xxxxxxxxxx>
>>>>>> Signed-off-by: Shaobo Xu <xushaobo2@xxxxxxxxxx>
>>>>>> Signed-off-by: Lijun Ou <oulijun@xxxxxxxxxx>
>>>>>> ---
>>>>> Doug,
>>>>>
>>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>>
>>>>> Thanks
>>>> Hi, Leon & Doug
>>>> We referred to the function named __ttm_dma_alloc_page in the
>>>> kernel code, as below. There are similar methods in the bch_bio_map
>>>> and mem_to_page functions in the current 4.14-rcX.
>>>>
>>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>>> {
>>>>         struct dma_page *d_page;
>>>>
>>>>         d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>>>         if (!d_page)
>>>>                 return NULL;
>>>>
>>>>         d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>>>                                            &d_page->dma,
>>>>                                            pool->gfp_flags);
>>>>         if (d_page->vaddr) {
>>>>                 if (is_vmalloc_addr(d_page->vaddr))
>>>>                         d_page->p = vmalloc_to_page(d_page->vaddr);
>>>>                 else
>>>>                         d_page->p = virt_to_page(d_page->vaddr);
>>> There are cases on various architectures where neither of those is
>>> right. Whether those actually intersect with TTM or RDMA use-cases is
>>> another matter, of course.
>>>
>>> What definitely is a problem is if you ever take that page and end up
>>> accessing it through any virtual address other than the one explicitly
>>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>>> and invite data loss, right up to killing the whole system with a
>>> machine check on certain architectures.
>>>
>>> Robin.
>> Hi, Robin
>> Thanks for your comment.
>>
>> We have one problem, and the related code is as below.
>> 1. Call the dma_alloc_coherent function several times to allocate
>>    memory.
>> 2. vmap the allocated memory pages.
>> 3. Software accesses the memory through the virtual address returned
>>    by vmap, while hardware uses the dma address returned by
>>    dma_alloc_coherent.
> The simple answer is "don't do that". Seriously. dma_alloc_coherent()
> gives you a CPU virtual address and a DMA address with which to access
> your buffer, and that is the limit of what you may infer about it. You
> have no guarantee that the virtual address is either in the linear map
> or vmalloc, and not some other special place. You have no guarantee
> that the underlying memory even has an associated struct page at all.
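>
> To make that concrete, the only pattern the DMA API guarantees looks
> roughly like this (a minimal sketch, not taken from any driver; "dev",
> "size" and the ring descriptor field are placeholders):
>
>         dma_addr_t dma_handle;
>         void *cpu_addr;
>
>         cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
>         if (!cpu_addr)
>                 return -ENOMEM;
>
>         /* CPU accesses go through the returned virtual address only. */
>         memset(cpu_addr, 0, size);
>
>         /* The device is given the matching DMA handle only. */
>         ring->base_addr = cpu_to_le64(dma_handle);
>
>         /*
>          * No virt_to_page(), vmalloc_to_page() or vmap() on cpu_addr:
>          * an alias with mismatched attributes is exactly what breaks
>          * coherency.
>          */
>
>         dma_free_coherent(dev, size, cpu_addr, dma_handle);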
>> When IOMMU is disabled on the ARM64 architecture, we use
>> virt_to_page() before vmap() and it works. When IOMMU is enabled,
>> using virt_to_page() causes a call trace later; we found that the
>> address returned by dma_alloc_coherent is a vmalloc address, so we
>> added the conditional statement below, and it works.
>>
>>         for (i = 0; i < buf->nbufs; ++i)
>>                 pages[i] =
>>                         is_vmalloc_addr(buf->page_list[i].buf) ?
>>                         vmalloc_to_page(buf->page_list[i].buf) :
>>                         virt_to_page(buf->page_list[i].buf);
>>
>> Can you give us a suggestion? Is there a better method?
> Oh my goodness, having now taken a closer look at this driver, I'm lost
> for words in disbelief. To pick just one example:
>
>         u32 bits_per_long = BITS_PER_LONG;
>         ...
>         if (bits_per_long == 64) {
>                 /* memory mapping nonsense */
>         }
>
> WTF does the size of a long have to do with DMA buffer management!?
>
> Of course I can guess that it might be trying to make some tortuous
> inference about vmalloc space being constrained on 32-bit platforms,
> but still...
>
>> The related code is as below:
>>
>>         buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>>                                  GFP_KERNEL);
>>         if (!buf->page_list)
>>                 return -ENOMEM;
>>
>>         for (i = 0; i < buf->nbufs; ++i) {
>>                 buf->page_list[i].buf = dma_alloc_coherent(dev,
>>                                                            page_size, &t,
>>                                                            GFP_KERNEL);
>>                 if (!buf->page_list[i].buf)
>>                         goto err_free;
>>
>>                 buf->page_list[i].map = t;
>>                 memset(buf->page_list[i].buf, 0, page_size);
>>         }
>>
>>         pages = kmalloc_array(buf->nbufs, sizeof(*pages), GFP_KERNEL);
>>         if (!pages)
>>                 goto err_free;
>>
>>         for (i = 0; i < buf->nbufs; ++i)
>>                 pages[i] =
>>                         is_vmalloc_addr(buf->page_list[i].buf) ?
>>                         vmalloc_to_page(buf->page_list[i].buf) :
>>                         virt_to_page(buf->page_list[i].buf);
>>
>>         buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL);
>>         kfree(pages);
>>         if (!buf->direct.buf)
>>                 goto err_free;
> OK, this is complete crap. As above, you cannot assume that a struct
> page even exists; even if it does, you cannot assume that using a
> PAGE_KERNEL mapping will not result in mismatched attributes,
> unpredictable behaviour and data loss. Trying to remap coherent DMA
> allocations like this is just egregiously wrong.
>
> What I do like is that you can seemingly fix all this by simply
> deleting hns_roce_buf::direct and all the garbage code related to it,
> and using the page_list entries consistently, because the alternate
> paths involving those appear to do the right thing already.
>
> That is, of course, assuming that the buffers involved can be so large
> that it's not practical to just always make a single allocation and
> fragment it into multiple descriptors if the hardware does have some
> maximum length constraint - frankly I'm a little puzzled by the
> PAGE_SIZE * 2 threshold, given that that's not a fixed size.
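>
> For the sake of argument, the single-allocation variant would look
> something like this (a rough sketch only; "frag_size", "dma_base" and
> the exact field names are invented):
>
>         buf->direct.buf = dma_alloc_coherent(dev, nbufs * frag_size,
>                                              &dma_base, GFP_KERNEL);
>         if (!buf->direct.buf)
>                 return -ENOMEM;
>
>         /* Carve the one allocation into per-descriptor fragments. */
>         for (i = 0; i < nbufs; ++i) {
>                 buf->page_list[i].buf = buf->direct.buf + i * frag_size;
>                 buf->page_list[i].map = dma_base + i * frag_size;
>         }
>
> The contiguous CPU view then comes for free, with no remapping at all.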
> Robin.
Hi, Robin
We reconstructed the code as below: it replaces dma_alloc_coherent with
the __get_free_pages and dma_map_single functions. So, we can vmap the
several pointers returned by __get_free_pages, right?

        buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
                                 GFP_KERNEL);
        if (!buf->page_list)
                return -ENOMEM;

        for (i = 0; i < buf->nbufs; ++i) {
                ptr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
                                               get_order(page_size));
                if (!ptr) {
                        dev_err(dev, "Alloc pages error.\n");
                        goto err_free;
                }

                t = dma_map_single(dev, ptr, page_size,
                                   DMA_BIDIRECTIONAL);
                if (dma_mapping_error(dev, t)) {
                        dev_err(dev, "DMA mapping error.\n");
                        free_pages((unsigned long)ptr,
                                   get_order(page_size));
                        goto err_free;
                }

                buf->page_list[i].buf = ptr;
                buf->page_list[i].map = t;
        }

        pages = kmalloc_array(buf->nbufs, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                goto err_free;

        for (i = 0; i < buf->nbufs; ++i)
                pages[i] = virt_to_page(buf->page_list[i].buf);

        buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP, PAGE_KERNEL);
        kfree(pages);
        if (!buf->direct.buf)
                goto err_free;
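
We also understand that dma_map_single gives a streaming rather than a
coherent mapping, so ownership of the buffer would have to be passed
between the CPU and the device with the DMA sync calls, roughly like
this (a sketch only, reusing the mapping direction from the code above):

        /* Before the CPU reads data the device may have written: */
        dma_sync_single_for_cpu(dev, buf->page_list[i].map,
                                page_size, DMA_BIDIRECTIONAL);

        /* After the CPU writes, before the device touches the buffer: */
        dma_sync_single_for_device(dev, buf->page_list[i].map,
                                   page_size, DMA_BIDIRECTIONAL);

Is that understanding correct?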
Regards
Wei Hu

>> Regards
>> Wei Hu
>>>>         } else {
>>>>                 kfree(d_page);
>>>>                 d_page = NULL;
>>>>         }
>>>>         return d_page;
>>>> }
>>>>
>>>> Regards
>>>> Wei Hu
>>>>>>  drivers/infiniband/hw/hns/hns_roce_alloc.c |  5 ++++-
>>>>>>  drivers/infiniband/hw/hns/hns_roce_hem.c   | 30 +++++++++++++++++++++++++++---
>>>>>>  drivers/infiniband/hw/hns/hns_roce_hem.h   |  6 ++++++
>>>>>>  drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>>>  4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> index 3e4c525..a69cd4b 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev *hr_dev, u32 size, u32 max_direct,
>>>>>>  		goto err_free;
>>>>>>
>>>>>>  	for (i = 0; i < buf->nbufs; ++i)
>>>>>> -		pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>>> +		pages[i] =
>>>>>> +			is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>>> +			vmalloc_to_page(buf->page_list[i].buf) :
>>>>>> +			virt_to_page(buf->page_list[i].buf);
>>>>>>
>>>>>>  	buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>>>  			       PAGE_KERNEL);
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> index 8388ae2..4a3d1d4 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>  					       gfp_t gfp_mask)
>>>>>>  {
>>>>>>  	struct hns_roce_hem_chunk *chunk = NULL;
>>>>>> +	struct hns_roce_vmalloc *vmalloc;
>>>>>>  	struct hns_roce_hem *hem;
>>>>>>  	struct scatterlist *mem;
>>>>>>  	int order;
>>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>  			sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>>>  			chunk->npages = 0;
>>>>>>  			chunk->nsg = 0;
>>>>>> +			memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>>>  			list_add_tail(&chunk->list, &hem->chunk_list);
>>>>>>  		}
>>>>>>
>>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>  		if (!buf)
>>>>>>  			goto fail;
>>>>>>
>>>>>> -		sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> +		if (is_vmalloc_addr(buf)) {
>>>>>> +			vmalloc = &chunk->vmalloc[chunk->npages];
>>>>>> +			vmalloc->is_vmalloc_addr = true;
>>>>>> +			vmalloc->vmalloc_addr = buf;
>>>>>> +			sg_set_page(mem, vmalloc_to_page(buf),
>>>>>> +				    PAGE_SIZE << order, offset_in_page(buf));
>>>>>> +		} else {
>>>>>> +			sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>> +		}
>>>>>>  		WARN_ON(mem->offset);
>>>>>>  		sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>>
>>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>  void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct hns_roce_hem *hem)
>>>>>>  {
>>>>>>  	struct hns_roce_hem_chunk *chunk, *tmp;
>>>>>> +	void *cpu_addr;
>>>>>>  	int i;
>>>>>>
>>>>>>  	if (!hem)
>>>>>>  		return;
>>>>>>
>>>>>>  	list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>>> -		for (i = 0; i < chunk->npages; ++i)
>>>>>> +		for (i = 0; i < chunk->npages; ++i) {
>>>>>> +			if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>>> +				cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>>> +			else
>>>>>> +				cpu_addr =
>>>>>> +				  lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>>> +
>>>>>>  			dma_free_coherent(hr_dev->dev,
>>>>>>  					  chunk->mem[i].length,
>>>>>> -				  lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>>> +					  cpu_addr,
>>>>>>  					  sg_dma_address(&chunk->mem[i]));
>>>>>> +		}
>>>>>>  		kfree(chunk);
>>>>>>  	}
>>>>>>
>>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev *hr_dev,
>>>>>>
>>>>>>  		if (chunk->mem[i].length > (u32)offset) {
>>>>>>  			page = sg_page(&chunk->mem[i]);
>>>>>> +			if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>>> +				mutex_unlock(&table->mutex);
>>>>>> +				return page ?
>>>>>> +				       chunk->vmalloc[i].vmalloc_addr
>>>>>> +				       + offset : NULL;
>>>>>> +			}
>>>>>>  			goto out;
>>>>>>  		}
>>>>>>  		offset -= chunk->mem[i].length;
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> index af28bbf..62d712a 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>> @@ -72,11 +72,17 @@ enum {
>>>>>>  	 HNS_ROCE_HEM_PAGE_SIZE	= 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>>>  };
>>>>>>
>>>>>> +struct hns_roce_vmalloc {
>>>>>> +	bool	is_vmalloc_addr;
>>>>>> +	void	*vmalloc_addr;
>>>>>> +};
>>>>>> +
>>>>>>  struct hns_roce_hem_chunk {
>>>>>>  	struct list_head	 list;
>>>>>>  	int			 npages;
>>>>>>  	int			 nsg;
>>>>>>  	struct scatterlist	 mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>> +	struct hns_roce_vmalloc	 vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>>  };
>>>>>>
>>>>>>  struct hns_roce_hem {
>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> index b99d70a..9e19bf1 100644
>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
>>>>>>  {
>>>>>>  	struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>>>  	struct scatterlist *sg;
>>>>>> +	u64 page_addr = 0;
>>>>>>  	u64 *pages;
>>>>>> +	int i = 0, j = 0;
>>>>>> +	int len = 0;
>>>>>>  	int entry;
>>>>>> -	int i;
>>>>>>
>>>>>>  	mpt_entry = mb_buf;
>>>>>>  	memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void *mb_buf, struct hns_roce_mr *mr,
>>>>>>
>>>>>>  	i = 0;
>>>>>>  	for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>>> -		pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>>> -
>>>>>> -		/* Record the first 2 entry directly to MTPT table */
>>>>>> -		if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> -			break;
>>>>>> -		i++;
>>>>>> +		len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>>> +		for (j = 0; j < len; ++j) {
>>>>>> +			page_addr = sg_dma_address(sg) +
>>>>>> +				    (j << mr->umem->page_shift);
>>>>>> +			pages[i] = page_addr >> 6;
>>>>>> +
>>>>>> +			/* Record the first 2 entry directly to MTPT table */
>>>>>> +			if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>> +				goto found;
>>>>>> +			i++;
>>>>>> +		}
>>>>>>  	}
>>>>>>
>>>>>> +found:
>>>>>>  	mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>>>  	roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>>>  		       V2_MPT_BYTE_56_PA0_H_S,
>>>>>> --
>>>>>> 1.9.1
>>>>>>