On 14/6/18 10:35 pm, David Gibson wrote: > On Thu, Jun 14, 2018 at 04:35:18PM +1000, Alexey Kardashevskiy wrote: >> On 12/6/18 2:17 pm, David Gibson wrote: >>> On Fri, Jun 08, 2018 at 03:46:33PM +1000, Alexey Kardashevskiy wrote: >>>> At the moment we allocate the entire TCE table, twice (hardware part and >>>> userspace translation cache). This normally works as we normally have >>>> contigous memory and the guest will map entire RAM for 64bit DMA. >>>> >>>> However if we have sparse RAM (one example is a memory device), then >>>> we will allocate TCEs which will never be used as the guest only maps >>>> actual memory for DMA. If it is a single level TCE table, there is nothing >>>> we can really do but if it a multilevel table, we can skip allocating >>>> TCEs we know we won't need. >>>> >>>> This adds ability to allocate only first level, saving memory. >>>> >>>> This changes iommu_table::free() to avoid allocating of an extra level; >>>> iommu_table::set() will do this when needed. >>>> >>>> This adds @alloc parameter to iommu_table::exchange() to tell the callback >>>> if it can allocate an extra level; the flag is set to "false" for >>>> the realmode KVM handlers of H_PUT_TCE hcalls and the callback returns >>>> H_TOO_HARD. >>>> >>>> This still requires the entire table to be counted in mm::locked_vm. >>>> >>>> To be conservative, this only does on-demand allocation when >>>> the usespace cache table is requested which is the case of VFIO. >>>> >>>> The example math for a system replicating a powernv setup with NVLink2 >>>> in a guest: >>>> 16GB RAM mapped at 0x0 >>>> 128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000 >>>> >>>> the table to cover that all with 64K pages takes: >>>> (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4556MB >>>> >>>> If we allocate only necessary TCE levels, we will only need: >>>> (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect >>>> levels). >>>> >>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx> >>>> --- >>>> arch/powerpc/include/asm/iommu.h | 7 ++- >>>> arch/powerpc/platforms/powernv/pci.h | 6 ++- >>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 4 +- >>>> arch/powerpc/platforms/powernv/pci-ioda-tce.c | 69 ++++++++++++++++++++------- >>>> arch/powerpc/platforms/powernv/pci-ioda.c | 8 ++-- >>>> drivers/vfio/vfio_iommu_spapr_tce.c | 2 +- >>>> 6 files changed, 69 insertions(+), 27 deletions(-) >>>> >>>> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h >>>> index 4bdcf22..daa3ee5 100644 >>>> --- a/arch/powerpc/include/asm/iommu.h >>>> +++ b/arch/powerpc/include/asm/iommu.h >>>> @@ -70,7 +70,7 @@ struct iommu_table_ops { >>>> unsigned long *hpa, >>>> enum dma_data_direction *direction); >>>> >>>> - __be64 *(*useraddrptr)(struct iommu_table *tbl, long index); >>>> + __be64 *(*useraddrptr)(struct iommu_table *tbl, long index, bool alloc); >>>> #endif >>>> void (*clear)(struct iommu_table *tbl, >>>> long index, long npages); >>>> @@ -122,10 +122,13 @@ struct iommu_table { >>>> __be64 *it_userspace; /* userspace view of the table */ >>>> struct iommu_table_ops *it_ops; >>>> struct kref it_kref; >>>> + int it_nid; >>>> }; >>>> >>>> +#define IOMMU_TABLE_USERSPACE_ENTRY_RM(tbl, entry) \ >>>> + ((tbl)->it_ops->useraddrptr((tbl), (entry), false)) >>> >>> Is real mode really the only case where you want to inhibit new >>> allocations? I would have thought some paths would be read-only and >>> you wouldn't want to allocate, even in virtual mode. >> >> There are paths when I do not want allocation but I can figure out that >> from dma direction flag, for example, I am cleaning up the table and I do >> not want any extra allocation to happen there and they do happen because I >> made a mistake so I'll repost. Other than that, this @alloc flag is for >> real mode only. > > Hm, ok. > > Something just occurred to me: IIRC, going from the vmalloc() to the > explicit table structure was to avoid allocating pages for memory > regions that aren't there due to sparseness. But.. won't you get that > with vmalloc() anyway, if the pages for the gaps never get > instantiated? Nope, vmalloc() always allocates all the pages, these are not swapable (checked with Paul). -- Alexey