On Wed, 18 Dec 2019 12:20:44 +0100
Jean-Philippe Brucker <jean-philippe@xxxxxxxxxx> wrote:

> On Tue, Dec 03, 2019 at 07:01:36PM -0800, Jacob Pan (Jun) wrote:
> > Hi Jean,
> >
> > Sorry for the delay, I was out last week. Comments inline below.
> >
> > On Mon, 25 Nov 2019 19:02:47 +0100
> > Jean-Philippe Brucker <jean-philippe@xxxxxxxxxx> wrote:
> >
> > > On Fri, Nov 22, 2019 at 04:01:02PM -0800, Jacob Pan (Jun) wrote:
> > > > > (1) ACPI has one table per vendor (DMAR for Intel, IVRS for
> > > > > AMD and IORT for Arm). From my point of view IORT is easier to
> > > > > extend, since we just need to introduce a new node type. There
> > > > > are no dependencies to Arm in the Linux IORT driver, so it
> > > > > works well with CONFIG_X86.
> > > > From my limited understanding, IORT and VIOT are to solve device
> > > > topology enumeration only? I am not sure how they can be expanded
> > > > to cover information beyond device topology, e.g. DMAR has NUMA
> > > > information and root port ATS. I guess they are not used today in
> > > > the guest but might be additions in the future.
> > >
> > > The PCI root-complex node of IORT has an ATS attribute, which we
> > > can already use. However its scope is the root complex, not
> > > individual root ports like with DMAR.
> > >
> > > I'm not very familiar with NUMA, but it looks like we just need to
> > > specify a proximity domain in relation to the SRAT table, for each
> > > viommu? The SMMUv3 node in IORT has a 4-byte "proximity domain"
> > > field for this. We can add the same to the VIOT virtio-iommu nodes
> > > later, since the structures are extensible.
> > >
> > I think the proximity domain is more for each assigned device than
> > for the vIOMMU. A vIOMMU in the guest can have assigned devices
> > belonging to different pIOMMUs and proximity domains. If the guest
> > owns the first-level page tables (gIOVA or SVA), we want to make
> > sure page tables are allocated from the close proximity domain.
> >
> > My understanding is that virtio-iommu supports both virtio devices
> > and assigned devices; we care less about the former in terms of
> > NUMA.
> >
> > In ACPI, we have the _PXM method to retrieve a device's proximity
> > domain. I don't know if there is something equivalent or a generic
> > way to get _PXM information. I think the VMM also needs to make sure
> > that when an assigned device is used with a vIOMMU, some memory is
> > allocated from the device's proximity domain.
> > > But it might be better to keep the bare minimum information in
> > > the FW descriptor, and put the rest in the virtio-iommu. So yes,
> > > topology enumeration is something the device cannot do itself
> > > (not fully, that is; see (2)), but for the rest, virtio-iommu's
> > > PROBE request can provide details about each endpoint in relation
> > > to their physical IOMMU.
> > >
> > > We could for example add a bit in a PROBE property saying that the
> > > whole path between the IOMMU and the endpoint supports ATS. For
> > > NUMA it might also be more interesting to have a finer
> > > granularity, since one viommu could be managing endpoints that
> > > are behind different physical IOMMUs. If in the future we want to
> > > allocate page tables close to the physical IOMMU for example, we
> > > might need to describe multiple NUMA nodes per viommu, using the
> > > PROBE request.
> > Should we reinvent something for NUMA or use ACPI's SRAT and _PXM?
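
Adding a bit of context to my own question above: in Linux, ACPI _PXM
(or devicetree numa-node-id) already ends up as the device's NUMA node,
which a driver can query with dev_to_node(). A rough, untested sketch of
picking the page-table allocation node from the endpoint rather than
from the vIOMMU could look like the following; the function name is made
up and this is not existing code:

/*
 * Untested sketch, not existing kernel code: pick the NUMA node for a
 * page-table page from the endpoint device rather than from the IOMMU.
 * dev_to_node() is filled from ACPI _PXM (or devicetree numa-node-id)
 * and returns NUMA_NO_NODE when firmware gives no proximity information.
 */
#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/numa.h>
#include <linux/topology.h>

static struct page *alloc_pgtable_page_near(struct device *endpoint, gfp_t gfp)
{
	int nid = dev_to_node(endpoint);

	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();	/* fall back to the local node */

	return alloc_pages_node(nid, gfp | __GFP_ZERO, 0);
}
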
> Regardless of whether we put it in the VIOT or in the virtio-iommu
> PROBE request, we necessarily need to reuse the node IDs defined by
> SRAT (or numa-node-id on devicetree, also a 32-bit value). A
> virtio-pci based virtio-iommu already has the _PXM of its closest
> bridge and wouldn't need anything more in the VIOT, while a
> virtio-mmio based virtio-iommu would need a proximity domain field in
> the VIOT. That could be added later since the table is extensible,
> but as you pointed out, that information might not be very useful.
>
> > I am not sure how it is handled today in QEMU in terms of guest-host
> > NUMA proximity domain mapping.
>
> It looks like the user can specify this guest-host mapping on the
> command-line:
>
> -object memory-backend-ram,id=mem0,size=4G,host-nodes=3,policy=bind
> -object memory-backend-ram,id=mem1,size=4G,host-nodes=4,policy=bind
> -numa node,memdev=mem0,nodeid=numa0
> -numa node,memdev=mem1,nodeid=numa1
> -numa cpu,node-id=numa0,socket-id=0
> -numa cpu,node-id=numa1,socket-id=1
>
> numa0 and numa1 would get proximity domains 0 and 1, corresponding to
> host domains 3 and 4. It is also possible to specify the NUMA node of
> a PCI bus (via the PCI expander bridge), and therefore to assign a
> VFIO PCI device in the same proximity domain as its physical location.
>
> -device pxb,id=bridge1,bus=pci.0,numa_node=1 (simplified)
> -device vfio-pci,host=03:01.0,bus=bridge1
>
Thanks a lot for the thorough explanation. I will give that a try on
x86; I assume the ACPI tables are also built to match these
command-line options.

> Linux can use this information to allocate DMA close to the endpoint
> (see for example __iommu_dma_alloc_pages()). For page-table
> allocation, io-pgtable currently takes the node ID of the IOMMU
> device, not the endpoint. For the scenario you describe (virtio-iommu
> endpoints managed by different physical IOMMUs), we would need to
> take for example the node ID of the first endpoint in the
> iommu_domain for which we're allocating page tables.
>
If an iommu_domain is shared by devices from different NUMA nodes, I
guess taking the first one is as good as any; it would not be an
optimal configuration either way (a rough sketch of that approach is
at the end of this mail).

> Is it safe to assume that the pIOMMU is in the same proximity domain
> as the physical endpoint?
I think it is a safe assumption.

> If that's the case, then the guest already has all the information it
> needs. Otherwise it's easy to add a proximity domain PROBE property
> for each endpoint. Configuring the host to pass that information
> might be more difficult.
>
I agree, the guest should always allocate DMA memory and IOVA page
tables based on the endpoint. VT-d currently allocates page-table
pages based on the IOMMU's NUMA node; that might have to change.

> Off topic, I've been wondering how to make iommu-sva aware of NUMA
> topology as well, so that when handling a page request we allocate
> memory on the faulting device's NUMA node, but I think it might
> require invasive changes to the mm subsystem, to pass a NUMA node to
> handle_mm_fault().
>
> Thanks,
> Jean
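
To make the "node of the first endpoint" idea above a bit more
concrete, here is the kind of thing I had in mind (untested sketch; the
structure, field and function names are made up, this is not existing
code):

#include <linux/device.h>
#include <linux/numa.h>

/* Made-up domain structure, standing in for a driver's private domain */
struct my_domain {
	/* ... existing domain state ... */
	int pgtable_nid;	/* NUMA node used for page-table allocations */
};

static void my_domain_attach_dev(struct my_domain *domain, struct device *dev)
{
	/*
	 * Let the first endpoint attached to the domain decide where the
	 * page tables are allocated; later devices just reuse that node.
	 */
	if (domain->pgtable_nid == NUMA_NO_NODE)
		domain->pgtable_nid = dev_to_node(dev);

	/* ... rest of the attach path ... */
}

The pgtable_nid would then be passed to the page-table allocator
instead of the node of the IOMMU device.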