On Tue, Jun 11, 2024 at 09:17:56AM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 10, 2024 at 05:44:16PM -0700, Nicolin Chen wrote:
> > On Mon, Jun 10, 2024 at 09:28:39PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote:
> > > >
> > > > > > Actually, even now as we put a dispatcher in VMM, VMM still does
> > > > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > > > > > only broadcast a TLBI cmd to all pSMMUs.
> > > > >
> > > > > No, there should be no CD table decoding and no linking ASID to
> > > > > anything by the VMM.
> > > > >
> > > > > The ARM architecture is clean, the ASID can remain private to the VM,
> > > > > there is no reason for the VMM to understand it.
> > > >
> > > > But a guest-level TLBI command usually has only ASID available to
> > > > know which pSMMU to dispatch the command. Without an ASID lookup
> > > > table, how could VMM then dispatch a command to the corresponding
> > > > pSMMU?
> > >
> > > It can broadcast. The ARM architecture does not expect a N:1 mapping
> > > of SMMUs. This is why I think it is not such a good idea..
> >
> > Hmm, I thought we had an agreed idea that we shouldn't broadcast
> > a TLBI (except global NH_ALL/VAA) for invalidation performance?
>
> I wouldn't say agree, there are just lots of different trade offs to
> be made here. You can reduce broadcast by parsing the CD table from
> the VMM. You can reduce broadcast with multiple vSMMUs.
>
> VMM needs to pick a design. I favour multiple vSMMUs.

Yea, having multiple vSMMUs for nesting too seems to be a cleaner
design. Either way we have to put a certain amount of complexity in
the VMM, and it should be more efficient to have it at the boot stage
(creating multiple vSMMUs/PCIs and IORT nodes) vs. at runtime
(trapping and distributing every command).

> > CD table walkthrough would be always done only by VMM, while the
> > lookup table could be created/maintained by the kernel. I feel a
> > vasid table could make sense since we maintain the vdev_id table
> > in the kernel space too.
>
> I'm not convinced we should put such a micro optimization in the
> kernel. If the VMM cares about performance then split the vSMMU,
> otherwise lets just put all the mess in the VMM and give it the tools
> to manage the invalidation distribution.
>
> In the long run things like vCMDQ are going to force the choice to
> multi-smmu, which is why I don't see too much value with investing in
> optimizing the single vSMMU case. The optimization can be done later
> if someone does have a use case.

I see. That makes sense to me.

Nicolin
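
To make the trade-off above concrete, here is a minimal sketch of the
VMM-side dispatch logic being discussed. All names in it (vsmmu_ctx,
asid_to_psmmu, psmmu_invalidate()) are made up purely for illustration
and are not an actual VMM or kernel interface: with an ASID lookup
table built by decoding the guest CD table, a trapped TLBI-by-ASID can
be forwarded to a single pSMMU; without such a table, the command has
to be broadcast to every pSMMU.

#include <stdint.h>

struct vsmmu_ctx {
	unsigned int num_psmmu;
	/* asid_to_psmmu[asid] = pSMMU index + 1, or 0 if unknown */
	uint8_t asid_to_psmmu[1 << 16];
};

/* Stub: forward one invalidation command to a single physical SMMU */
static void psmmu_invalidate(unsigned int psmmu_idx, const void *cmd)
{
	/* e.g. issue the command through the kernel's nesting interface */
}

/* Dispatch a trapped guest TLBI-by-ASID command */
static void vmm_dispatch_tlbi_asid(struct vsmmu_ctx *s, uint16_t asid,
				   const void *cmd)
{
	uint8_t idx = s->asid_to_psmmu[asid];
	unsigned int i;

	if (idx) {
		/* CD table was decoded: this ASID is known to live on one pSMMU */
		psmmu_invalidate(idx - 1, cmd);
		return;
	}

	/* No lookup table (or unknown ASID): fall back to broadcasting */
	for (i = 0; i < s->num_psmmu; i++)
		psmmu_invalidate(i, cmd);
}

With multiple vSMMUs instead, this routing problem never shows up in
the dispatcher at all: each vSMMU instance is backed by exactly one
pSMMU, so whatever is trapped on a given vSMMU's command queue only
ever needs to go to that one pSMMU.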