Re: [RFC 00/16] padata, vfio, sched: Multithreaded VFIO page pinning

Jason Gunthorpe <jgg@xxxxxxxxxx> · Fri, 7 Jan 2022 13:12:48 -0400

> > It is also not good that this inserts arbitrary cuts in the IOVA
> > address space, that will cause iommu_map() to be called with smaller
> > npages, and could result in a long term inefficiency in the iommu.
> > 
> > I don't know how the kernel can combat this without prior knowledge of
> > the likely physical memory layout (eg is the VM using 1G huge pages or
> > something)..
>
> The cuts aren't arbitrary, padata controls where they happen.  

Well, they are, you picked a PMD alignment if I recall.

If hugetlbfs is using PUD pages then this is the wrong alignment,
right?

I suppose it could compute the cuts differently to try to maximize
alignment at the cutpoints.. 
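
A minimal sketch of what "compute the cuts differently" might look like,
in plain userspace C so it can be tried outside the kernel. Everything
here is an assumption for illustration: the names split_range() and
align_cut() are made up, the sizes are the x86-64 2M PMD and 1G PUD
values, and none of this is taken from the padata series. The idea is
to nudge each evenly spaced cut to the nearest PUD boundary, falling
back to a PMD boundary when no PUD boundary lies inside the remaining
range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PMD_SZ (2ULL << 20)  /* 2M, assumed x86-64 PMD huge page size */
#define PUD_SZ (1ULL << 30)  /* 1G, assumed x86-64 PUD huge page size */

/*
 * Move a proposed cut to the nearest boundary of the largest alignment
 * that still lies strictly inside (lo, hi). Returns the cut unchanged
 * if neither alignment has a usable boundary.
 */
static uint64_t align_cut(uint64_t cut, uint64_t lo, uint64_t hi)
{
    static const uint64_t aligns[] = { PUD_SZ, PMD_SZ };

    for (size_t i = 0; i < sizeof(aligns) / sizeof(aligns[0]); i++) {
        uint64_t a = aligns[i];
        uint64_t down = cut & ~(a - 1);
        uint64_t up = down + a;  /* sketch: ignores overflow near 2^64 */
        uint64_t best = (cut - down <= up - cut) ? down : up;
        uint64_t other = (best == down) ? up : down;

        if (best > lo && best < hi)
            return best;
        if (other > lo && other < hi)
            return other;
    }
    return cut;
}

/*
 * Split [start, end) into at most nthreads pieces. Writes the interior
 * cut points to cuts[] (room for nthreads - 1 entries) and returns how
 * many were written.
 */
static size_t split_range(uint64_t start, uint64_t end, unsigned nthreads,
                          uint64_t *cuts)
{
    assert(end > start && nthreads > 0);

    uint64_t step = (end - start) / nthreads;
    uint64_t prev = start;
    size_t n = 0;

    for (unsigned i = 1; i < nthreads; i++) {
        uint64_t c = align_cut(start + i * step, prev, end);

        if (c <= prev || c >= end)
            continue;  /* no room left for another distinct cut */
        cuts[n++] = c;
        prev = c;
    }
    return n;
}
```

Note the tradeoff: preferring alignment makes the pieces uneven. A 3G
range starting at 0 split four ways ends up cut at 1G, 2G and 2304M
rather than at even 768M steps.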

> size.  If cuts in per-thread ranges are an issue, I *think* userspace
> has the same problem?

Userspace should know what it has done, if it is using hugetlbfs it
knows how big the pages are.

> > The results you got of only 1.2x improvement don't seem so
> > compelling.
> 
> I know you understand, but just to be clear for everyone, that 1.2x is
> the overall improvement to qemu init from multithreaded pinning alone
> when prefaulting is done in both base and test.

Yes

> Pinning itself, the only thing being optimized, improves 8.5x in that
> experiment, bringing the time from 1.8 seconds to .2 seconds.  That's a
> significant savings IMHO

And here is where I suspect we'd get similar results from folios,
based on the unpin performance uplift we already saw.

As long as PUP doesn't have to COW, its work is largely proportional to
the number of struct pages it processes, so we should be expecting an
upper limit of 512x gains on the PUP alone with foliation. This is in
line with what we saw with the prior unpin work.
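
For anyone following along, the 512x figure can be reproduced with a
little arithmetic, assuming 4K base pages and 2M PMD-sized huge pages
(the x86-64 values, which are an assumption here). The helper name
pin_units() is made up for illustration.

```c
#include <stdint.h>

/*
 * Number of per-unit steps needed to cover `bytes` when each step
 * handles `unit` bytes (one struct page, or one folio).
 */
static uint64_t pin_units(uint64_t bytes, uint64_t unit)
{
    return (bytes + unit - 1) / unit;
}
```

For a 1T guest this gives 268435456 steps at 4K per struct page versus
524288 steps at 2M per folio, a ratio of 512.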

The other optimization that would help a lot here is to use
pin_user_pages_fast(), something like:

  if (current->mm != remote_mm) {
     mmap_read_lock(remote_mm);
     pin_user_pages_remote(remote_mm, ..);
     mmap_read_unlock(remote_mm);
  } else {
     pin_user_pages_fast(..);
  }

But you can't get that gain with kernel-side parallelization, right?

(I haven't dug into whether gup_fast relies on current due to IPIs or not,
maybe pin_user_pages_remote_fast can exist?)

> But, I'm skeptical that singlethreaded optimization alone will remove
> the bottleneck with the enormous memory sizes we use.  

I think you can get the 1.2x at least.

> scaling up the times from the unpin results with both optimizations (the
> IB specific one too, which would need to be done for vfio), 

Oh, I did the IB one already in iommufd...

> a 1T guest would still take almost 2 seconds to pin/unpin.

Single threaded? Isn't that excellent and completely dwarfed by the
populate overhead?

> If people feel strongly that we should try optimizing other ways first,
> ok, but I think these are complementary approaches.  I'm coming at this
> problem this way because this is fundamentally a memory-intensive
> operation where more bandwidth can help, and there are other kernel
> paths we and others want this infrastructure for.

Here I would like to see at least an apples-to-apples comparison before
we take on this complexity: full user threading vs kernel auto threading.

Saying multithreaded kernel gets 8x over single threaded userspace is
nice, but sort of irrelevant because we can have multithreaded
userspace, right?
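
A rough sketch of the userspace side of that comparison, assuming
pthreads and a caller-supplied callback. The names map_fn,
map_parallel() and tally_map() are hypothetical; in a real test the
callback would wrap whatever maps and pins one range (for VFIO type1,
a VFIO_IOMMU_MAP_DMA ioctl), while tally_map() below only counts bytes
so the splitter can be exercised without a container. Each chunk is a
multiple of the page size userspace chose for its backing memory, so
no cut lands inside a huge page.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: maps and pins one range, returns 0 on success. */
typedef int (*map_fn)(uint64_t iova, uint64_t vaddr, uint64_t size,
                      void *ctx);

struct map_job {
    uint64_t iova, vaddr, size;
    map_fn fn;
    void *ctx;
    int ret;
};

static void *map_worker(void *arg)
{
    struct map_job *j = arg;

    j->ret = j->fn(j->iova, j->vaddr, j->size, j->ctx);
    return NULL;
}

/*
 * Map [vaddr, vaddr + size) at iova using up to nthreads threads.
 * Chunks are multiples of page_size; the last thread takes the rest.
 */
static int map_parallel(uint64_t iova, uint64_t vaddr, uint64_t size,
                        uint64_t page_size, unsigned nthreads,
                        map_fn fn, void *ctx)
{
    enum { MAX_THREADS = 64 };
    pthread_t tid[MAX_THREADS];
    struct map_job job[MAX_THREADS];
    unsigned n = 0;
    int ret = 0;

    if (nthreads == 0 || nthreads > MAX_THREADS || page_size == 0)
        return -1;

    uint64_t chunk = size / nthreads / page_size * page_size;
    if (chunk == 0) {  /* too small to split on page boundaries */
        chunk = size;
        nthreads = 1;
    }

    uint64_t off = 0;
    for (unsigned i = 0; i < nthreads && off < size; i++) {
        uint64_t len = (i == nthreads - 1) ? size - off : chunk;

        job[n] = (struct map_job){ iova + off, vaddr + off, len,
                                   fn, ctx, 0 };
        if (pthread_create(&tid[n], NULL, map_worker, &job[n]) != 0) {
            ret = -1;
            break;
        }
        n++;
        off += len;
    }

    /* Wait for every started thread; report a failure if any occurred. */
    for (unsigned i = 0; i < n; i++) {
        pthread_join(tid[i], NULL);
        if (job[i].ret)
            ret = job[i].ret;
    }
    return ret;
}

/* Stand-in callback: only tallies what it was asked to map. */
struct tally {
    pthread_mutex_t lock;
    uint64_t bytes;
    unsigned calls;
};

static int tally_map(uint64_t iova, uint64_t vaddr, uint64_t size,
                     void *ctx)
{
    struct tally *t = ctx;

    (void)iova;
    (void)vaddr;
    pthread_mutex_lock(&t->lock);
    t->bytes += size;
    t->calls++;
    pthread_mutex_unlock(&t->lock);
    return 0;
}
```

With a 5G range, 1G pages and four threads, three threads get 1G each
and the last gets the remaining 2G.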

Jason