On Mon, Jan 10, 2022 at 05:27:25PM -0500, Daniel Jordan wrote:
> > > Pinning itself, the only thing being optimized, improves 8.5x in that
> > > experiment, bringing the time from 1.8 seconds to .2 seconds. That's a
> > > significant savings, IMHO.
> >
> > And here is where I suspect we'd get similar results from folios,
> > based on the unpin performance uplift we already saw.
> >
> > As long as PUP doesn't have to COW, its work is largely proportional to
> > the number of struct pages it processes, so we should be expecting an
> > upper limit of 512x gains on the PUP alone with foliation (a 2M folio
> > replaces 512 4K struct pages).
> >
> > This is in line with what we saw with the prior unpin work.
>
> "in line with what we saw" Not following. The unpin work had two
> optimizations, I think, 4.5x and 3.5x, which together give 16x. Why is
> that in line with the potential gains from pup?

It is the same basic issue: doing extra work, dirtying extra memory..

> > and completely dwarfed by the populate overhead?
>
> Well yes, but here we all are optimizing gup anyway :-)

Well, I assume because we can user-thread the populate, so I'd
user-thread the gup too..

> One of my assumptions was that doing this in the kernel would benefit
> all vfio users, avoiding duplicating the same sort of multithreading
> logic across applications, including ones that didn't prefault.

I don't know of other users with memory sizes huge enough for this to
matter, besides a VMM..

> My assumption going into this series was that multithreading VFIO page
> pinning in the kernel was a viable way forward given the positive
> feedback I got from the VFIO maintainer last time I posted this, which
> was admittedly a while ago, and I've since been focused on the other
> parts of this series rather than what's been happening in the mm lately.
> Anyway, your arguments are reasonable, so I'll go take a look at some of
> these optimizations and see where I get.

Well, it is not *unreasonable*, it just doesn't seem compelling to me
yet. Especially since we are not anywhere close to the limit of
single-threaded performance.

Aside from GUP, the whole way we transfer the physical pages into the
iommu is just begging for optimization, e.g. Matthew's struct phyr
needs to be an input and output at the iommu layer to make this code
really happy.

How much time do we burn messing around in redundant iommu-layer
locking because everything is done page-at-a-time?

Jason
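
To make "user-thread the populate" concrete, here is a minimal
userspace sketch. The thread count, the chunking, and the use of
MADV_POPULATE_WRITE (madvise, Linux 5.14+) are all illustrative
choices, not anything proposed in this thread:

	#define _GNU_SOURCE	/* for madvise() */
	#include <pthread.h>
	#include <sys/mman.h>

	#ifndef MADV_POPULATE_WRITE
	#define MADV_POPULATE_WRITE 23	/* madvise op, Linux 5.14+ */
	#endif

	#define NTHREADS 8

	struct chunk {
		char *base;
		size_t len;
	};

	static void *populate_chunk(void *arg)
	{
		struct chunk *c = arg;

		/* Fault in writable pages without touching them from
		 * userspace, spreading the population cost across CPUs.
		 */
		madvise(c->base, c->len, MADV_POPULATE_WRITE);
		return NULL;
	}

	int main(void)
	{
		size_t total = 8UL << 30;	/* say, 8G of guest memory */
		size_t per = total / NTHREADS;
		char *mem = mmap(NULL, total, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		struct chunk chunks[NTHREADS];
		pthread_t tids[NTHREADS];

		if (mem == MAP_FAILED)
			return 1;

		for (int i = 0; i < NTHREADS; i++) {
			chunks[i].base = mem + (size_t)i * per;
			chunks[i].len = per;
			pthread_create(&tids[i], NULL, populate_chunk,
				       &chunks[i]);
		}
		for (int i = 0; i < NTHREADS; i++)
			pthread_join(tids[i], NULL);
		return 0;
	}

Presumably a VMM could apply the same pattern to the gup side by
splitting its VFIO_IOMMU_MAP_DMA calls across worker threads in the
same way.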
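
And for reference, the phyr idea is, roughly, describing contiguous
physical ranges instead of lists of struct page pointers. What follows
is a paraphrase of Matthew's proposal, not a merged API; the field
names and the iommu entry point are guesses for illustration:

	/* Rough paraphrase of the proposed struct phyr: one entry
	 * describes a contiguous physical range, so 2M of contiguous
	 * memory is a single entry rather than 512 struct page
	 * pointers.  Names are illustrative.
	 */
	struct phyr {
		phys_addr_t	start;	/* physical address of the range */
		size_t		len;	/* length of the range in bytes */
	};

	/* A hypothetical iommu mapping call taking ranges in and out
	 * could then take the iommu-layer locks and walk the io page
	 * tables once per range instead of once per page.
	 */
	int iommu_map_phyrs(struct iommu_domain *domain, unsigned long iova,
			    const struct phyr *phyrs, unsigned int nr_phyrs,
			    int prot);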