On Wed, Jul 20, 2022 at 10:47:25AM -0600, Alex Williamson wrote:
> As I understand it more though, does the API really fit the expected use
> cases? As presented here and used in the following patch, we map every
> section of the user bitmap, present that section to the device driver
> and ask them to mark dirty bits and atomically clear their internal
> tracker for that sub-range. This seems really inefficient.

I think until someone sits down and benchmarks it, it will be hard to
tell what the right trade-offs are.

pin_user_pages_fast() is fairly slow, so calling it once per 4k of user
VA is definitely worse than calling it once for 2M of user VA (see the
first sketch below).

On the other hand, very big guests are likely to have 64GB regions
where there are no dirties. But sweeping the 64GB in the first place is
probably going to be slow, so saving a little bit of pin_user_pages()
time may not matter much.

Then again, cases like vIOMMU will have huge swaths of IOVA where there
is just nothing mapped, so perhaps sweeping for the system IOMMU will
be fast and the pin_user_pages() overhead will be troublesome.

Still, another viewpoint is that returning a bitmap at all is really
inefficient if we expect high sparsity, and we should return dirty PFNs
instead; a simple put_user() may be sufficient. It may make sense to
have a second API that works like this, which userspace could call
during stop_copy on the assumption of high sparsity (see the second
sketch below).

We just don't have enough ecosystem going right now to sit down and do
all this benchmarking work, so I was happy with the simplistic
implementation here. It is only 160 lines; if we toss it later based on
benchmarks, no biggie.

The important thing is that this abstraction exists at all and that
drivers don't do their own thing.
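To make the batching trade-off concrete, here is a rough, untested
sketch of the kind of loop I mean: pin one 4k page of the user bitmap,
hand the corresponding IOVA sub-range to the driver to mark dirties and
atomically clear its tracker, repeat. All the names here
(sweep_dirty_bitmap, my_tracker, read_and_clear_dirty) are made up for
illustration; this is not the code in the patch:

struct my_tracker {
	/*
	 * Made-up driver hook: set dirty bits for [iova, iova + len) in
	 * the pinned bitmap page and atomically clear the device's
	 * internal tracker for that sub-range.
	 */
	int (*read_and_clear_dirty)(struct my_tracker *tracker,
				    unsigned long iova, size_t len,
				    struct page *page);
};

static int sweep_dirty_bitmap(struct my_tracker *tracker,
			      unsigned long iova, size_t length,
			      unsigned long bitmap_uva)
{
	/* IOVA covered by one 4k bitmap page, at 1 bit per 4k page */
	const size_t span = PAGE_SIZE * BITS_PER_BYTE * PAGE_SIZE;
	size_t done = 0;

	while (done < length) {
		struct page *page;
		int rc;

		/*
		 * One pin_user_pages_fast() per 4k of bitmap VA; pinning
		 * more pages per call would amortize this fixed cost.
		 */
		rc = pin_user_pages_fast(bitmap_uva + done / span * PAGE_SIZE,
					 1, FOLL_WRITE, &page);
		if (rc != 1)
			return rc < 0 ? rc : -EFAULT;

		rc = tracker->read_and_clear_dirty(tracker, iova + done,
						   min(span, length - done),
						   page);
		unpin_user_page(page);
		if (rc)
			return rc;
		done += span;
	}
	return 0;
}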
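And for the sparse case, the second API could be nothing more than a
loop copying out each dirty IOVA. Again a completely made-up sketch
(report_dirty_pfns and its parameters are invented), not a uAPI
proposal:

static int report_dirty_pfns(unsigned long *dirty, unsigned long npages,
			     u64 base_iova, u64 __user *uptr,
			     size_t max, size_t *nout)
{
	unsigned long idx;
	size_t n = 0;

	for_each_set_bit(idx, dirty, npages) {
		if (n == max)
			return -ENOSPC;
		/* one put_user() per dirty page; cheap when sparse */
		if (put_user(base_iova + (u64)idx * PAGE_SIZE, uptr + n))
			return -EFAULT;
		clear_bit(idx, dirty);
		n++;
	}
	*nout = n;
	return 0;
}

Jason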