Re: [PATCH v5 08/11] vduse: Implement an MMU-based IOMMU driver


On 2021/3/26 2:56 PM, Yongji Xie wrote:
On Fri, Mar 26, 2021 at 2:16 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:

On 2021/3/26 1:14 PM, Yongji Xie wrote:

+     }
+     map->bounce_page = page;
+
+     /* paired with vduse_domain_map_page() */
+     smp_mb();

So this is suspicious. It's better to explain it like: we need to make
sure A is done after B.

OK. I see. It's used to protect this pattern:

      vduse_domain_alloc_bounce_page:          vduse_domain_map_page:
      write map->bounce_page                   write map->orig_phys
      mb()                                     mb()
      read map->orig_phys                      read map->bounce_page

Make sure there will always be a path to do bouncing.
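
For reference, a minimal sketch of that pairing (the struct layout,
INVALID_PHYS_ADDR and do_bounce() below are illustrative stand-ins, not
the patch's actual code):

#include <linux/mm.h>
#include <asm/barrier.h>

#define INVALID_PHYS_ADDR       (~(phys_addr_t)0)

struct vduse_bounce_map {
        struct page *bounce_page;
        phys_addr_t orig_phys;
};

/* illustrative helper: copy the original buffer into the bounce page */
static void do_bounce(phys_addr_t orig, struct page *bounce_page)
{
        /* memcpy between the two pages, elided */
}

/* vduse_domain_alloc_bounce_page() side (page fault path) */
static void alloc_bounce_page_side(struct vduse_bounce_map *map,
                                   struct page *page)
{
        map->bounce_page = page;                  /* write bounce_page */
        smp_mb();                                 /* pairs with smp_mb() below */
        if (map->orig_phys != INVALID_PHYS_ADDR)  /* read orig_phys */
                do_bounce(map->orig_phys, page);
}

/* vduse_domain_map_page() side (dma map path) */
static void map_page_side(struct vduse_bounce_map *map, phys_addr_t pa)
{
        map->orig_phys = pa;                      /* write orig_phys */
        smp_mb();                                 /* pairs with smp_mb() above */
        if (map->bounce_page)                     /* read bounce_page */
                do_bounce(pa, map->bounce_page);
}

With a full barrier on both sides, the two reads cannot both observe the
pre-update values, so at least one side ends up doing the bounce; both may
do it, which this sketch simply tolerates.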

Ok.


And it looks to me that the iotlb_lock is sufficient to do the
synchronization here. E.g. is there any reason that you don't take it in
vduse_domain_map_bounce_page()?

Yes, we can. But the performance in multi-queue cases will go down if
we use iotlb_lock on this critical path.

And what's more, is there any way to avoid holding the spinlock during
bouncing?

Looks like we can't. In the case that multiple page faults happen on
the same page, we should make sure the bouncing is done before any
page fault handler returns.
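
To illustrate, a rough sketch of what holding the lock across the copy
would look like in the fault path (the domain/map layout and do_bounce()
are illustrative stand-ins, not the patch's code):

#include <linux/mm.h>
#include <linux/spinlock.h>

struct vduse_bounce_map {
        struct page *bounce_page;
        phys_addr_t orig_phys;
};

struct vduse_iova_domain {
        spinlock_t iotlb_lock;
        /* ... */
};

/* illustrative helper: copy the original buffer into the bounce page */
static void do_bounce(phys_addr_t orig, struct page *bounce_page)
{
        /* memcpy between the two pages, elided */
}

static struct page *get_bounce_page(struct vduse_iova_domain *domain,
                                    struct vduse_bounce_map *map)
{
        struct page *page = NULL;

        spin_lock(&domain->iotlb_lock);
        if (!map->bounce_page) {
                page = alloc_page(GFP_ATOMIC);
                if (!page)
                        goto out;
                /*
                 * The copy is done under the lock, so a concurrent fault
                 * on the same page either does the copy itself or waits
                 * here and then finds a fully bounced page.
                 */
                do_bounce(map->orig_phys, page);
                map->bounce_page = page;
        }
        page = map->bounce_page;
out:
        spin_unlock(&domain->iotlb_lock);
        return page;
}

A second fault on the same page then spins on iotlb_lock until the copy has
finished, so no fault handler can return with an unbounced page, at the
price of doing the copy with the lock held.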

So it looks to me that all this extra complexity comes from the fact that
bounce_page and orig_phys are set in different places, so we need to do
the bouncing in two places.

I wonder how much we can gain from the "lazy" bouncing in the page fault.
The buffer mapped via dma_ops from the virtio driver is expected to be
accessed by userspace soon. It looks to me that we could do all of that
during dma_map(); then things would be greatly simplified.

If so, we need to allocate lots of pages from the pool reserved for
atomic memory allocation requests.

This should be fine; a lot of drivers try to allocate pages in atomic
context. The point is to simplify the code to make it easy to determine
its correctness, so we can add optimizations on top simply by benchmarking
the difference.

OK. I will do it this way in the next version.
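
As a rough sketch of that direction (not the actual code of the next
version; the names other than bounce_page/orig_phys are made up), bouncing
eagerly in the map path could look like:

#include <linux/dma-direction.h>
#include <linux/mm.h>

struct vduse_bounce_map {
        struct page *bounce_page;
        phys_addr_t orig_phys;
};

/* illustrative helper: copy the original buffer into the bounce page */
static void do_bounce(phys_addr_t orig, struct page *bounce_page)
{
        /* memcpy between the two pages, elided */
}

static int bounce_on_map(struct vduse_bounce_map *map, phys_addr_t orig_phys,
                         enum dma_data_direction dir)
{
        /* map paths may run in atomic context, hence GFP_ATOMIC */
        struct page *page = alloc_page(GFP_ATOMIC);

        if (!page)
                return -ENOMEM;

        map->orig_phys = orig_phys;
        if (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)
                do_bounce(orig_phys, page);  /* copy before the page is visible */
        map->bounce_page = page;

        return 0;
}

With the copy done before bounce_page is installed, the fault handler would
only have to install an already-populated page, and the barrier/locking
question above largely disappears.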

E.g. we have several places that access orig_phys:

1) map_page(), write
2) unmap_page(), write
3) page fault handler, read

It's not clear to me how they are synchronized. Or, if it is synchronized
implicitly (via the iova allocator?), we'd better document it.

Yes.

Or simply use a spinlock (which is the preferable way I'd like to go). We
probably don't need to worry too much about the cost of the spinlock since
the iova allocator uses one heavily.

Actually, the iova allocator implements a per-CPU cache to optimize that.

Thanks,
Yongji


Right, but having had a quick glance, I guess what you meant is that there
is usually no lock contention except during CPU hot-plug. This can work,
but the problem is that such synchronization depends on the internal
implementation of the IOVA allocator, which is kind of fragile. I still
think we should do it on our own.

I might be missing something, but it looks like we don't need any
synchronization if the page fault handler is removed as you suggested. We
would not access the same orig_phys concurrently (in map_page() and
unmap_page()) unless we freed the iova before accessing it.
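
A tiny sketch of that ownership argument (helper names and comments are
illustrative):

#include <linux/mm.h>

#define INVALID_PHYS_ADDR       (~(phys_addr_t)0)

struct vduse_bounce_map {
        struct page *bounce_page;
        phys_addr_t orig_phys;
};

/* map_page() side: runs right after the iova is allocated, so the entry
 * indexed by that iova cannot be reached by anyone else yet. */
static void set_orig_phys(struct vduse_bounce_map *map, phys_addr_t pa)
{
        map->orig_phys = pa;
}

/* unmap_page() side: runs before the iova is returned to the allocator,
 * so the entry is still exclusively owned here. */
static void clear_orig_phys(struct vduse_bounce_map *map)
{
        map->orig_phys = INVALID_PHYS_ADDR;
        /* only after this returns is the iova freed and the entry reusable */
}

As long as an entry is only touched between iova allocation and iova free,
map_page() and unmap_page() can never race on the same orig_phys.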

Thanks,
Yongji


You're right. I overestimated the complexity required by the
synchronization.

Thanks
