Re: [PATCH vhost v2 4/7] virtio_net: big mode support premapped

Jason Wang <jasowang@xxxxxxxxxx> · Wed, 24 Apr 2024 11:50:44 +0800

On Wed, Apr 24, 2024 at 10:58 AM Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 24 Apr 2024 10:45:49 +0800, Jason Wang <jasowang@xxxxxxxxxx> wrote:
> > On Wed, Apr 24, 2024 at 10:42 AM Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, 24 Apr 2024 10:34:56 +0800, Jason Wang <jasowang@xxxxxxxxxx> wrote:
> > > > On Wed, Apr 24, 2024 at 9:10 AM Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, 24 Apr 2024 08:43:21 +0800, Jason Wang <jasowang@xxxxxxxxxx> wrote:
> > > > > > On Tue, Apr 23, 2024 at 8:38 PM Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Tue, 23 Apr 2024 12:36:42 +0800, Jason Wang <jasowang@xxxxxxxxxx> wrote:
> > > > > > > > On Mon, Apr 22, 2024 at 3:24 PM Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > In big mode, pre-mapping DMA is beneficial because if the pages are not
> > > > > > > > > used, we can reuse them without needing to unmap and remap.
> > > > > > > > >
> > > > > > > > > We require space to store the DMA address. I use the page.dma_addr to
> > > > > > > > > store the DMA address from the pp structure inside the page.
> > > > > > > > >
> > > > > > > > > Every page retrieved from get_a_page() is mapped, and its DMA address is
> > > > > > > > > stored in page.dma_addr. When a page is returned to the chain, we check
> > > > > > > > > the DMA status; if it is not mapped (potentially having been unmapped),
> > > > > > > > > we remap it before returning it to the chain.
> > > > > > > > >
> > > > > > > > > Based on the following points, we do not use page pool to manage these
> > > > > > > > > pages:
> > > > > > > > >
> > > > > > > > > 1. virtio-net uses the DMA APIs wrapped by virtio core. Therefore,
> > > > > > > > >    we can only prevent the page pool from performing DMA operations, and
> > > > > > > > >    let the driver perform DMA operations on the allocated pages.
> > > > > > > > > 2. But when the page pool releases the page, we have no chance to
> > > > > > > > >    execute dma unmap.
> > > > > > > > > 3. A solution to #2 is to execute dma unmap every time before putting
> > > > > > > > >    the page back to the page pool. (This is actually a waste, we don't
> > > > > > > > >    execute unmap so frequently.)
> > > > > > > > > 4. But there is another problem, we still need to use page.dma_addr to
> > > > > > > > >    save the dma address. Using page.dma_addr while using page pool is
> > > > > > > > >    unsafe behavior.
> > > > > > > > >
> > > > > > > > > More:
> > > > > > > > >     https://lore.kernel.org/all/CACGkMEu=Aok9z2imB_c5qVuujSh=vjj1kx12fy9N7hqyi+M5Ow@xxxxxxxxxxxxxx/
> > > > > > > > >
> > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@xxxxxxxxxxxxxxxxx>
> > > > > > > > > ---
> > > > > > > > >  drivers/net/virtio_net.c | 123 ++++++++++++++++++++++++++++++++++-----
> > > > > > > > >  1 file changed, 108 insertions(+), 15 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > > > > index 2c7a67ad4789..d4f5e65b247e 100644
> > > > > > > > > --- a/drivers/net/virtio_net.c
> > > > > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > > > > @@ -439,6 +439,81 @@ skb_vnet_common_hdr(struct sk_buff *skb)
> > > > > > > > >         return (struct virtio_net_common_hdr *)skb->cb;
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static void sg_fill_dma(struct scatterlist *sg, dma_addr_t addr, u32 len)
> > > > > > > > > +{
> > > > > > > > > +       sg->dma_address = addr;
> > > > > > > > > +       sg->length = len;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +/* For pages submitted to the ring, we need to record its dma for unmap.
> > > > > > > > > + * Here, we use the page.dma_addr and page.pp_magic to store the dma
> > > > > > > > > + * address.
> > > > > > > > > + */
> > > > > > > > > +static void page_chain_set_dma(struct page *p, dma_addr_t addr)
> > > > > > > > > +{
> > > > > > > > > +       if (sizeof(dma_addr_t) > sizeof(unsigned long)) {
> > > > > > > >
> > > > > > > > Need a macro like PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA.
> > > > > > > >
> > > > > > > > > +               p->dma_addr = lower_32_bits(addr);
> > > > > > > > > +               p->pp_magic = upper_32_bits(addr);
> > > > > > > >
> > > > > > > > And this uses three fields on page_pool which I'm not sure the other
> > > > > > > > maintainers are happy with. For example, re-using pp_magic might be
> > > > > > > > dangerous. See c07aea3ef4d40 ("mm: add a signature in struct page").
> > > > > > > >
> > > > > > > > I think a safer way is to reuse page pool, for example by introducing
> > > > > > > > a new flag with dma callbacks?
> > > > > > >
> > > > > > > If we use page pool, how can we chain the pages allocated for a packet?
> > > > > >
> > > > > > I'm not sure I get this, it is chained via the descriptor flag.
> > > > >
> > > > >
> > > > > In the big mode, we will commit many pages to the virtio core by
> > > > > virtqueue_add_inbuf().
> > > > >
> > > > > By virtqueue_get_buf_ctx(), we got the data. That is the first page.
> > > > > Other pages are chained by the "private".
> > > > >
> > > > > If we use the page pool, how can we chain the pages?
> > > > > After virtqueue_add_inbuf(), we need to get the pages to fill the skb.
> > > >
> > > > Right, technically it could be solved by providing helpers in the
> > > > virtio core, but considering it's an optimization for big mode which
> > > > is not popular, it's not worth the bother.
> > > >
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > You know the "private" can not be used.
> > > > > > >
> > > > > > >
> > > > > > > If the pp struct inside the page is not safe, how about:
> > > > > > >
> > > > > > >                 struct {        /* Page cache and anonymous pages */
> > > > > > >                         /**
> > > > > > >                          * @lru: Pageout list, eg. active_list protected by
> > > > > > >                          * lruvec->lru_lock.  Sometimes used as a generic list
> > > > > > >                          * by the page owner.
> > > > > > >                          */
> > > > > > >                         union {
> > > > > > >                                 struct list_head lru;
> > > > > > >
> > > > > > >                                 /* Or, for the Unevictable "LRU list" slot */
> > > > > > >                                 struct {
> > > > > > >                                         /* Always even, to negate PageTail */
> > > > > > >                                         void *__filler;
> > > > > > >                                         /* Count page's or folio's mlocks */
> > > > > > >                                         unsigned int mlock_count;
> > > > > > >                                 };
> > > > > > >
> > > > > > >                                 /* Or, free page */
> > > > > > >                                 struct list_head buddy_list;
> > > > > > >                                 struct list_head pcp_list;
> > > > > > >                         };
> > > > > > >                         /* See page-flags.h for PAGE_MAPPING_FLAGS */
> > > > > > >                         struct address_space *mapping;
> > > > > > >                         union {
> > > > > > >                                 pgoff_t index;          /* Our offset within mapping. */
> > > > > > >                                 unsigned long share;    /* share count for fsdax */
> > > > > > >                         };
> > > > > > >                         /**
> > > > > > >                          * @private: Mapping-private opaque data.
> > > > > > >                          * Usually used for buffer_heads if PagePrivate.
> > > > > > >                          * Used for swp_entry_t if PageSwapCache.
> > > > > > >                          * Indicates order in the buddy system if PageBuddy.
> > > > > > >                          */
> > > > > > >                         unsigned long private;
> > > > > > >                 };
> > > > > > >
> > > > > > > Or, we can map the private space of the page as a new structure.
> > > > > >
> > > > > > It could be a way. But such allocation might be huge if we are using
> > > > > > indirect descriptors or I may miss something.
> > > > >
> > > > > No, we only need to store the "chain next" and the dma as this patch set did.
> > > > > The size of the private space inside the page is 20 (32-bit) / 40 (64-bit) bytes.
> > > > > That is enough for us.
> > > > >
> > > > > If you are worried about changing the pp structure, we can keep using the
> > > > > "private" as the original code does and use the "struct list_head lru" to store the dma.
> > > >
> > > > This looks even worse, as it uses fields belonging to the different
> > > > structures in the union.
> > >
> > > I mean we do not use the elements from the pp structure inside the page,
> > > if we are worried about changes to the pp structure.
> > >
> > > I mean we use the "private" and "lru", which are in the same structure.
> > >
> > > I think this is a good way.
> > >
> > > Thanks.
> >
> > See this:
> >
> > https://lore.kernel.org/netdev/20210411114307.5087f958@carbon/
>
>
> I think that is because the page pool will share the page with
> the skbs.  I'm not entirely sure.
>
> In our case, virtio-net fully owns the page. After the page is referenced by skb,
> virtio-net no longer references the page. I don't think there is any problem
> here.

Well, in the rx path, though the page is allocated by the virtio-net,
unlike the page pool, those pages are not freed by virtio-net. So it
may leave things in the page structure which is problematic. I don't
think we can introduce a virtio-net specific hook for kfree_skb() in
this case. That's why I think leveraging the page pool is better.
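
To make the "leftover state" concern concrete, here is a rough userspace
sketch, not kernel code. It models a 32-bit arch with 64-bit DMA, where the
quoted patch stores upper_32_bits(addr) in pp_magic. The struct, the helper
names and the 0x40 signature value are assumptions for illustration; the
check mirrors the kind of "(pp_magic & ~0x3) == signature" test a free path
could apply to decide whether a page belongs to a page pool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit "unsigned long"; not the kernel type. */
typedef uint32_t model_ulong;

/* Assumed signature value for a 32-bit arch (illustration only). */
#define MODEL_PP_SIGNATURE ((model_ulong)0x40)

/* Hypothetical stand-in for the two struct page fields the patch reuses. */
struct pp_magic_model_page {
	model_ulong pp_magic;
	model_ulong dma_addr;
};

/* Same split as page_chain_set_dma() in the quoted patch. */
static void model_set_dma(struct pp_magic_model_page *p, uint64_t addr)
{
	p->dma_addr = (model_ulong)(addr & 0xffffffffu); /* lower 32 bits */
	p->pp_magic = (model_ulong)(addr >> 32);         /* upper 32 bits */
}

/* Reassemble the 64-bit address from the two 32-bit halves. */
static uint64_t model_get_dma(const struct pp_magic_model_page *p)
{
	return ((uint64_t)p->pp_magic << 32) | p->dma_addr;
}

/* Models a "is this a page pool page?" signature test on pp_magic. */
static bool model_looks_like_pp_page(const struct pp_magic_model_page *p)
{
	return (p->pp_magic & ~(model_ulong)0x3) == MODEL_PP_SIGNATURE;
}
```

In this model, a DMA address whose upper 32 bits happen to be 0x40 leaves a
pp_magic value that passes the signature test. If the driver hands the page
to an skb without clearing it, whoever frees the page later sees that stale
value.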

As for reusing the page pool, maybe we can reuse __pp_mapping_pad for
virtio-net specific use cases like chaining, and clear it in
page_pool_clear_pp_info(). And we need to make sure we don't break
things like TCP RX zerocopy, since at first glance mapping is aliased
with __pp_mapping_pad.
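
A minimal userspace sketch of the aliasing in question. The struct below is
a simplified, hypothetical model of the union in struct page (field names
and layout are assumptions, not the real definition), and the two helpers
are made up for illustration. It only shows that a chain pointer stored in
the pad is visible through "mapping" until it is cleared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical model of the union inside struct page. */
struct alias_model_page {
	unsigned long flags;
	union {
		struct {                      /* page cache / anon view */
			void *lru_next;
			void *lru_prev;
			void *mapping;
		};
		struct {                      /* page pool view */
			uintptr_t pp_magic;
			void *pp;
			uintptr_t pp_mapping_pad; /* overlays "mapping" above */
		};
	};
};

/* Hypothetical helper: use the pad to chain to the next page. */
static void model_chain_set_next(struct alias_model_page *p,
				 struct alias_model_page *next)
{
	p->pp_mapping_pad = (uintptr_t)next;
}

/* Hypothetical helper: clear page pool state, including the pad,
 * before the page is handed to anyone who interprets "mapping".
 */
static void model_clear_pp_info(struct alias_model_page *p)
{
	p->pp_magic = 0;
	p->pp = NULL;
	p->pp_mapping_pad = 0;
}
```

If the clear step is skipped, code that reads "mapping" sees the leftover
chain pointer. That is why any user of mapping on these pages needs to be
audited before the pad is reused.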

>
> The key is that whoever owns the page can use the page's private space (20/40 bytes).
>
> Is that right?

I'm not saying we can't investigate in this direction. But it needs
more comments from mm guys and we need to evaluate the price we pay
for that.

The motivation is to drop the fallback code used when premapping is not
supported, to improve the maintainability of the code and ease the
AF_XDP support for virtio-net. But it turns out not to be easy.

Considering the rx fallback code we need to maintain is not that large,
maybe we can leave it as is and, for example, forbid AF_XDP in big mode.

Thanks

>
> Thanks.
>
>
> >
> > Thanks
> >
>