On Tue, Feb 13, 2024 at 5:28 AM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
>
> On 12/18/23 02:40, Mina Almasry wrote:
> > Convert netmem to be a union of struct page and struct netmem. Overload
> > the LSB of struct netmem* to indicate that it's a net_iov, otherwise
> > it's a page.
> >
> > Currently these entries in struct page are rented by the page_pool and
> > used exclusively by the net stack:
> >
> > struct {
> >         unsigned long pp_magic;
> >         struct page_pool *pp;
> >         unsigned long _pp_mapping_pad;
> >         unsigned long dma_addr;
> >         atomic_long_t pp_ref_count;
> > };
> >
> > Mirror these (and only these) entries into struct net_iov and implement
> > netmem helpers that can access these common fields regardless of
> > whether the underlying type is page or net_iov.
> > Implement checks for net_iov in netmem helpers which delegate to mm
> > APIs, to ensure net_iov are never passed to the mm stack.
> >
> > Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> >
> > ---
> >
> > RFCv5:
> > - Use netmem instead of page* with LSB set.
> > - Use pp_ref_count for refcounting net_iov.
> > - Removed many of the custom checks for netmem.
> >
> > v1:
> > - Disable fragmentation support for iov properly.
> > - fix napi_pp_put_page() path (Yunsheng).
> > - Use pp_frag_count for devmem refcounting.
> >
> > ---
> >  include/net/netmem.h            | 145 ++++++++++++++++++++++++++++++--
> >  include/net/page_pool/helpers.h |  25 +++---
> >  net/core/page_pool.c            |  26 +++---
> >  net/core/skbuff.c               |   9 +-
> >  4 files changed, 164 insertions(+), 41 deletions(-)
> >
> > diff --git a/include/net/netmem.h b/include/net/netmem.h
> > index 31f338f19da0..7557aecc0f78 100644
> > --- a/include/net/netmem.h
> > +++ b/include/net/netmem.h
> > @@ -12,11 +12,47 @@
> >
> >  /* net_iov */
> >
> > +DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers);
> > +
> > +/* We overload the LSB of the struct page pointer to indicate whether it's
> > + * a page or net_iov.
> > + */
> > +#define NET_IOV         0x01UL
> > +
> >  struct net_iov {
> > +        unsigned long __unused_padding;
> > +        unsigned long pp_magic;
> > +        struct page_pool *pp;
> >          struct dmabuf_genpool_chunk_owner *owner;
> >          unsigned long dma_addr;
> > +        atomic_long_t pp_ref_count;
> >  };
>
> I wonder if it would be better to extract a common sub-struct
> used in struct page, struct_group_tagged can help to avoid
> touching old code:
>
> struct page {
>         unsigned long flags;
>         union {
>                 ...
>                 struct_group_tagged(<struct_name>, ...,
>                         /**
>                          * @pp_magic: magic value to avoid recycling non
>                          * page_pool allocated pages.
>                          */
>                         unsigned long pp_magic;
>                         struct page_pool *pp;
>                         unsigned long _pp_mapping_pad;
>                         unsigned long dma_addr;
>                         atomic_long_t pp_ref_count;
>                 );
>         };
> }
>
> struct net_iov {
>         unsigned long pad;
>         struct <struct_name> p;
> };
>
>
> A bit of a churn with the padding and nesting net_iov but looks
> sturdier. No duplication, and you can just check positions of the
> structure instead of per-field NET_IOV_ASSERT_OFFSET, which you
> have to not forget to update e.g. when adding a new field. Also,

Yes, this is nicer. If possible I'll punt it to a minor cleanup as a
follow-up change. Logistically, I think it's better if this series
doesn't need to touch code outside of net/.

> with the change __netmem_clear_lsb can return a pointer to that
> structure, casting struct net_iov when it's a page is a bit iffy.
>
> And the next question would be whether it'd be a good idea to encode
> iov vs page not by setting a bit but via one of the fields in the
> structure, maybe pp_magic.
>

I will push back against this, for 2 reasons:

1. I think pp_magic's first 2 bits (and maybe more) are used by mm
code, so extending the usage of pp_magic in this series is a bit iffy
and I would like to avoid it. I just don't want to touch the semantics
of struct page if I don't have to.

2. I think this will be a measurable perf regression. Currently we can
tell whether a pointer is a page or a net_iov without dereferencing it
and dirtying the cache-line; encoding the type in pp_magic would force
us to dereference the pointer in places where we don't need to. I had
an earlier version of this code that required a dereference to tell if
a page was devmem, and Eric pointed out to me that it was a perf
regression.

I also don't see any upside to using pp_magic, other than maybe making
the code slightly more readable.
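To make point 2 concrete, this is roughly what the LSB encoding buys
us. A simplified sketch of the helpers (the versions in the patch may
differ in detail, e.g. they may also check the page_pool_mem_providers
static key declared above):

static inline bool netmem_is_net_iov(const struct netmem *netmem)
{
	/* Pure pointer-bit test: the type is known without loading
	 * anything from the backing struct, so its cache-line is not
	 * touched at all.
	 */
	return (unsigned long)netmem & NET_IOV;
}

static inline struct net_iov *__netmem_clear_lsb(struct netmem *netmem)
{
	/* Strip the tag bit to recover the real pointer. */
	return (struct net_iov *)((unsigned long)netmem & ~NET_IOV);
}

An encoding based on pp_magic would instead look something like the
hypothetical helper below, which has to dereference the pointer just
to learn the type:

#define NET_IOV_PP_MAGIC	0x04UL	/* hypothetical bit; the low pp_magic bits belong to mm */

static inline bool netmem_is_net_iov_via_pp_magic(const struct netmem *netmem)
{
	/* Reads pp_magic, i.e. pulls the backing struct's cache-line
	 * into the cache on every type check.
	 */
	return netmem->niov.pp_magic & NET_IOV_PP_MAGIC;
}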
> With that said I'm a bit concerned about the net_iov size. If each
> represents 4096 bytes and you're registering 10MB, then you need
> 30 pages worth of memory just for the iov array. Makes kvmalloc
> a must even for relatively small sizes.
>

I think this is an age-old challenge with pages: roughly 1.6% of a
machine's memory is 'wasted' because a struct page needs to be
allocated for every PAGE_SIZE region. We're running into the same
issue here; if we want to refer to PAGE_SIZE regions of memory, we
need to allocate some reference for each of them. Note that net_iov
can be relatively easily extended to support order-N pages.

Also note that in the devmem TCP use case it's not really an issue:
the minor increase in memory utilization is more than offset by the
savings in memory bandwidth compared to using host memory as a bounce
buffer. All in all, I vote that this is something that can be tuned or
improved in the future if someone finds the extra memory usage a
hurdle to using devmem TCP or this net_iov infra.
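For reference, the arithmetic behind both numbers above, assuming a 4K
PAGE_SIZE, the usual 64-byte struct page, and the six-word (~48 bytes
on 64-bit) struct net_iov from this patch:

  struct page overhead:  64 B per 4096 B        ~= 1.6% of memory
  registering 10MB:      10 MB / 4 KB            = 2560 net_iovs
                         2560 * 48 B  = ~120 KB ~= 30 4K pages

so the 30-page figure above checks out.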
> And the final bit, I don't believe the overlay is necessary in
> this series. Optimisations are great, but this one is a bit more on
> the controversial side. Unless I missed something and it does make
> things easier, it might make sense to do it separately later.
>

I completely agree, the overlay is not necessary. I implemented the
overlay in response to Yunsheng's strong requests for more 'unified'
processing between page and devmem. This is the most unification I can
do IMO without violating the requirements from Jason. I'm prepared to
remove the overlay if it turns out controversial, but so far I haven't
seen any complaints. Jason, please do take a look if you have not
already.

>
> > +/* These fields in struct page are used by the page_pool and net stack:
> > + *
> > + * struct {
> > + *        unsigned long pp_magic;
> > + *        struct page_pool *pp;
> > + *        unsigned long _pp_mapping_pad;
> > + *        unsigned long dma_addr;
> > + *        atomic_long_t pp_ref_count;
> > + * };
> > + *
> > + * We mirror the page_pool fields here so the page_pool can access these fields
> > + * without worrying whether the underlying fields belong to a page or net_iov.
> > + *
> > + * The non-net stack fields of struct page are private to the mm stack and must
> > + * never be mirrored to net_iov.
> > + */
> > +#define NET_IOV_ASSERT_OFFSET(pg, iov)             \
> > +        static_assert(offsetof(struct page, pg) == \
> > +                      offsetof(struct net_iov, iov))
> > +NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic);
> > +NET_IOV_ASSERT_OFFSET(pp, pp);
> > +NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr);
> > +NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count);
> > +#undef NET_IOV_ASSERT_OFFSET
> > +
> >  static inline struct dmabuf_genpool_chunk_owner *
> >  net_iov_owner(const struct net_iov *niov)
> >  {
> > @@ -47,19 +83,25 @@ net_iov_binding(const struct net_iov *niov)
> >  struct netmem {
> >          union {
> >                  struct page page;
> > -
> > -                /* Stub to prevent compiler implicitly converting from page*
> > -                 * to netmem_t* and vice versa.
> > -                 *
> > -                 * Other memory type(s) net stack would like to support
> > -                 * can be added to this union.
> > -                 */
> > -                void *addr;
> > +                struct net_iov niov;
> >          };
> >  };
> >
> ...
>
> --
> Pavel Begunkov

--
Thanks,
Mina