On 12/8/23 23:25, Mina Almasry wrote:
On Fri, Dec 8, 2023 at 2:56 PM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
On 12/8/23 00:52, Mina Almasry wrote:
...
+ if (pool->p.queue)
+ binding = READ_ONCE(pool->p.queue->binding);
+
+ if (binding) {
+ pool->mp_ops = &dmabuf_devmem_ops;
+ pool->mp_priv = binding;
+ }
Hmm, I don't understand why would we replace a nice transparent
api with page pool relying on a queue having devmem specific
pointer? It seemed more flexible and cleaner in the last RFC.
Jakub requested this change and may chime in, but I suspect it's to
further abstract the devmem changes from driver. In this iteration,
the driver grabs the netdev_rx_queue and passes it to the page_pool,
and any future configurations between the net stack and page_pool can
be passed this way with the driver unbothered.
Ok, that makes sense, but even if passed via an rx queue I'd
at least hope it keeping abstract provider parameters, e.g.
ops, but not hard coded with devmem specific code.
It might even be better done with a helper like
create_page_pool_from_queue(), unless there is some deeper
interaction b/w pp and rx queues is predicted.
+
if (pool->mp_ops) {
err = pool->mp_ops->init(pool);
if (err) {
@@ -1020,3 +1033,77 @@ void page_pool_update_nid(struct page_pool *pool, int new_nid)
}
}
EXPORT_SYMBOL(page_pool_update_nid);
+
+void __page_pool_iov_free(struct page_pool_iov *ppiov)
+{
+ if (WARN_ON(ppiov->pp->mp_ops != &dmabuf_devmem_ops))
+ return;
+
+ netdev_free_dmabuf(ppiov);
+}
+EXPORT_SYMBOL_GPL(__page_pool_iov_free);
I didn't look too deep but I don't think I immediately follow
the pp refcounting. It increments pages_state_hold_cnt on
allocation, but IIUC doesn't mark skbs for recycle? Then, they all
will be put down via page_pool_iov_put_many() bypassing
page_pool_return_page() and friends. That will call
netdev_free_dmabuf(), which doesn't bump pages_state_release_cnt.
At least I couldn't make it work with io_uring, and for my purposes,
I forced all puts to go through page_pool_return_page(), which calls
the ->release_page callback. The callback will put the reference and
ask its page pool to account release_cnt. It also gets rid of
__page_pool_iov_free(), as we'd need to add a hook there for
customization otherwise.
I didn't care about overhead because the hot path for me is getting
buffers from a ring, which is somewhat analogous to sock_devmem_dontneed(),
but done on pp allocations under napi, and it's done separately.
Completely untested with TCP devmem:
https://github.com/isilence/linux/commit/14bd56605183dc80b540999e8058c79ac92ae2d8
This was a mistake in the last RFC, which should be fixed in v1. In
the RFC I was not marking the skbs as skb_mark_for_recycle(), so the
unreffing path wasn't as expected.
In this iteration, that should be completely fixed. I suspect since I
just posted this you're actually referring to the issue tested on the
last RFC? Correct me if wrong.
Right, it was with RFCv3
In this iteration, the reffing story:
- memory provider allocs ppiov and returns it to the page pool with
ppiov->refcount == 1.
- The page_pool gives the page to the driver. The driver may
obtain/release references with page_pool_page_[get|put]_many(), but
the driver is likely not doing that unless it's doing its own page
recycling.
- The net stack obtains references via skb_frag_ref() ->
page_pool_page_get_many()
- The net stack drops references via skb_frag_unref() ->
napi_pp_put_page() -> page_pool_return_page() and friends.
Thus, the issue where the unref path was skipping
page_pool_return_page() and friends should be resolved in this
iteration, let me know if you think otherwise, but I think this was an
issue limited to the last RFC.
Then page_pool_iov_put_many() should and supposedly would never be
called by non devmap code because all puts must circle back into
->release_page. Why adding it to into page_pool_page_put_many()?
@@ -731,6 +731,29 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
+ if (page_is_page_pool_iov(page)) {
...
+ page_pool_page_put_many(page, 1);
+ return NULL;
+ }
Well, I'm looking at this new branch from Patch 10, it can put
the buffer, but what if we race at it's actually the final put?
Looks like nobody is going to to bump up pages_state_release_cnt
If you remove the branch, let it fall into ->release and rely
on refcounting there, then the callback could also fix up
release_cnt or ask pp to do it, like in the patch I linked above
--
Pavel Begunkov