On 2021/7/26 0:49, Alexander Duyck wrote:
> On Sat, Jul 24, 2021 at 6:07 AM Yunsheng Lin <yunshenglin0825@xxxxxxxxx> wrote:
>>
>> On Fri, Jul 23, 2021 at 09:08:00AM -0700, Alexander Duyck wrote:
>>> On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>>>>
>>>> On 2021/7/22 23:18, Alexander Duyck wrote:
>>>>>>>
>>>>>>>> You are right that that may cover up the reference count errors. How about
>>>>>>>> something like below:
>>>>>>>>
>>>>>>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>>>>>>                                                            long nr)
>>>>>>>> {
>>>>>>>> #ifdef CONFIG_DEBUG_PAGE_REF
>>>>>>>>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>>>
>>>>>>>>         WARN_ON(ret < 0);
>>>>>>>>
>>>>>>>>         return ret;
>>>>>>>> #else
>>>>>>>>         if (atomic_long_read(&page->pp_frag_count) == nr)
>>>>>>>>                 return 0;
>>>>>>>>
>>>>>>>>         return atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>>> #endif
>>>>>>>> }
>>>>>>>>
>>>>>>>> Or any better suggestion?
>>>>>>>
>>>>>>> So the one thing I might change would be to make it so that you only
>>>>>>> do the atomic_long_read if nr is a constant via __builtin_constant_p.
>>>>>>> That way you would be performing the comparison in
>>>>>>> __page_pool_put_page and in the cases of freeing or draining the
>>>>>>> page_frags you would be using the atomic_long_sub_return which should
>>>>>>> be paths where you would not expect it to match or that are slowpath
>>>>>>> anyway.
>>>>>>>
>>>>>>> Also I would keep the WARN_ON in both paths just to be on the safe side.
>>>>>>
>>>>>> If I understand it correctly, we should change it as below, right?
>>>>>>
>>>>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>>>>                                                            long nr)
>>>>>> {
>>>>>>         long ret;
>>>>>>
>>>>>>         /* As suggested by Alexander, atomic_long_read() may cover up the
>>>>>>          * reference count errors, so avoid calling atomic_long_read() in
>>>>>>          * the cases of freeing or draining the page_frags, where we would
>>>>>>          * not expect it to match or that are slowpath anyway.
>>>>>>          */
>>>>>>         if (__builtin_constant_p(nr) &&
>>>>>>             atomic_long_read(&page->pp_frag_count) == nr)
>>>>>>                 return 0;
>>>>>>
>>>>>>         ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>         WARN_ON(ret < 0);
>>>>>>         return ret;
>>>>>> }
>>>>>
>>>>> Yes, that is what I had in mind.
>>>>>
>>>>> One thought I had for a future optimization is that we could look at
>>>>> reducing the count by 1 so that we could essentially combine the
>>>>> non-frag and frag cases. Then instead of testing for 1 we would test
>>>>> for 0 at the start of the function and test for < 0 to decide if we
>>>>> want to free it or not instead of testing for 0. With that we can
>>>>> essentially reduce the calls to the WARN_ON since we should only have
>>>>> one case where we actually return a value < 0, and we can then check
>>>>> to see if we overshot -1 which would be the WARN_ON case.
>>>>>
>>>>> With that a value of 0 instead of 1 would indicate page frag is not in
>>>>> use for the page *AND/OR* that the page has reached the state where
>>>>> there are no other frags present so the page can be recycled. In
>>>>> effect it would allow us to mix page frags and no frags within the
>>>>> same pool. The added bonus would be that we could get rid of the
>>>>> PP_FLAG_PAGE_FRAG check in the __page_pool_put_page function and
>>>>> replace it with a check for PAGE_POOL_DMA_USE_PP_FRAG_COUNT since we
>>>>> cannot read frag_count in that case.
>>>>
>>>> Let's leave it for a future optimization.
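Just to write down my understanding of the count-biasing idea above for when
we come back to it, I think it would look roughly like the below. This is an
untested sketch with a made-up name, not an actual patch, assuming
pp_frag_count would store "outstanding frag users - 1":

static inline bool page_pool_frag_sub_and_test(struct page *page, long nr)
{
        long ret;

        /* With the bias, a non-frag page and a frag page with a single
         * remaining user both read as nr - 1 here (i.e. 0 for the common
         * nr == 1 case), so they can be recycled without touching the
         * atomic count.
         */
        if (__builtin_constant_p(nr) &&
            atomic_long_read(&page->pp_frag_count) == nr - 1)
                return true;

        ret = atomic_long_sub_return(nr, &page->pp_frag_count);

        /* Only the "we were the last user" case should go negative, and it
         * should land exactly on -1; anything lower is a refcount bug.
         */
        WARN_ON(ret < -1);

        return ret < 0;
}

With that, __page_pool_put_page() would recycle the page whenever the helper
returns true, for both frag and non-frag pages, replacing the
PP_FLAG_PAGE_FRAG check with the PAGE_POOL_DMA_USE_PP_FRAG_COUNT check as
described above.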
>>>> I am not sure if there is a use case to support both frag pages and
>>>> non-frag pages for the same page pool. If there is, maybe we can use
>>>> "page->pp_frag_count > 0" to indicate that the page is a frag page, and
>>>> "page->pp_frag_count == 0" to indicate that the page is a non-frag page,
>>>> so that we can support frag pages and non-frag pages for the same page
>>>> pool instead of disabling non-frag page support when the PP_FLAG_PAGE_FRAG
>>>> flag is set, which might conflict with the above optimization?
>>>
>>> As far as use case I can see a number of potential uses. For example
>>> in the case of drivers that do something like a header/data split I
>>> could see potentially having the header pages be frags while the data
>>> pages being 4K blocks. Basically the big optimization of the count ==
>>> 1/0/nr case is that you aren't increasing/decreasing the count and it
>>> is immediately being recycled/reused. So in such a case being able to
>>> add frag count to some pages, and not to others, would likely be quite
>>> useful.
>>
>> I am not sure how the header/data split is implemented in hw, but it
>> seems the driver is not able to tell which desc will be filled with
>> header or data in advance, so it might need to allocate a 4K block for
>> every desc?
>
> It all depends on the hardware config. In theory you could have
> anything from a single use for a page to multiple uses for a page in
> the case of headers and/or packets being small. The overhead for
> adding/removing the frag count could end up being more than what is
> needed if the page is only used once. That is why I was thinking it
> might make sense to allow both to coexist in the same pool.

I agree that there may be a use case for using both frag pages and non-frag
pages in the same page pool. Let's leave it for now.

>
>>>
>>> Basically by shifting the pool values by 1 you can have both in the
>>> same pool with little issue. However the big change is that instead of
>>> testing for count = nr it would end up being pp_frag_count = nr - 1.
>>> So in the case of the standard page pool pages being freed or the last
>>> frag you would be looking at pp_frag_count = 0. In addition we can
>>> mask the WARN_ON overhead as you would be using -1 as the point to
>>> free so you would only have to perform the WARN_ON check for the last
>>> frag instead of every frag.
>>
>> Yes, it seems doable.
>>
>>>
>>>> Also, I am prototyping the tx recycling based on page pool in order to see
>>>> if there is any value in supporting the tx recycling.
>>>
>>> Just to clarify here, when you say Tx recycling you are talking about
>>> socket to netdev, correct? Just want to be certain since the netdev to
>>> netdev case should already have recycling for page pool pages as long
>>> as it follows a 1<->1 path.
>>
>> Yes, the above Tx recycling meant socket to netdev.
>> Also, the above "netdev to netdev" only means XDP for now, and not the IP
>> forwarding path in the network stack, right?
>>
>>>
>>>> As busypoll has enabled a one-to-one relation between NAPI and sock,
>>>> and there is a one-to-one relation between NAPI and page pool, perhaps
>>>> it makes sense to use the page pool to recycle the tx pages too?
>>>>
>>>> These are the possible problems I am aware of now when doing that:
>>>> 1. busypoll is for rx, and tx may not be using the same queue as rx even
>>>> if they are *technically* the same flow, so I am not sure it is ok to use
>>>> the busypoll infrastructure to get the page pool ptr for a specific sock.
>>>>
>>>> 2. There may be multiple socks using the same page pool ptr to allocate
>>>> pages for multiple flows, so we can not assume the same NAPI polling
>>>> protection as rx, which might mean we can only use the recyclable pages
>>>> from pool->ring under the r->consumer_lock protection.
>>>>
>>>> 3. Right now tcp_sendmsg_locked() uses sk_page_frag_refill() to refill
>>>> the page frag for tcp xmit; when implementing a similar
>>>> sk_page_pool_frag_refill() based on page pool, I found that tcp
>>>> coalescing in tcp_mtu_probe() and tcp fragmenting in tso_fragment()
>>>> might mess with the page_ref_count directly.
>>>>
>>>> Given the above problems I am aware of (and I believe there are other
>>>> problems I am not aware of yet), I am not sure if tcp tx page recycling
>>>> based on page pool is doable or not. I would like to hear your opinion
>>>> about tcp tx recycling support based on page pool first, in case it is
>>>> a dead end to support that.
>>>
>>> I'm honestly not sure there is much there to gain. Last I knew TCP was
>>> using order 3 pages for transmitting and as a result the overhead for
>>> the pages should already be greatly reduced. In addition one of the
>>> main reasons for page_pool is the fact that the device has to DMA map
>>> the page and that can have very high overhead on systems with an
>>> IOMMU.
>>
>> Yes, avoiding the IOMMU overhead is the main gain, and "order 3 pages"
>> seems to be disabled by default?
>>
>>>
>>> Rather than trying to reuse the device's page pool it might make more
>>> sense to see if you couldn't have TCP just use some sort of circular
>>> buffer of memory that is directly mapped for the device that it is
>>> going to be transmitting to. Essentially what you would be doing is
>>> creating a pre-mapped page and would need to communicate that the
>>> memory is already mapped for the device you want to send it to so that
>>> it could skip that step.
>>
>> IIUC sk_page_frag_refill() is already doing a similar kind of reuse to
>> the rx reuse implemented in most drivers, except for the pre-mapping
>> part.
>>
>> And it seems that even if we pre-map the page and communicate that the
>> memory is already mapped to the driver, it is likely that we will not
>> be able to reuse the page when the circular buffer is not big enough
>> or tx completion/tcp ack is not happening quickly enough, which might
>> mean unmapping/deallocating the old circular buffer and
>> allocating/mapping a new circular buffer.
>>
>> Using page pool we might be able to alleviate the above problem as it
>> does for rx?
>
> I would say that instead of looking at going straight for the page
> pool it might make more sense to look at seeing if we can coalesce the
> DMA mapping of the pages first at the socket layer rather than trying
> to introduce the overhead for the page pool. In the case of sockets we
> already have the destructors that are called when the memory is freed,
> so instead of making sockets use page pool it might make more sense to
> extend the socket buffer allocation/freeing to incorporate bulk
> mapping and unmapping of pages to optimize the socket Tx path in the
> 32K page case.

I was able to get the tx recycling prototype based on page pool running to
do some performance tests: the performance improvement is about +20%
(30Gbit -> 38Gbit) for a single-thread iperf tcp flow when the IOMMU is in
strict mode, and CPU usage decreases by about 10% for four-thread iperf tcp
flows at a line speed of 100Gbit when the IOMMU is in strict mode.
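For reference, the rough shape of the sk_page_pool_frag_refill() I have in
mind is something like the below. This is a simplified sketch rather than
the actual prototype code: sk_tx_page_pool() is a made-up placeholder for
"find the page pool to use for this socket", which is exactly the
problematic part in points 1 and 2 above, and it assumes the frag
allocation API (page_pool_alloc_frag()) from this series:

static bool sk_page_pool_frag_refill(struct sock *sk, struct page_frag *pfrag,
                                     unsigned int size)
{
        struct page_pool *pool = sk_tx_page_pool(sk);
        unsigned int offset;
        struct page *page;

        page = page_pool_alloc_frag(pool, &offset, size, sk->sk_allocation);
        if (unlikely(!page))
                return false;

        /* present the frag to tcp_sendmsg_locked() in the same form that
         * sk_page_frag_refill() would
         */
        pfrag->page = page;
        pfrag->offset = offset;
        pfrag->size = offset + size;
        return true;
}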
Looking at the prototyping code, I agree that it is a bit controversial to
use the page pool for tx, as the page pool assumes NAPI polling protection
on the allocation side. So I will take a deeper look at your suggestion
above to see how to implement it.

Also, I am assuming the "destructors" means tcp_wfree() for TCP, right?
It seems tcp_wfree() is mainly used to do memory accounting and to free
"struct sock" if necessary. I am not familiar enough with the socket layer
to understand how the "destructors" would be helpful here; do you have any
detailed idea how to use the "destructors" here?
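To make the question a bit more concrete: is the idea something along the
lines of the sketch below, where the pages stay DMA mapped and the skb
destructor only returns them to a per-socket cache of pre-mapped pages?
Everything named sk_premapped_* here is made up; it is only meant to
illustrate the question, not to propose an implementation:

static void sk_tx_premapped_destructor(struct sk_buff *skb)
{
        struct sk_premapped_cache *cache = skb_shinfo(skb)->destructor_arg;
        int i;

        /* keep the DMA mapping; just hand the pages back to a per-socket
         * cache so the next tcp_sendmsg() can reuse the mapped pages
         */
        for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
                sk_premapped_cache_put(cache,
                                       skb_frag_page(&skb_shinfo(skb)->frags[i]));

        /* the normal wmem accounting still needs to happen somewhere */
        sock_wfree(skb);
}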