On 2024/11/6 23:57, Jesper Dangaard Brouer wrote:

...

>>
>> Some more info from production servers.
>>
>> (I'm amazed what we can do with a simple bpftrace script, Cc Viktor)
>>
>> In below bpftrace script/oneliner I'm extracting the inflight count, for
>> all page_pool's in the system, and storing that in a histogram hash.
>>
>> sudo bpftrace -e '
>>  rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>>   @cnt_total[probe]=count();
>>   $pool=(struct page_pool*)arg0;
>>   $release_cnt=(uint32)arg2;
>>   $hold_cnt=$pool->pages_state_hold_cnt;
>>   $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>>   @inflight=hist($inflight_cnt);
>>  }
>>  interval:s:1 {time("\n%H:%M:%S\n");
>>   print(@cnt); clear(@cnt);
>>   print(@inflight);
>>   print(@cnt_total);
>>  }'
>>
>> The page_pool behavior depend on how NIC driver use it, so I've run this on
>> two prod servers with drivers bnxt and mlx5, on a 6.6.51 kernel.
>>
>> Driver: bnxt_en
>> - kernel 6.6.51
>>
>> @cnt[rawtracepoint:page_pool_state_release]: 8447
>> @inflight:
>> [0]             507 | |
>> [1]             275 | |
>> [2, 4)          261 | |
>> [4, 8)          215 | |
>> [8, 16)         259 | |
>> [16, 32)        361 | |
>> [32, 64)        933 | |
>> [64, 128)      1966 | |
>> [128, 256)   937052 |@@@@@@@@@ |
>> [256, 512)  5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [512, 1K)     73908 | |
>> [1K, 2K)    1220128 |@@@@@@@@@@@@ |
>> [2K, 4K)    1532724 |@@@@@@@@@@@@@@@ |
>> [4K, 8K)    1849062 |@@@@@@@@@@@@@@@@@@ |
>> [8K, 16K)   1466424 |@@@@@@@@@@@@@@ |
>> [16K, 32K)   858585 |@@@@@@@@ |
>> [32K, 64K)   693893 |@@@@@@ |
>> [64K, 128K)  170625 |@ |
>>
>> Driver: mlx5_core
>> - Kernel: 6.6.51
>>
>> @cnt[rawtracepoint:page_pool_state_release]: 1975
>> @inflight:
>> [128, 256)  28293 |@@@@ |
>> [256, 512) 184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> [512, 1K)       0 | |
>> [1K, 2K)     4671 | |
>> [2K, 4K)   342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4K, 8K)   180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
>> [8K, 16K)   96483 |@@@@@@@@@@@@@@ |
>> [16K, 32K)  25133 |@@@ |
>> [32K, 64K)   8274 |@ |
>>
>>
>> The key thing to notice that
>> we have up-to 128,000 pages in flight on
>> these random production servers. The NIC have 64 RX queue configured,
>> thus also 64 page_pool objects.
>>
>
> I realized that we primarily want to know the maximum in-flight pages.
>
> So, I modified the bpftrace oneliner to track the max for each page_pool in
> the system.
>
> sudo bpftrace -e '
>  rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>   @cnt_total[probe]=count();
>   $pool=(struct page_pool*)arg0;
>   $release_cnt=(uint32)arg2;
>   $hold_cnt=$pool->pages_state_hold_cnt;
>   $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>   $cur=@inflight_max[$pool];
>   if ($inflight_cnt > $cur) {
>     @inflight_max[$pool]=$inflight_cnt;}
>  }
>  interval:s:1 {time("\n%H:%M:%S\n");
>   print(@cnt); clear(@cnt);
>   print(@inflight_max);
>   print(@cnt_total);
>  }'
>
> I've attached the output from the script.
> For unknown reason this system had 199 page_pool objects.

Perhaps some of those page_pool objects are per_cpu page_pool objects from
net_page_pool_create()? It would be good if the pool_size for those page_pool
objects is printed too.
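As a side note on the script quoted above: the (int32) cast of the difference
between two uint32 counters is what keeps the inflight value correct when the
hold counter wraps around. Below is a minimal Python sketch of that arithmetic
and of the per-pool max tracking. The function names inflight() and
track_max() are hypothetical, chosen only for illustration, and this is not
the kernel or bpftrace implementation.

```python
def inflight(hold_cnt: int, release_cnt: int) -> int:
    """Signed 32-bit distance between two free-running u32 counters.

    Mirrors (int32)($hold_cnt - $release_cnt) from the quoted script.
    """
    diff = (hold_cnt - release_cnt) & 0xFFFFFFFF  # u32 subtraction, modulo 2**32
    # Reinterpret the 32-bit pattern as a signed value.
    return diff - (1 << 32) if diff & 0x80000000 else diff


def track_max(samples):
    """Keep the largest inflight value seen per pool.

    `samples` is an iterable of (pool_id, hold_cnt, release_cnt) tuples.
    A missing map entry is treated as 0, which is how the quoted script
    behaves when it reads @inflight_max[$pool] for the first time.
    """
    maxima = {}
    for pool, hold, release in samples:
        cur = inflight(hold, release)
        if cur > maxima.get(pool, 0):
            maxima[pool] = cur
    return maxima
```

With this, inflight(5, 0xFFFFFFFB) still yields 10 even though the hold
counter has wrapped past zero while the release counter has not.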
>
> The 20 top users:
>
> $ cat out02.inflight-max | grep inflight_max | tail -n 20
> @inflight_max[0xffff88829133d800]: 26473
> @inflight_max[0xffff888293c3e000]: 27042
> @inflight_max[0xffff888293c3b000]: 27709
> @inflight_max[0xffff8881076f2800]: 29400
> @inflight_max[0xffff88818386e000]: 29690
> @inflight_max[0xffff8882190b1800]: 29813
> @inflight_max[0xffff88819ee83800]: 30067
> @inflight_max[0xffff8881076f4800]: 30086
> @inflight_max[0xffff88818386b000]: 31116
> @inflight_max[0xffff88816598f800]: 36970
> @inflight_max[0xffff8882190b7800]: 37336
> @inflight_max[0xffff888293c38800]: 39265
> @inflight_max[0xffff888293c3c800]: 39632
> @inflight_max[0xffff888293c3b800]: 43461
> @inflight_max[0xffff888293c3f000]: 43787
> @inflight_max[0xffff88816598f000]: 44557
> @inflight_max[0xffff888132ce9000]: 45037
> @inflight_max[0xffff888293c3f800]: 51843
> @inflight_max[0xffff888183869800]: 62612
> @inflight_max[0xffff888113d08000]: 73203
>
> Adding all values together:
>
>  grep inflight_max out02.inflight-max | awk 'BEGIN {tot=0} {tot+=$2; printf "total:" tot "\n"}' | tail -n 1
>
> total:1707129
>
> Worst case we need a data structure holding 1,707,129 pages.

For a 64-bit system, that means roughly 27MB of memory overhead for tracking
those inflight pages if 16 bytes of metadata are needed for each page, or about
54MB at 32 bytes per page. I guess that is ok for those large systems.

> Fortunately, we don't need a single data structure as this will be split
> between 199 page_pool's.

It would be good to have an average value for the number of inflight pages, so
that we might be able to use statically allocated memory to satisfy the most
common case, and use dynamically allocated memory if/when necessary.

>
> --Jesper
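To make the numbers above easy to re-check, here is a small Python sketch that
sums the @inflight_max values from the bpftrace output and estimates the
tracking overhead for a given per-page metadata size. The helper names and the
assumption that the file holds a single print of the map are mine; if a pool
appears more than once, the sketch keeps the last value seen, which differs
from the awk pipeline that adds up every matching line.

```python
import re

# Matches lines such as "@inflight_max[0xffff888113d08000]: 73203".
LINE_RE = re.compile(r"^@inflight_max\[(0x[0-9a-fA-F]+)\]:\s+(\d+)$")


def parse_inflight_max(text):
    """Return {pool_address: max_inflight} parsed from bpftrace map output."""
    maxima = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            maxima[m.group(1)] = int(m.group(2))  # a later print overwrites an earlier one
    return maxima


def tracking_overhead_bytes(total_pages, bytes_per_page):
    """Worst-case metadata size if every inflight page needs one entry."""
    return total_pages * bytes_per_page


# Figures taken from the mail above.
TOTAL_PAGES = 1_707_129
NUM_POOLS = 199

# Mean of the per-pool maxima. This is an upper-bound style figure, not the
# average number of inflight pages over time.
MEAN_OF_MAXIMA = TOTAL_PAGES / NUM_POOLS
```

For the reported total, tracking_overhead_bytes(TOTAL_PAGES, 16) gives
27,314,064 bytes and tracking_overhead_bytes(TOTAL_PAGES, 32) gives 54,628,128
bytes. MEAN_OF_MAXIMA comes out at roughly 8.6K pages per pool, but that is the
mean of the maxima; sizing a static allocation would still need the
time-averaged inflight count, which this output does not contain.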