On Wed, Jun 1, 2022 at 5:33 AM 愚树 <chen45464546@xxxxxxx> wrote: > > At 2022-06-01 01:28:59, "Alexander Duyck" <alexander.duyck@xxxxxxxxx> wrote: > >On Tue, May 31, 2022 at 8:47 AM Jakub Kicinski <kuba@xxxxxxxxxx> wrote: > >> > >> On Tue, 31 May 2022 23:36:22 +0800 Chen Lin wrote: > >> > At 2022-05-31 22:14:12, "Jakub Kicinski" <kuba@xxxxxxxxxx> wrote: > >> > >On Tue, 31 May 2022 22:41:12 +0800 Chen Lin wrote: > >> > >> The sample code above cannot completely solve the current problem. > >> > >> For example, when fragsz is greater than PAGE_FRAG_CACHE_MAX_SIZE(32768), > >> > >> __page_frag_cache_refill will return a memory of only 32768 bytes, so > >> > >> should we continue to expand the PAGE_FRAG_CACHE_MAX_SIZE? Maybe more > >> > >> work needs to be done > >> > > > >> > >Right, but I can think of two drivers off the top of my head which will > >> > >allocate <=32k frags but none which will allocate more. > >> > > >> > In fact, it is rare to apply for more than one page, so is it necessary to > >> > change it to support? > >> > >> I don't really care if it's supported TBH, but I dislike adding > >> a branch to the fast path just to catch one or two esoteric bad > >> callers. > >> > >> Maybe you can wrap the check with some debug CONFIG_ so it won't > >> run on production builds? > > > >Also the example used here to define what is triggering the behavior > >is seriously flawed. The code itself is meant to allow for order0 page > >reuse, and the 32K page was just an optimization. So the assumption > >that you could request more than 4k is a bad assumption in the driver > >that is making this call. > > > >So I am in agreement with Kuba. We shouldn't be needing to add code in > >the fast path to tell users not to shoot themselves in the foot. > > > >We already have code in place in __netdev_alloc_skb that is calling > >the slab allocator if "len > SKB_WITH_OVERHEAD(PAGE_SIZE)". We could > >probably just add a DEBUG wrapped BUG_ON to capture those cases where > >a driver is making that mistake with __netdev_alloc_frag_align. > > Thanks for the clear explanation. > The reality is that it is not easy to capture the drivers that make such mistake. > Because memory corruption usually leads to errors on other unrelated modules. > Not long ago, we have spent a lot of time and effort to locate a issue that > occasionally occurs in different kernel modules, and finally find the root cause is > the improper use of this netdev_alloc_frag interface in DPAA net driver from NXP. > It's a miserable process. > > I also found that some net drivers in the latest Linux version have this issue. > Like: > 1. netdev_alloc_frag "len" may larger than PAGE_SIZE > #elif (PAGE_SIZE >= E1000_RXBUFFER_4096) > adapter->rx_buffer_len = PAGE_SIZE; > #endif > > static unsigned int e1000_frag_len(const struct e1000_adapter *a) > { > return SKB_DATA_ALIGN(a->rx_buffer_len + E1000_HEADROOM) + > SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > } > > static void *e1000_alloc_frag(const struct e1000_adapter *a) > { > unsigned int len = e1000_frag_len(a); > u8 *data = netdev_alloc_frag(len); > } > "./drivers/net/ethernet/intel/e1000/e1000_main.c" 5316 --38%-- So there isn't actually a bug in this code. Specifically the code is split up between two paths. The first code block comes from the jumbo frames path which creates a fraglist skb and will memcpy the header out if I recall correctly. The code from the other two functions is from the non-jumbo frames path which has restricted the length to MAXIMUM_ETHERNET_VLAN_SIZE. > 2. netdev_alloc_frag "ring->frag_size" may larger than (4096 * 3) > > #define MTK_MAX_LRO_RX_LENGTH (4096 * 3) > if (rx_flag == MTK_RX_FLAGS_HWLRO) { > rx_data_len = MTK_MAX_LRO_RX_LENGTH; > rx_dma_size = MTK_HW_LRO_DMA_SIZE; > } else { > rx_data_len = ETH_DATA_LEN; > rx_dma_size = MTK_DMA_SIZE; > } > > ring->frag_size = mtk_max_frag_size(rx_data_len); > > for (i = 0; i < rx_dma_size; i++) { > ring->data[i] = netdev_alloc_frag(ring->frag_size); > if (!ring->data[i]) > return -ENOMEM; > } > "drivers/net/ethernet/mediatek/mtk_eth_soc.c" 3344 --50%-- > > I will try to fix these drivers later. This one I don't know as much about, and it does appear to contain a bug. What it should be doing is a check before doing the netdev_alloc_frag call to verify if it is less than 4K then it uses netdev_alloc_frag, if it is greater then it needs to use alloc_pages. > Even experienced driver engineers may use this netdev_alloc_frag > interface incorrectly. > So I thought it is best to provide some prompt information of usage > error inside the netdev_alloc_frag, or it's OK to report such mistake > during system running which may caused by fragsz varies(exceeded page size). > > Now, as you and Kuba mentioned earlier, "do not add code in fast path". > > Can we just add code to the relatively slow path to capture the mistake > before it lead to memory corruption? > Like: > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index e6f211d..ac60a97 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5580,6 +5580,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc, > /* reset page count bias and offset to start of new frag */ > nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1; > offset = size - fragsz; > + BUG_ON(offset < 0); > } > > nc->pagecnt_bias--; > I think I could be onboard with a patch like this. The test shouldn't add more than 1 instruction since it is essentially just a jump if signed test which will be performed after the size - fragsz check. > Additional, we may modify document to clearly indicate the limits of the > input parameter fragsz. > Like: > diff --git a/Documentation/vm/page_frags.rst b/Documentation/vm/page_frags.rst > index 7d6f938..61b2805 100644 > --- a/Documentation/vm/page_frags.rst > +++ b/Documentation/vm/page_frags.rst > @@ -4,7 +4,7 @@ > Page fragments > ============== > > -A page fragment is an arbitrary-length arbitrary-offset area of memory > +A page fragment is an arbitrary-length(must <= PAGE_SIZE) arbitrary-offset area of memory > which resides within a 0 or higher order compound page. The main thing I would call out about the page fragment is that it should be less than an order 0 page in size, ideally at least half a page to allow for reuse even in the case of order 0 pages. Otherwise it is really an abuse of the interface as it isn't really meant to be allocating 1 fragment per page since the efficiency will drop pretty significantly as memory becomes fragmented and it becomes harder to allocate higher order pages. It would essentially just become alloc_page with more overhead.