On 2025/3/10 17:13, Toke Høiland-Jørgensen wrote:

...

>
>> Also, we might need a similar locking or synchronisation for the dma
>> sync API in order to skip the dma sync API when page_pool_destroy() is
>> called too.
>
> Good point, but that seems a separate issue? And simpler to solve (just

If I understand the comment from the DMA experts correctly, the dma_sync
API must not be called after the dma_unmap API.

> set pool->dma_sync to false when destroying?).

Without locking or synchronisation, there is a small window between
checking pool->dma_sync and calling the dma sync API, during which the
driver might already have been unbound.

>
>>> To avoid having to walk the entire xarray on unmap to find the page
>>> reference, we stash the ID assigned by xa_alloc() into the page
>>> structure itself, in the field previously called '_pp_mapping_pad' in
>>> the page_pool struct inside struct page. This field overlaps with the
>>> page->mapping pointer, which may turn out to be problematic, so an
>>> alternative is probably needed. Sticking the ID into some of the upper
>>> bits of page->pp_magic may work as an alternative, but that requires
>>> further investigation. Using the 'mapping' field works well enough as
>>> a demonstration for this RFC, though.
>>
>> page->pp_magic seems to have only 16 bits available at most when
>> trying to reuse some unused bits in page->pp_magic, as BPF_PTR_POISON
>> and STACK_DEPOT_POISON seem to already use 16 bits, so:
>> 1. For 32-bit systems, there seem to be only 16 bits left even if
>>    ILLEGAL_POINTER_VALUE is defined as 0x0.
>> 2. For 64-bit systems, there seem to be only 12 bits left for powerpc,
>>    as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000; see
>>    arch/powerpc/Kconfig.
>>
>> So it seems we might need to limit the dma mapping count to 4096 or
>> 65536?
>>
>> And to be honest, I am not sure whether those 'unused' 12/16 bits can
>> really be reused or not; I guess we need suggestions from the mm and
>> arch experts here.
>
> Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
> AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
> See v2 of the RFC patch; the bit arithmetic there gives me:
>
> - 24 bits on 32-bit architectures
> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
> - 32 bits on other 64-bit architectures
>
> Which seems to be plenty?

I am really doubtful it is that simple, but we can always hear from the
experts if it isn't :)

>
>>> Since all the tracking is performed on DMA map/unmap, no additional code
>>> is needed in the fast path, meaning the performance overhead of this
>>> tracking is negligible. The extra memory needed to track the pages is
>>> neatly encapsulated inside xarray, which uses the 'struct xa_node'
>>> structure to track items. This structure is 576 bytes long, with slots
>>> for 64 items, meaning that a full node incurs only 9 bytes of overhead
>>> per slot it tracks (in practice, it probably won't be this efficient,
>>> but in any case it should be an acceptable overhead).
>>
>> Even if items are stored sequentially in an xa_node at first, is it
>> possible that there may be fragmentation in those xa_nodes when pages
>> are released and allocated many times during packet processing? If so,
>> is there any fragmentation info about those xa_nodes?
>
> Some (that's what I meant by "not as efficient"). AFAICT, xarray does
> do some rebalancing of the underlying radix tree, freeing nodes when
> they are no longer used. However, I am not too familiar with the xarray
> code, so I don't know exactly how efficient this is in practice.

I guess that is one of the disadvantages of using an advanced structure
like xarray :(

>