> Here's my quick thoughts at this late hour, though I might have
> something else tomorrow. If the links are external to the dma addr
> being freed, then I think you need to change the entire alloc/free API
> to replace the dma_addr_t handle with a new type, like a tuple of
>
> { dma_addr_t, priv_list_link }
>
> That should make it possible to preserve the low constant time to alloc
> and free in the critical section, which I think is a good thing to have.
> I found 170 uses of dma_pool_alloc, and 360 dma_pool_free in the kernel,
> so changing the API is no small task. :(

Indeed, an API change like this might be the only way to move the metadata out of the DMA blocks while maintaining the allocator's current performance.

Regarding the performance hit of the submitted patches:

- *Allocations* remain constant time (`O(1)`): they simply check the `pool->next_block` pointer.
- *Frees* are no longer constant time. Previously, `dma_pool_free()` converted a `vaddr` to its corresponding `block` with a type cast; now it calls `pool_find_block()`, which iterates over all pages (`O(n)`).

Optimizing `dma_pool_free()` is therefore the key. While restructuring the API is the "best" solution, I implemented a compromise: a `struct maple_tree block_map` field in `struct dma_pool` that maps each `vaddr` to its corresponding `block`. A maple tree is not constant time, but its `O(log n)` lookups are a significant improvement over the current `O(n)` iteration.

Here are the performance results. I've already reported the first two sets of numbers, but I'll repeat them here for easy comparison:

**With no patches applied:**
```
dmapool test: size:16 align:16 blocks:8192 time:11860
dmapool test: size:64 align:64 blocks:8192 time:11951
dmapool test: size:256 align:256 blocks:8192 time:12287
dmapool test: size:1024 align:1024 blocks:2048 time:3134
dmapool test: size:4096 align:4096 blocks:1024 time:1686
dmapool test: size:68 align:32 blocks:8192 time:12050
```

**With the submitted patches applied:**
```
dmapool test: size:16 align:16 blocks:8192 time:34432
dmapool test: size:64 align:64 blocks:8192 time:62262
dmapool test: size:256 align:256 blocks:8192 time:238137
dmapool test: size:1024 align:1024 blocks:2048 time:61386
dmapool test: size:4096 align:4096 blocks:1024 time:75342
dmapool test: size:68 align:32 blocks:8192 time:88243
```

**With the submitted patches applied AND a maple tree for vaddr-to-block translations:**
```
dmapool test: size:16 align:16 blocks:8192 time:43668
dmapool test: size:64 align:64 blocks:8192 time:44746
dmapool test: size:256 align:256 blocks:8192 time:45434
dmapool test: size:1024 align:1024 blocks:2048 time:11013
dmapool test: size:4096 align:4096 blocks:1024 time:5250
dmapool test: size:68 align:32 blocks:8192 time:45900
```

The maple tree reduces the performance hit for every block size except the smallest (16 bytes), and especially for the larger ones. While performance is not fully back to baseline, this strikes a reasonable trade-off between protection, runtime performance, and ease of deployment (i.e., no API change required). To make the comparison concrete, rough sketches of both approaches follow below.

If this approach looks acceptable, I can submit it as a V3 patch series for further review and discussion.
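First, the handle-based API. This sketch is only my reading of the suggestion quoted above; `struct dma_pool_handle` and the changed signatures are hypothetical, not an existing kernel API:

```c
/*
 * Hypothetical sketch of the suggested tuple handle. It bundles the
 * dma_addr_t with a pool-private link back to the block's metadata,
 * so that both alloc and free stay O(1) even with the metadata stored
 * outside the DMA-able memory itself.
 */
struct dma_pool_handle {
	dma_addr_t	dma;		 /* address the caller programs into HW */
	void		*priv_list_link; /* pool-private; opaque to callers */
};

/* Callers would pass the whole handle back on free, so the pool can
 * reach the block's metadata directly instead of searching for it: */
void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
		     struct dma_pool_handle *handle);
void dma_pool_free(struct dma_pool *pool, void *vaddr,
		   struct dma_pool_handle *handle);
```

Converting all ~530 existing call sites to carry such a handle is what makes this change so invasive.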
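Second, the maple-tree compromise. This is a simplified sketch rather than the patch itself; the helper names are illustrative, and gfp handling, locking, and error paths are omitted:

```c
#include <linux/maple_tree.h>

struct dma_pool {
	/* ... existing fields ... */
	struct maple_tree block_map;	/* block vaddr -> its metadata */
};

/* dma_pool_create() would initialize the tree with
 * mt_init(&pool->block_map), and dma_pool_destroy() would tear it
 * down with mtree_destroy(&pool->block_map). */

/* Record a block when a new page is carved up. mtree_store() may
 * allocate tree nodes, so it takes gfp flags and can fail. */
static int pool_map_block(struct dma_pool *pool, void *vaddr,
			  struct dma_block *block, gfp_t gfp)
{
	return mtree_store(&pool->block_map, (unsigned long)vaddr,
			   block, gfp);
}

/* O(log n) lookup that replaces the O(n) pool_find_block() page walk
 * in dma_pool_free(). */
static struct dma_block *pool_lookup_block(struct dma_pool *pool,
					   void *vaddr)
{
	return mtree_load(&pool->block_map, (unsigned long)vaddr);
}
```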
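Note that in this sketch the tree is only written when a new page is carved into blocks, where gfp flags are available; the free path only does a lookup and never allocates.

Thanks,
Brian Johannesmeyer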