This patch series improves dmapool scalability by replacing linear scans
with red-black trees.

History:

In 2018 this patch series made it through 4 versions. v1 used red-black
trees; v2 - v4 put the dma pool info directly into struct page and used
virt_to_page() to get at it. v4 made a brief appearance in linux-next, but
it caused problems on non-x86 archs where virt_to_page() doesn't work with
dma_alloc_coherent, so it was reverted. I was too busy at the time to
repost the red-black tree version, and I forgot about it until now. This
version is based on the red-black trees of v1, but it addresses all the
review comments I got at the time and adds cleanup patches.

Note that Keith Busch is also working on improving dmapool scalability, so
for now I would recommend not merging my scalability patches until Keith's
approach can be evaluated. In the meantime, my patches can serve as a
benchmark comparison. I also have a number of cleanup patches in my series
that could be useful on their own.

References:

v1
https://lore.kernel.org/linux-mm/73ec1f52-d758-05df-fb6a-41d269e910d0@xxxxxxxxxxxxxxx/

v2
https://lore.kernel.org/linux-mm/ec701153-fdc9-37f3-c267-f056159b4606@xxxxxxxxxxxxxxx/

v3
https://lore.kernel.org/linux-mm/d48854ff-995d-228e-8356-54c141c32117@xxxxxxxxxxxxxxx/

v4
https://lore.kernel.org/linux-mm/88395080-efc1-4e7b-f813-bb90c86d0745@xxxxxxxxxxxxxxx/

problem caused by virt_to_page()
https://lore.kernel.org/linux-kernel/20181206013054.GI6707@xxxxxxxxxxx/

Keith Busch's dmapool performance enhancements
https://lore.kernel.org/linux-mm/20220428202714.17630-1-kbusch@xxxxxxxxxx/

Below is my original description of the motivation for these patches.

drivers/scsi/mpt3sas is running into a scalability problem with the
kernel's DMA pool implementation.
With an LSI/Broadcom SAS 9300-8i 12Gb/s HBA and max_sgl_entries=256,
during modprobe, mpt3sas does the equivalent of:

    chain_dma_pool = dma_pool_create(size = 128);
    for (i = 0; i < 373959; i++) {
        dma_addr[i] = dma_pool_alloc(chain_dma_pool);
    }

And at rmmod, system shutdown, or system reboot, mpt3sas does the
equivalent of:

    for (i = 0; i < 373959; i++) {
        dma_pool_free(chain_dma_pool, dma_addr[i]);
    }
    dma_pool_destroy(chain_dma_pool);

With this usage, both the dma_pool_alloc() loop and the dma_pool_free()
loop exhibit O(n^2) complexity, although dma_pool_free() is much worse due
to implementation details. On my system, the dma_pool_free() loop above
takes about 9 seconds to run. Note that the problem was even worse before
commit 74522a92bbf0 ("scsi: mpt3sas: Optimize I/O memory consumption in
driver."), where the dma_pool_free() loop could take ~30 seconds.

mpt3sas also has some other DMA pools, but chain_dma_pool is the only one
with so many allocations:

cat /sys/devices/pci0000:80/0000:80:07.0/0000:85:00.0/pools
(manually cleaned up column alignment)
poolinfo - 0.1
reply_post_free_array pool  1      21     192     1
reply_free pool             1      1      41728   1
reply pool                  1      1      1335296 1
sense pool                  1      1      970272  1
chain pool                  373959 386048 128     12064
reply_post_free pool        12     12     166528  12

The patches in this series improve the scalability of the DMA pool
implementation, which significantly reduces the running time of the
DMA alloc/free loops. With the patches applied, "modprobe mpt3sas",
"rmmod mpt3sas", and system shutdown/reboot with mpt3sas loaded are
significantly faster. Here are some benchmarks (of DMA alloc/free only,
not the entire modprobe/rmmod):

dma_pool_create() + dma_pool_alloc() loop, size = 128, count = 373959
  original:         350 ms (  1x)
  dmapool patches:   18 ms ( 19x)

dma_pool_free() loop + dma_pool_destroy(), size = 128, count = 373959
  original:        8901 ms (  1x)
  dmapool patches:   19 ms (477x)
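
To illustrate why the per-call lookup cost matters at this scale, here is a
rough user-space sketch. It is not the kernel's dmapool code and not the code
from these patches. It assumes a pool is modeled as an array of pages, each
covering a DMA address range, and it uses a binary search over a sorted array
as a simple stand-in for a red-black tree lookup. The names struct sketch_page,
find_page_linear() and find_page_sorted() are hypothetical and exist only for
this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one pool page: a DMA address range. */
struct sketch_page {
	uint64_t dma;	/* base DMA address of the page */
	size_t size;	/* number of bytes covered by the page */
};

/*
 * Linear scan: walk every page until one contains addr.
 * Cost is O(n) in the number of pages, so freeing every block in a
 * large pool this way costs O(n^2) overall.
 * Returns the page index, or -1 if no page contains addr.
 * *steps is incremented once per page examined.
 */
static long find_page_linear(const struct sketch_page *pages, size_t n,
			     uint64_t addr, unsigned long *steps)
{
	for (size_t i = 0; i < n; i++) {
		(*steps)++;
		if (addr >= pages[i].dma && addr - pages[i].dma < pages[i].size)
			return (long)i;
	}
	return -1;
}

/*
 * Ordered lookup: binary search over pages sorted by base address.
 * This is an assumption made for brevity; it has the same O(log n)
 * lookup cost as a balanced tree but is not a red-black tree.
 * Returns the page index, or -1 if no page contains addr.
 * *steps is incremented once per page examined.
 */
static long find_page_sorted(const struct sketch_page *pages, size_t n,
			     uint64_t addr, unsigned long *steps)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		(*steps)++;
		if (addr < pages[mid].dma)
			hi = mid;		/* addr lies below this page */
		else if (addr - pages[mid].dma >= pages[mid].size)
			lo = mid + 1;		/* addr lies above this page */
		else
			return (long)mid;	/* addr falls inside this page */
	}
	return -1;
}
```

With 12064 pages, as in the chain pool above, looking up an address in the
last page examines 12064 pages with the linear scan and at most 14 with the
ordered lookup. Repeating the lookup once per block is what turns an O(n)
scan into an O(n^2) loop.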