was: (Project: Improving the PCP allocator)

On Mon, Jan 22, 2024 at 04:01:34PM +0000, Matthew Wilcox wrote:
> As I mentioned here [1], I have Thoughts on how the PCP allocator
> works in a memdesc world. Unlike my earlier Thoughts on the buddy
> allocator [2], we can actually make progress towards this one (and
> see substantial performance improvement, I believe). So it's ripe for
> someone to pick up.
>
> == With memdescs ==
>
> When we have memdescs, allocating a folio from the buddy is a two step
> process. First we allocate the struct folio from slab, then we ask the
> buddy allocator for 2^n pages, each of which gets its memdesc set to
> point to this folio. It'll be similar for other memory descriptors,
> but let's keep it simple and just talk about folios for now.
>
> Usually when we free folios, it's due to memory pressure (yes, we'll free
> memory due to truncating a file or processes exiting and freeing their
> anonymous memory, but that's secondary). That means we're likely to want
> to allocate a folio again soon. Given that, returning the struct folio
> to the slab allocator seems like a waste of time. The PCP allocator
> can hold onto the struct folio as well as the underlying memory and
> then just hand it back to the next caller of folio_alloc. This also
> saves us from having to invent a 'struct pcpdesc' and swap the memdesc
> pointer from the folio to the pcpdesc.
>
> This implies that we no longer have a single pcp allocator for all types
> of memory; rather we have one for each memdesc type. I think that's
> going to be OK, but it might introduce some problems.
>
> == Before memdescs ==
>
> Today we take all comers on the PCP list. __free_pages() calls
> free_the_page() calls free_unref_page() calls free_unref_page_prepare()
> calls free_pages_prepare() which undoes all the PageCompound work.
>
> Most multi-page allocations are compound. Slab, file, anon; it's all
> compound. I propose that we _only_ keep compound memory on the PCP list.
> Freeing non-compound multi-page memory can either convert it into compound
> pages before being placed on the PCP list or just hand the memory back
> to the buddy allocator. Non-compound multi-page allocations can either
> go straight to buddy or grab from the PCP list and undo the compound
> nature of the pages.
>
> I think this could be a huge saving. Consider allocating an order-9 PMD
> sized THP. Today we initialise compound_head in each of the 511 tail
> pages. Since a page is 64 bytes, we touch 32kB of memory! That's 2/3 of
> my CPU's L1 D$, so it's just pushed out a good chunk of my working set.
> And it's all dirty, so it has to get written back.

I've been exploring the idea of keeping compound pages as compound pages
while they reside in PCP lists. My experimental results so far suggest
that reducing the overhead of destroying and preparing compound pages
does not have a substantial impact on pft and kernbench. For the record,
let me share what I did and the results I obtained.

# Implementation

My toy implementation is here:
https://gitlab.com/hyeyoo/linux/-/tree/pcp-keep-compound

In this implementation, I modified free_pages_prepare() to skip
destroying compound pages when they are being freed to the PCP. If the
kernel fails to acquire the PCP lock, it destroys the compound page and
frees it to the buddy instead. Because order > 0 pages in the PCP are
compound pages, rmqueue() does not prepare compound pages that are
already compound. rmqueue_bulk() prepares compound pages when grabbing
pages from the buddy, and free_pcppages_bulk() destroys compound pages
before handing them back to the buddy. Non-compound multi-page
allocations simply bypass the PCP.
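To make the intended flow concrete, here is a minimal userspace sketch
of the idea (not the actual patch; names like toy_alloc(), toy_free()
and toy_flush() are made up). A block only pays the compound
prep/teardown cost when it crosses the PCP <-> buddy boundary, not on
every free/alloc cycle that stays within the PCP:

/*
 * Toy model of "keep pages compound while they sit on the PCP list".
 * This is NOT kernel code; it only illustrates where the prep/teardown
 * work happens in the patched scheme.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_block {
	bool compound;			/* "compound metadata" already prepared? */
	struct toy_block *next;
};

/* Stand-in for one CPU's PCP free list (single-threaded toy). */
static struct toy_block *pcp_list;

static void toy_prep_compound(struct toy_block *b)
{
	/* Kernel analogue: prep_compound_page() initialising tail pages. */
	b->compound = true;
}

static void toy_destroy_compound(struct toy_block *b)
{
	/* Kernel analogue: undoing the compound state before buddy freeing. */
	b->compound = false;
}

/* Allocate: prefer the PCP list, whose blocks are already compound. */
static struct toy_block *toy_alloc(void)
{
	struct toy_block *b = pcp_list;

	if (b) {
		pcp_list = b->next;		/* PCP hit: no re-preparation */
		return b;
	}
	b = malloc(sizeof(*b));			/* "buddy" path */
	if (b)
		toy_prep_compound(b);		/* patched rmqueue_bulk() preps here */
	return b;
}

/* Free: keep the block compound and park it on the PCP list. */
static void toy_free(struct toy_block *b)
{
	b->next = pcp_list;
	pcp_list = b;
}

/* Flush the PCP back to the "buddy": only now is the compound state undone. */
static void toy_flush(void)
{
	while (pcp_list) {
		struct toy_block *b = pcp_list;

		pcp_list = b->next;
		toy_destroy_compound(b);	/* patched free_pcppages_bulk() */
		free(b);
	}
}

int main(void)
{
	struct toy_block *b = toy_alloc();	/* prepped on the buddy path */

	toy_free(b);				/* stays compound on the PCP */
	b = toy_alloc();			/* reused without re-preparation */
	printf("still compound: %d\n", b->compound);
	toy_free(b);
	toy_flush();
	return 0;
}

In the kernel case that prep/teardown is exactly the work that touches
all the tail pages (the ~32kB of struct pages for an order-9 THP
mentioned above).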
# Evaluation

My experiments were done on a Ryzen 5900X (12 cores, 24 threads) box
with 128GiB of DDR4-3200 memory. In the summaries below, '2MB-32kB-16kB'
means the kernel first tries to allocate 2MB and then falls back to
32kB/16kB. In each table the first column is the baseline kernel and the
second column is the patched kernel, with the relative change in
parentheses.

[PFT, 2MB-32kB-16kB]

# timings
Amean     system-1          1.51 (   0.00%)       1.50 (   0.20%)
Amean     system-3          4.20 (   0.00%)       4.19 (   0.39%)
Amean     system-5          7.23 (   0.00%)       7.20 (   0.48%)
Amean     system-7          8.91 (   0.00%)       8.97 (  -0.67%)
Amean     system-12        17.80 (   0.00%)      17.75 (   0.25%)
Amean     system-18        26.26 (   0.00%)      26.14 (   0.47%)
Amean     system-24        34.88 (   0.00%)      34.97 (  -0.28%)
Amean     elapsed-1         1.55 (   0.00%)       1.55 (  -0.00%)
Amean     elapsed-3         1.42 (   0.00%)       1.42 (   0.23%)
Amean     elapsed-5         1.48 (   0.00%)       1.48 (   0.11%)
Amean     elapsed-7         1.47 (   0.00%)       1.47 (  -0.11%)
Amean     elapsed-12        1.50 (   0.00%)       1.49 (   0.18%)
Amean     elapsed-18        1.53 (   0.00%)       1.53 (   0.15%)
Amean     elapsed-24        1.56 (   0.00%)       1.56 (   0.04%)

# faults
Hmean     faults/cpu-1    4268815.8976 (   0.00%)  4270575.7783 (   0.04%)
Hmean     faults/cpu-3    1547723.5433 (   0.00%)  1551782.9058 (   0.26%)
Hmean     faults/cpu-5     895050.1683 (   0.00%)   894949.2879 (  -0.01%)
Hmean     faults/cpu-7     726379.5638 (   0.00%)   719439.1475 (  -0.96%)
Hmean     faults/cpu-12    368548.1793 (   0.00%)   369313.2284 (   0.21%)
Hmean     faults/cpu-18    246883.6901 (   0.00%)   248085.6269 (   0.49%)
Hmean     faults/cpu-24    182289.8542 (   0.00%)   182349.3383 (   0.03%)
Hmean     faults/sec-1    4263408.8848 (   0.00%)  4266020.8111 (   0.06%)
Hmean     faults/sec-3    4631463.5864 (   0.00%)  4642967.3237 (   0.25%)
Hmean     faults/sec-5    4440309.3622 (   0.00%)  4445610.5062 (   0.12%)
Hmean     faults/sec-7    4482600.6360 (   0.00%)  4477462.5227 (  -0.11%)
Hmean     faults/sec-12   4402147.6818 (   0.00%)  4410717.5042 (   0.19%)
Hmean     faults/sec-18   4301172.0933 (   0.00%)  4308668.3804 (   0.17%)
Hmean     faults/sec-24   4234656.1409 (   0.00%)  4237997.7612 (   0.08%)

[PFT, 32kB-16kB]

# timings
Amean     system-1          2.34 (   0.00%)       2.38 (  -1.93%)
Amean     system-3          4.14 (   0.00%)       4.15 (  -0.10%)
Amean     system-5          7.11 (   0.00%)       7.10 (   0.16%)
Amean     system-7          9.32 (   0.00%)       9.28 (   0.42%)
Amean     system-12        17.92 (   0.00%)      17.96 (  -0.22%)
Amean     system-18        25.86 (   0.00%)      25.62 (   0.93%)
Amean     system-24        35.67 (   0.00%)      35.66 (   0.03%)
Amean     elapsed-1         2.47 (   0.00%)       2.51 *  -1.95%*
Amean     elapsed-3         1.42 (   0.00%)       1.42 (  -0.07%)
Amean     elapsed-5         1.46 (   0.00%)       1.45 (   0.32%)
Amean     elapsed-7         1.47 (   0.00%)       1.46 (   0.30%)
Amean     elapsed-12        1.51 (   0.00%)       1.51 (  -0.31%)
Amean     elapsed-18        1.53 (   0.00%)       1.52 (   0.22%)
Amean     elapsed-24        1.52 (   0.00%)       1.52 (   0.09%)

# faults
Hmean     faults/cpu-1    2674791.9210 (   0.00%)  2622289.9694 *  -1.96%*
Hmean     faults/cpu-3    1548744.0561 (   0.00%)  1547049.2525 (  -0.11%)
Hmean     faults/cpu-5     909932.0813 (   0.00%)   911274.4409 (   0.15%)
Hmean     faults/cpu-7     697042.1240 (   0.00%)   700115.5680 (   0.44%)
Hmean     faults/cpu-12    365314.7268 (   0.00%)   364452.9596 (  -0.24%)
Hmean     faults/cpu-18    251711.1949 (   0.00%)   253133.5633 (   0.57%)
Hmean     faults/cpu-24    183212.0204 (   0.00%)   183350.4502 (   0.08%)
Hmean     faults/sec-1    2673435.8060 (   0.00%)  2621259.7412 *  -1.95%*
Hmean     faults/sec-3    4637478.9272 (   0.00%)  4634486.4260 (  -0.06%)
Hmean     faults/sec-5    4531908.1729 (   0.00%)  4541788.9414 (   0.22%)
Hmean     faults/sec-7    4482779.5699 (   0.00%)  4499528.9948 (   0.37%)
Hmean     faults/sec-12   4365201.3350 (   0.00%)  4353874.4404 (  -0.26%)
Hmean     faults/sec-18   4314991.0014 (   0.00%)  4321977.0563 (   0.16%)
Hmean     faults/sec-24   4344249.9657 (   0.00%)  4345263.1455 (   0.02%)
[PFT, 16kB]

# timings
Amean     system-1          3.13 (   0.00%)       3.22 *  -2.84%*
Amean     system-3          4.31 (   0.00%)       4.38 (  -1.70%)
Amean     system-5          6.93 (   0.00%)       6.95 (  -0.21%)
Amean     system-7          9.40 (   0.00%)       9.47 (  -0.66%)
Amean     system-12        17.72 (   0.00%)      17.75 (  -0.15%)
Amean     system-18        26.19 (   0.00%)      26.12 (   0.24%)
Amean     system-24        35.41 (   0.00%)      35.31 (   0.29%)
Amean     elapsed-1         3.34 (   0.00%)       3.43 *  -2.79%*
Amean     elapsed-3         1.50 (   0.00%)       1.52 *  -1.67%*
Amean     elapsed-5         1.44 (   0.00%)       1.44 (  -0.09%)
Amean     elapsed-7         1.46 (   0.00%)       1.46 (  -0.43%)
Amean     elapsed-12        1.50 (   0.00%)       1.50 (  -0.04%)
Amean     elapsed-18        1.50 (   0.00%)       1.49 (   0.18%)
Amean     elapsed-24        1.51 (   0.00%)       1.50 (   0.58%)

# faults
Hmean     faults/cpu-1    1975802.9271 (   0.00%)  1921547.2378 *  -2.75%*
Hmean     faults/cpu-3    1468850.1816 (   0.00%)  1444987.8129 *  -1.62%*
Hmean     faults/cpu-5     921763.7641 (   0.00%)   920183.2144 (  -0.17%)
Hmean     faults/cpu-7     686678.6801 (   0.00%)   681824.6507 (  -0.71%)
Hmean     faults/cpu-12    367972.3092 (   0.00%)   367485.9867 (  -0.13%)
Hmean     faults/cpu-18    248991.2074 (   0.00%)   249470.5429 (   0.19%)
Hmean     faults/cpu-24    184426.2136 (   0.00%)   184968.3682 (   0.29%)
Hmean     faults/sec-1    1975110.1317 (   0.00%)  1920789.4502 *  -2.75%*
Hmean     faults/sec-3    4400629.2802 (   0.00%)  4327578.7473 *  -1.66%*
Hmean     faults/sec-5    4594228.5709 (   0.00%)  4589647.3446 (  -0.10%)
Hmean     faults/sec-7    4524451.1451 (   0.00%)  4502414.9629 (  -0.49%)
Hmean     faults/sec-12   4391814.1854 (   0.00%)  4392709.6380 (   0.02%)
Hmean     faults/sec-18   4406851.3807 (   0.00%)  4409517.3005 (   0.06%)
Hmean     faults/sec-24   4376471.9465 (   0.00%)  4395938.4025 (   0.44%)

[kernbench, 2MB-32kB-16kB THP]

Amean     user-24       16236.12 (   0.00%)   16402.41 (  -1.02%)
Amean     syst-24        1289.28 (   0.00%)    1329.39 *  -3.11%*
Amean     elsp-24         763.75 (   0.00%)     775.03 (  -1.48%)

[kernbench, 32kB-16kB THP]

Amean     user-24       16334.94 (   0.00%)   16314.30 (   0.13%)
Amean     syst-24        1565.50 (   0.00%)    1609.74 *  -2.83%*
Amean     elsp-24         779.84 (   0.00%)     780.55 (  -0.09%)

[kernbench, 16kB]

Amean     user-24       16205.47 (   0.00%)   16282.73 *  -0.48%*
Amean     syst-24        1784.03 (   0.00%)    1837.13 *  -2.98%*
Amean     elsp-24         787.37 (   0.00%)     791.00 *  -0.46%*

I suspected that making zone->lock in free_pcppages_bulk() a little
finer-grained in my implementation had a negative impact, so I changed
it back to coarse-grained locking and ran the experiments again, but it
was still not a clear win on kernbench and pft.

So... yeah. The results for this idea are not that encouraging. Based on
the results, my theories are:

1) The overhead of preparing compound pages is too small to show a
   measurable difference (prep_compound_page() accounts for less than
   0.5% of the total profile on pft).

2) By the time the kernel prepares compound pages, a large chunk of the
   CPU cache has already been dirtied (e.g. when allocating a 2MB anon
   THP, the kernel touches the whole 2MB of memory), so dirtying 1.56%
   more cache lines (the ~32kB of struct pages relative to the 2MB
   allocation) does not have a huge impact on performance.

--
Cheers,
Harry