Ryan Roberts <ryan.roberts@xxxxxxx> writes:

> Hi All,
>
> This is an RFC for a small series to add support for swapping out small-sized
> THP without needing to first split the large folio via __split_huge_page(). It
> closely follows the approach already used by PMD-sized THP.
>
> "Small-sized THP" is an upcoming feature that enables performance improvements
> by allocating large folios for anonymous memory, where the large folio size is
> smaller than the traditional PMD-size. See [1].
>
> In some circumstances I've observed a performance regression (see patch 2 for
> details), and this series is an attempt to fix the regression in advance of
> merging small-sized THP support.
>
> I've done what I thought was the smallest change possible, and as a result, this
> approach is only employed when the swap is backed by a non-rotating block device
> (just as PMD-sized THP is supported today). However, I have a few questions on
> whether we should consider relaxing those requirements in certain circumstances:
>
>
> 1) block-backed vs file-backed
> ==============================
>
> The code only attempts to allocate a contiguous set of entries if swap is backed
> by a block device (i.e. not file-backed). The original commit, f0eea189e8e9
> ("mm, THP, swap: don't allocate huge cluster for file backed swap device"),
> stated "It's hard to write a whole transparent huge page (THP) to a file backed
> swap device". But didn't state why. Does this imply there is a size limit at
> which it becomes hard? And does that therefore imply that for "small enough"
> sizes we should now allow use with file-back swap?
>
> This original commit was subsequently fixed with commit 41663430588c ("mm, THP,
> swap: fix allocating cluster for swapfile by mistake"), which said the original
> commit was using the wrong flag to determine if it was a block device and
> therefore in some cases was actually doing large allocations for a file-backed
> swap device, and this was causing file-system corruption. But that implies some
> sort of correctness issue to me, rather than the performance issue I inferred
> from the original commit.
>
> If anyone can offer an explanation, that would be helpful in determining if we
> should allow some large sizes for file-backed swap.

Swap uses 'swap extents' (swap_info_struct.swap_extent_root) to map a swap
offset to a storage block number. For block-backed swap the mapping is
purely linear, so you can use an arbitrarily large page size. For
file-backed swap, only PAGE_SIZE alignment is guaranteed, so a run of
contiguous swap offsets is not necessarily contiguous on storage.

> 2) rotating vs non-rotating
> ===========================
>
> I notice that the clustered approach is only used for non-rotating swap. That
> implies that for rotating media, we will always fail a large allocation, and
> fall back to splitting THPs to single pages. Which implies that the regression
> I'm fixing here may still be present on rotating media? Or perhaps rotating disk
> is so slow that the cost of writing the data out dominates the cost of
> splitting?
>
> I considered that potentially the free swap entry search algorithm that is used
> in this case could be modified to look for (small) contiguous runs of entries;
> Up to ~16 pages (order-4) could be done by doing 2x 64bit reads from map instead
> of single byte.
>
> I haven't looked into this idea in detail, but wonder if anybody thinks it is
> worth the effort? Or perhaps it would end up causing bad fragmentation.

I doubt anybody uses rotating storage to back swap nowadays.
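
Coming back to point 1, here is a rough userspace sketch of why the extent
mapping matters for large folios. It is not kernel code: struct extent is a
hypothetical, simplified stand-in for the kernel's swap extent (a flat array
instead of the rb-tree under swap_extent_root), and the helper names and the
example extent layouts are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified swap extent; all units are PAGE_SIZE pages. */
struct extent {
	uint64_t start_page;	/* first swap offset covered by the extent */
	uint64_t nr_pages;	/* number of swap pages in the extent */
	uint64_t start_block;	/* storage block backing start_page */
};

/* Map a swap offset to a storage block; false if no extent covers it. */
bool offset_to_block(const struct extent *ext, size_t n,
		     uint64_t offset, uint64_t *block)
{
	for (size_t i = 0; i < n; i++) {
		if (offset >= ext[i].start_page &&
		    offset - ext[i].start_page < ext[i].nr_pages) {
			*block = ext[i].start_block +
				 (offset - ext[i].start_page);
			return true;
		}
	}
	return false;
}

/* True if nr swap entries starting at offset are contiguous on storage. */
bool run_is_contiguous(const struct extent *ext, size_t n,
		       uint64_t offset, uint64_t nr)
{
	uint64_t first, cur;

	assert(nr > 0);
	if (!offset_to_block(ext, n, offset, &first))
		return false;
	for (uint64_t i = 1; i < nr; i++) {
		if (!offset_to_block(ext, n, offset + i, &cur) ||
		    cur != first + i)
			return false;
	}
	return true;
}

/* Block device: one extent, purely linear (made-up size). */
const struct extent block_dev[] = { { 0, 1024, 0 } };

/* Swap file: two extents at unrelated places on storage (made-up layout). */
const struct extent swap_file[] = { { 0, 6, 100 }, { 6, 10, 500 } };
```

With these example layouts, any run on block_dev is contiguous, while a
4-entry run at offset 4 of swap_file straddles the two extents and is not,
so it could not be written out as a single contiguous I/O.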

> Finally on testing, I've run the mm selftests and see no regressions, but I
> don't think there is anything in there specifically aimed towards swap? Are
> there any functional or performance tests that I should run? It would certainly
> be good to confirm I haven't regressed PMD-size THP swap performance.

I have used the swap sub-test cases of vm-scalability to test.

https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/

--
Best Regards,
Huang, Ying