On 29/07/2024 04:52, Matthew Wilcox wrote:
> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
>> A user space interface can be implemented to select different swap-in
>> order policies, similar to the mTHP allocation order policy. We need
>> a distinct policy because the performance characteristics of memory
>> allocation differ significantly from those of swap-in. For example,
>> SSD read speeds can be much slower than memory allocation. With
>> policy selection, I believe we can implement mTHP swap-in for
>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
>> the implications of their choices. I think that it's better to start
>> with at least always never. I believe that we will add auto in the
>> future to tune automatically, which can be used as default finally.
>
> I strongly disagree. Use the same sysctl as the other anonymous memory
> allocations.

I vaguely recall arguing in the past that just because the user has
requested 2M THP, that doesn't mean it's the right thing to do for
performance to swap in the whole 2M in one go. That's potentially a
pretty huge latency, depending on where the backend is, and it could be
a waste of IO if the application never touches most of the 2M. That
said, the fact that the application hinted for a 2M THP in the first
place hopefully means that it is storing objects that need to be
accessed at similar times. Today it will be swapped in page-by-page and
then eventually collapsed by khugepaged.

But I think those arguments become weaker as the THP size gets smaller.
16K/64K swap-in will likely yield significant performance improvements,
and I think Barry has numbers for this?

So I guess we have a few options:

- Just use the same sysfs interface as for anon allocation, and see if
  anyone reports performance regressions. Investigate one of the
  options below if an issue is raised. That's the simplest and cleanest
  approach, I think.
- New sysfs interface as Barry has implemented; nobody really wants
  more controls if it can be helped.
- Hardcode a size limit (e.g. 64K); I've tried this in a few different
  contexts and never got any traction.
- Secret option 4: Can we allocate a full-size folio but only choose to
  swap in to it bit-by-bit? You would need a way to mark which pages of
  the folio are valid (e.g. a per-page flag), but I guess that's a
  non-starter given the strategy to remove per-page flags?

Thanks,
Ryan