Barry Song <21cnbao@xxxxxxxxx> writes:

> On Wed, Mar 20, 2024 at 3:20 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>
>>> On 19/03/2024 09:20, Huang, Ying wrote:
>>>
>>>> Ryan Roberts <ryan.roberts@xxxxxxx> writes:
>>>>
>>>>>>>> I agree phones are not the only platform. But Rome wasn't built in a day. I can only get started on hardware which I can easily reach and for which I have enough hardware/test resources. So we may take the first step, which can be applied to a real product and improve its performance, and then, step by step, broaden it and make it widely useful to various areas which I can't reach :-)
>>>>>>>
>>>>>>> We must guarantee that the normal swap path runs correctly and has no performance regression while developing the SWP_SYNCHRONOUS_IO optimization. So we have to put some effort into testing the normal path anyway.
>>>>>>>
>>>>>>>> so probably we can have a sysfs "enable" entry with default "n", or a maximum swap-in order, per Ryan's suggestion [1] at the beginning:
>>>>>>>>
>>>>>>>> "
>>>>>>>> So in the common case, swap-in will pull in the same size of folio as was swapped-out. Is that definitely the right policy for all folio sizes? Certainly it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure it makes sense for 2M THP; as the size increases, the chances of actually needing all of the folio reduce, so chances are we are wasting IO. There are similar arguments for CoW, where we currently copy 1 page per fault - it probably makes sense to copy the whole folio up to a certain size.
>>>>>>>> "
>>>>>
>>>>> I thought about this a bit more. No clear conclusions, but hoped this might help the discussion around policy:
>>>>>
>>>>> The decision about the size of the THP is made at first fault, with some help from user space, and in future we might make decisions to split based on munmap/mremap/etc. hints. In an ideal world, the fact that we have had to swap the THP out at some point in its lifetime should not impact its size. It's just being moved around in the system, and the reason for our original decision should still hold.
>>>>>
>>>>> So from that PoV, it would be good to swap in to the same size that was swapped out.
>>>>
>>>> Sorry, I don't agree with this. It's better to swap in and swap out in the smallest size if the page is accessed only seldom, to avoid wasting memory.
>>>
>>> If we want to optimize only for memory consumption, I'm sure there are many things we would do differently. We need to find a balance between memory and performance. The benefits of folios are well documented, and the kernel is heading in the direction of managing memory in variable-sized blocks. So I don't think it's as simple as saying we should always swap in the smallest possible amount of memory.
>>
>> It's conditional, that is,
>>
>> "if the page is only accessed seldom"
>>
>> Then the swapped-in page will be swapped out again soon, and the adjacent pages in the same large folio will not be accessed in the meantime.
>>
>> So, I suggest creating an algorithm that decides the swap-in order automatically, based on swap-readahead information. It can detect the situation above via a reduced swap-readahead window size. And if the page is accessed for quite a long time, and the adjacent pages in the same large folio are accessed too, the swap-readahead window will grow and a larger swap-in order will be used.
>
> The original size chosen in do_anonymous_page() should be honored, considering it embodies a decision influenced not only by sysfs settings and per-VMA HUGEPAGE hints but also by architectural characteristics, for example CONT-PTE.
>
> The model you're proposing may offer memory-saving benefits or reduce I/O, but it entirely dissociates the swap-in size from the size prior to swap-out.

Readahead isn't the only factor that determines the folio order. For example, we must respect the "never" policy, which always requires order-0 folios. There's no requirement to use the swap-out order for swap-in either. Memory allocation has a different performance character from storage reading.
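To make the discussion concrete, a minimal sketch of what such a heuristic might look like follows; swapin_ra_window() and mthp_max_enabled_order() are hypothetical placeholders for the readahead statistics and the per-size sysfs policy, not existing kernel interfaces:

/*
 * Illustrative sketch only, not against any real tree: derive a
 * swap-in folio order from the current readahead window, then clamp
 * it with the mTHP sysfs policy.
 */
static int swapin_order(struct vm_area_struct *vma, unsigned long addr)
{
	/*
	 * Hypothetical helper: window size in pages.  It shrinks when
	 * readahead pages go unused (the "accessed seldom" case) and
	 * grows when faults keep hitting adjacent pages.
	 */
	unsigned int win = swapin_ra_window(vma, addr);
	int order = win > 1 ? ilog2(win) : 0;

	/*
	 * Hypothetical helper: largest order enabled via sysfs for
	 * this VMA; returns 0 when the policy is "never".
	 */
	order = min(order, mthp_max_enabled_order(vma));

	return order;
}

With a cold, shrinking window this degenerates to order-0 swap-in and wastes no memory; with a warm, growing window it converges towards the swapped-out size.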
> Moreover, there's no guarantee that the large folio generated by the readahead window is contiguous in the swap and can be added to the swap cache, as we are currently dealing with folio->swap instead of subpage->swap.

Yes. We can optimize only when all the conditions are satisfied, just like any other optimization.

> Incidentally, do_anonymous_page() serves as the initial location for allocating large folios. Given that memory conservation is a significant consideration in do_swap_page(), wouldn't it be even more crucial in do_anonymous_page()?

Yes, we should consider that too. IIUC, that is why mTHP support is off by default for now. After we find a way to solve the memory usage issue, we may make the default "on".

> A large folio, by its nature, represents a high-quality resource that has the potential to leverage hardware characteristics for the benefit of the entire system.

But not at the cost of memory wastage.

> Conversely, I don't believe that a randomly determined size dictated by the readahead window possesses the same advantageous qualities.

There's a readahead algorithm behind the window size; it is not purely random.

> SWP_SYNCHRONOUS_IO devices are not reliant on readahead whatsoever; their needs should also be respected.

I understand that there are special requirements for SWP_SYNCHRONOUS_IO devices. I just suggest working on the general code before the device-specific optimization.

>>> You also said we should swap *out* in the smallest size possible. Have I misunderstood you? I thought the case for swapping out a whole folio without splitting was well established and non-controversial?
>>
>> That is conditional too.
>>
>>>>> But we only kind-of keep that information around, via the swap entry contiguity and alignment. With that scheme it is possible that multiple virtually adjacent but not physically contiguous folios get swapped out to adjacent swap slot ranges and would then be swapped in to a single, larger folio. This is not ideal, and I think it would be valuable to try to maintain the original folio size information with the swap slot. One way to do this would be to store, in the cluster, the original order for which the cluster was allocated. Then we would at least know that a given swap slot is either for a folio of that order or for an order-0 folio (due to cluster exhaustion/scanning). Can we steal a bit from swap_map to determine which case it is? Or are there better approaches?
>>>>
>>>> [snip]

--
Best Regards,
Huang, Ying