Chris Li <chrisl@xxxxxxxxxx> writes:

> On Thu, May 30, 2024 at 7:37 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Chris Li <chrisl@xxxxxxxxxx> writes:
>>
>> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > because Android does not have too many CPUs. We are talking about a
>> > handful of clusters, which might not justify the code complexity. It
>> > does not change the behavior that order 0 can pollute higher orders.
>>
>> I have a feeling that you don't really know why swap_map[] is scanned.
>> I suggest you do more testing and tracing to find out the reason. I
>> suspect that there are some non-full cluster collection issues.
>
> swap_map[] is scanned because we run out of non-full clusters. This
> can happen because Android tries to make full use of the swapfile.
> However, once the swap_map[] scan happens, the non-full clusters are
> polluted.
>
> I currently don't have a local reproduction of the issue Barry
> reported. However, here is one data point: with two swap files, one
> used for high-order allocation only with this patch, there is no
> fallback. If there were a non-full cluster collection issue, we would
> see the fallback in this case as well.
>
> BTW, with the same setup but without this patch series, the high-order
> allocation falls back as well.
>
>> >> Another issue is that nonfull_cluster[order1] cannot be used for
>> >> nonfull_cluster[order2]. By definition, we should not fail order-0
>> >> allocation; we need to steal nonfull_cluster[order>0] for order-0
>> >> allocation. This can avoid scanning swap_map[] too. This may not be
>> >> perfect, but it is the simplest first-step implementation. You can
>> >> optimize based on it further.
>> >
>> > Yes, that is listed as a limitation of this cluster order approach.
>> > Initially we need to support one order well first. We might choose
>> > which order that is, a 16K or 64K folio. 4K pages are too small, 2M
>> > pages are too big. The sweet spot might be somewhere in between. If
>> > we can support one order well, we can demonstrate the value of mTHP.
>> > We can worry about other mixed orders later.
>> >
>> > Do you have any suggestions for how to prevent order 0 from polluting
>> > the higher-order clusters? If we allow that to happen, then it
>> > defeats the goal of being able to allocate higher-order swap entries.
>> > The tricky question is that we don't know how much swap space we
>> > should reserve for each order. We can always break higher-order
>> > clusters into lower orders, but can't do the reverse. The current
>> > patch series lets the actual usage determine the percentage of
>> > clusters used for each order. However, that seems not enough for the
>> > test case Barry has. When an app gets OOM-killed is when a large
>> > swing of order-0 swap shows up, with not enough higher-order usage
>> > for that brief moment. The order-0 swap entries will then pollute the
>> > high-order clusters. We are currently debating a "knob" to be able
>> > to reserve a certain % of swap space for a certain order. Those
>> > reservations will be guaranteed, and order-0 swap entries can't
>> > pollute them even when swap space runs out. That can make mTHP at
>> > least usable for the Android case.
>>
>> IMO, the bottom line is that order-0 allocation is the first-class
>> citizen; we must keep it optimized. And OOM with free swap space
>> isn't acceptable. Please consider the policy we used for page
>> allocation.
>
> We need to make both order-0 and high-order allocation work after the
> initial pass of allocating from empty clusters. Only order-0
> allocation working is not good enough.
>
> On the page allocation side, we have hugetlbfs, which reserves some
> memory for high-order pages. We should have something similar to
> allow reserving some high-order swap entries without them getting
> polluted by low-order ones.

TBH, I don't like the idea of high-order swap entry reservation. If
that's really important for you, I think that it's better to design
something like hugetlbfs vs. core mm, that is, separated from the
normal swap subsystem as much as possible.

>> > Do you see another way to protect the high-order clusters from being
>> > polluted by lower-order ones?
>>
>> If we use high-order page allocation as a reference, we need something
>> like compaction to guarantee high-order allocation eventually. But we
>> are too far from that.
>
> We should consider reservation for high-order swap entry allocation,
> similar to hugetlbfs for memory.
>
> Swap compaction will be very complicated because it needs to scan the
> PTEs to migrate the swap entries. It might be easier to support
> writing folios out as compound discontiguous swap entries. That is
> another way to address the fragmentation issue. We are also too far
> from that right now.

It's not easy to write out compound discontiguous swap entries either.
For example, how do we put such folios in the swap cache?

>> For a specific configuration, I believe that we can get a reasonable
>> high-order swap entry allocation success rate for specific use cases.
>> For example, if we only allow a limited maximum number of order-0
>> swap entry allocations, can we keep the high-order clusters?
>
> Yes, we can, by having a knob to reserve some high-order swap space.
> Limiting order 0 is the same as having some high-order swap entries
> reserved.
>
> That is a short-term solution.

--
Best Regards,
Huang, Ying
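[Editor's note] The reservation "knob" debated in the thread above can be sketched as a tiny userspace model. All names here (`struct swap_pool`, `can_steal_cluster`, `SWAP_NR_ORDERS`, the `reserved[]` array) are hypothetical illustrations, not actual kernel identifiers; the point is only the invariant being discussed: an order-0 allocation may take a free cluster only if the reserved minimums of all higher orders are still satisfied afterward.

```c
/*
 * Hypothetical model of per-order swap cluster reservation.
 * Not kernel code: a sketch of the policy under discussion, where
 * low-order allocations cannot "pollute" clusters reserved for
 * higher orders, analogous to hugetlbfs reserving high-order pages.
 */
#include <assert.h>
#include <stdbool.h>

#define SWAP_NR_ORDERS 4

struct swap_pool {
	int free_clusters;            /* empty clusters available */
	int reserved[SWAP_NR_ORDERS]; /* minimum clusters kept per order */
};

/*
 * An allocation of the given order may take one free cluster only if
 * the remaining free clusters still cover the reservations of every
 * higher order.  Breaking a high-order cluster into lower orders is
 * always possible; this check prevents the irreversible direction.
 */
static bool can_steal_cluster(const struct swap_pool *p, int order)
{
	int reserved_above = 0;

	for (int o = order + 1; o < SWAP_NR_ORDERS; o++)
		reserved_above += p->reserved[o];

	return p->free_clusters - 1 >= reserved_above;
}
```

With two free clusters and two reserved for the highest order, an order-0 request is refused (it would have to fall back to scanning, or fail), while a request at the highest order still succeeds; this mirrors the "limiting order 0 is the same as reserving high-order entries" equivalence noted at the end of the mail.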