Chris Li <chrisl@xxxxxxxxxx> writes:

> On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Chris Li <chrisl@xxxxxxxxxx> writes:
>>
>> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >>
>> >> Chris Li <chrisl@xxxxxxxxxx> writes:
>> >>
>> >> >
>> >> > On the page allocation side, we have hugetlbfs, which reserves
>> >> > some memory for high order pages.
>> >> > We should have something similar to allow reserving some high
>> >> > order swap entries without getting them polluted by low order
>> >> > ones.
>> >>
>> >> TBH, I don't like the idea of high order swap entries reservation.
>> >
>> > May I know more about why you don't like the idea? I understand
>> > this can be controversial, because previously we liked to take THP
>> > as a best effort approach. If there is some reason we can't make
>> > THP, we use order 0 as the fallback.
>> >
>> > For discussion purposes, I want to break it down into smaller
>> > steps:
>> >
>> > First, can we agree that the following usage case is reasonable:
>> > as Barry has shown, zsmalloc can compress at sizes bigger than 4K
>> > and get both a better compression ratio and a CPU performance
>> > gain.
>> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@xxxxxxxxx/
>> >
>> > So the goal is to give THP/mTHP a reasonable success rate in mixed
>> > size swap allocation, even after low order or high order swap
>> > requests overflow the swap file size. The allocator can still
>> > recover from that after some swap entries get freed.
>> >
>> > Please let me know if you think the above usage case and goal are
>> > not reasonable for the kernel.
>>
>> I think that it's reasonable to improve the success rate of high-order
>
> Glad to hear that.
>
>> swap entries allocation. I just think that it's hard to use the
>> reservation based method. For example, how much should be reserved?
>
> Understood, it is harder to use than a fully transparent method, but
> still better than no solution at all. The alternative right now is
> that we can't do it at all.
>
> Regarding how much we should reserve: similarly, how do you choose
> your swap file size? If you choose N, why not N*120% or N*80%? That
> did not stop us from having a swapfile, right?
>
>> Why system OOM when there's still swap space available? And so forth.
>
> Keep in mind that the reservation is an option. If you prefer the old
> behavior, you don't have to use the reservation. That shouldn't be a
> reason to stop others who want to use it. We don't have an
> alternative solution for long running mixed size allocation yet. If
> there is one, I would like to hear it.

It's not enough to make it optional. When you run into an issue, you
need to debug it. And you may debug an issue on a system that was
configured by someone else.

>> So, I prefer the transparent methods. Just like THP vs. hugetlbfs.
>
> Me too. I prefer transparent over reservation if it can achieve the
> same goal. Do we have a fully transparent method specced out? How do
> we achieve full transparency and also avoid the fragmentation caused
> by mixed order allocation/free?
>
> Keep in mind that we are still in the early stage of mTHP swap
> development. I can have the reservation patch ready relatively
> easily. If you come up with a better transparent method patch that
> can achieve the same goal later, we can use it instead.

Because we are still in the early stage, I think that we should try to
improve the transparent solution first.

Personally, what I don't like is the prospect that we stop working on
the transparent solution because we already have the reservation
solution.
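To make my concern concrete, below is a rough sketch of the
reservation semantics as I understand them. All names here are made up
for discussion; this is not a real implementation or actual allocator
code.

  /*
   * Sketch only: hypothetical reservation semantics, not real kernel
   * code.  A pool of clusters is set aside for order > 0 allocations.
   */
  #include <linux/types.h>

  struct hi_order_reserve {
          unsigned long nr_free_normal;   /* free clusters outside the reserve */
          unsigned long nr_free_reserved; /* free clusters kept for order > 0 */
  };

  static bool may_alloc_cluster(struct hi_order_reserve *r, int order)
  {
          /*
           * Order-0 allocations never dip into the reserve, otherwise
           * 4K entries would fragment the high order pool.
           */
          if (order == 0)
                  return r->nr_free_normal > 0;
          /* High order uses the reserve first, then falls back. */
          return r->nr_free_reserved > 0 || r->nr_free_normal > 0;
  }

If this is roughly the behavior, then an order-0 swap-out can fail,
and the system may OOM, while nr_free_reserved clusters are still
free. Whoever debugs that needs to know how the reservation was
configured, and may not be the person who configured it.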
>> >> that's really important for you, I think that it's better to design
>> >> something like hugetlbfs vs core mm, that is, be separated from the
>> >> normal swap subsystem as much as possible.
>> >
>> > I brought up hugetlbfs just to make the point of using
>> > reservation, or isolation of the resource, to prevent the mixed
>> > order fragmentation that exists in core mm.
>> > I am not suggesting copying the hugetlbfs implementation to the
>> > swap system. Unlike hugetlbfs, swap allocation is typically done
>> > by the kernel; it is transparent to the application. I don't think
>> > separating from the swap subsystem is a good way to go.
>> >
>> > This comes down to why you don't like the reservation. E.g. if we
>> > used two swapfiles, one allocated purely for high order, would
>> > that be better?
>>
>> Sorry, my words weren't accurate. Personally, I just think that it's
>> better to make the reservation related code not too intrusive.
>
> Yes. I will try to make it not too intrusive.
>
>> And, before reservation, we need to consider something else first.
>> Is it generally good to swap in with the swap-out order? Should we
>
> When we have the reservation patch (or other means to sustain mixed
> size swap allocation/free), we can test it out to get more data to
> reason about it.
> I consider the swap-in size policy an orthogonal issue.

No. I don't think so. If you swap out in a higher order but swap in
in a lower order, you make the swap clusters fragmented.

>> consider memory wastage too? One static policy doesn't fit all, we
>> may need either a dynamic policy, or make the policy configurable.
>> In general, I think that we need to do this step by step.
>
> The core swap layer needs to be able to sustain mixed size swap
> allocation/free in the long run. Without that, the swap-in size
> policy is meaningless.
>
> Yes, that is the step by step approach: allowing long running mixed
> size swap allocation as the first step.
>
>> >> >> > Do you see another way to protect the high order clusters from
>> >> >> > getting polluted by lower order ones?
>> >> >>
>> >> >> If we use high-order page allocation as a reference, we need
>> >> >> something like compaction to guarantee high-order allocation
>> >> >> eventually. But we are too far from that.
>> >> >
>> >> > We should consider reservation for high-order swap entry
>> >> > allocation, similar to hugetlbfs for memory.
>> >> > Swap compaction would be very complicated because it needs to
>> >> > scan the PTEs to migrate the swap entries. It might be easier to
>> >> > support folio write-out with compound discontiguous swap entries.
>> >> > That is another way to address the fragmentation issue. We are
>> >> > also too far from that right now.
>> >>
>> >> It's not easy to write out compound discontiguous swap entries
>> >> either. For example, how do we put such folios in the swap cache?
>> >
>> > I proposed the idea in the recent LSF/MM discussion; the last few
>> > slides are about discontiguous swap, and it has discontiguous
>> > entries in the swap cache.
>> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
>> >
>> > Agreed, it is not an easy change. The swap cache would have to drop
>> > the assumption that all offsets are contiguous.
>> > For swap, we already have some in-memory data associated with each
>> > offset, so it might provide an opportunity to combine the
>> > offset-related data structures for swap together.
>> > Another alternative might be using the xarray without the
>> > multi-entry property, just treating each offset like a single
>> > entry. I haven't dug deep into this direction yet.
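To check that I understand this direction: is it something like the
sketch below, where each subpage offset is stored as its own xarray
entry, so the offsets backing one large folio no longer need to be
contiguous? The names are hypothetical and locking/unwinding is
omitted; this is not the actual swap cache code.

  /*
   * Sketch only: one xarray entry per swap offset.  Hypothetical
   * names, not the actual swap cache implementation.
   */
  #include <linux/xarray.h>
  #include <linux/mm_types.h>
  #include <linux/gfp.h>

  static int add_folio_discontig(struct xarray *swap_cache,
                                 struct folio *folio,
                                 const pgoff_t *offsets, unsigned int nr)
  {
          unsigned int i;

          /*
           * Store the folio once per subpage offset.  The nr offsets
           * need not be contiguous; a lookup by any of them finds the
           * same folio.  A real implementation would need to unwind
           * the earlier stores on failure and take the proper locks.
           */
          for (i = 0; i < nr; i++) {
                  void *old = xa_store(swap_cache, offsets[i], folio,
                                       GFP_KERNEL);
                  if (xa_is_err(old))
                          return xa_err(old);
          }
          return 0;
  }

If so, the cost is nr individual stores instead of one multi-index
store. And, if I read the current code correctly, folio->swap can only
record a single starting entry today, so the lookup and writeback
paths would need rework as well.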
>> Thanks! I will study your idea.
>>
>
> I am happy to discuss if you have any questions.
>
>> > We can have more discussion, maybe arrange an upstream alignment
>> > meeting if there is interest.
>>
>> Sure.
>
> Ideally, if we can resolve our differences over the mailing list,
> then we don't need to have a separate meeting :-)

--
Best Regards,
Huang, Ying