Chris Li <chrisl@xxxxxxxxxx> writes:

> On Mon, Jun 17, 2024 at 11:56 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Chris Li <chrisl@xxxxxxxxxx> writes:
>>
>> > That is in general true with all kernel development regardless of
>> > using options or not. If there is a bug in my patch, I will need to
>> > debug and fix it or the patch might be reverted.
>> >
>> > I don't see that as a reason to take the option path or not. The
>> > option just means the user taking this option will need to understand
>> > the trade off and accept the defined behavior of that option.
>>
>> User configuration knobs are not forbidden for Linux kernel. But we are
>> more careful about them because they will introduce ABI which we need to
>> maintain forever. And they are hard to be used for users. Optimizing
>> automatically is generally the better solution. So, I suggest you to
>> think more about the automatically solution before diving into a new
>> option.
>
> I did, see my reply. Right now there are just no other options.
>
>> >> >>
>> >> >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs.
>> >> >
>> >> > Me too. I prefer transparent over reservation if it can achieve the
>> >> > same goal. Do we have a fully transparent method spec out? How to
>> >> > achieve fully transparent and also avoid fragmentation caused by mix
>> >> > order allocation/free?
>> >> >
>> >> > Keep in mind that we are still in the early stage of the mTHP swap
>> >> > development, I can have the reservation patch relatively easily. If
>> >> > you come up with a better transparent method patch which can achieve
>> >> > the same goal later, we can use it instead.
>> >>
>> >> Because we are still in the early stage, I think that we should try to
>> >> improve transparent solution firstly. Personally, what I don't like is
>> >> that we don't work on the transparent solution because we have the
>> >> reservation solution.
>> >
>> > Do you have a road map or the design for the transparent solution you can share?
>> > I am interested to know what is the short term step(e.g. a month) in
>> > this transparent solution you have in mind, so we can compare the
>> > different approaches. I can't reason much just by the name
>> > "transparent solution" itself. Need more technical details.
>> >
>> > Right now we have a clear usage case we want to support, the swap
>> > in/out mTHP with bigger zsmalloc buffers. We can start with the
>> > limited usage case first then move to more general ones.
>>
>> TBH, This is what I don't like. It appears that you refuse to think
>> about the transparent (or automatic) solution.
>
> Actually, that is not true, you make the wrong assumption about what I
> have considered. I want to find out what you have in mind to compare
> the near term solutions.

Sorry about my wrong assumption.

> In my recent LSF slide I already list 3 options to address this
> fragmentation problem.
> From easy to hard:
> 1) Assign cluster an order on allocation and remember the cluster
> order. (short term).
> That is this patch series
> 2) Buddy allocation on the swap entry (longer term)
> 3) Folio write out compound discontinuous swap entry. (ultimate)
>
> I also considered 4), which I did not put into the slide, because it
> is less effective than 3)
> 4) migrating the swap entries, which require scan page table entry.
> I briefly mentioned it during the session.

Or you need something like a rmap, that isn't easy.

> 3) should might qualify as your transparent solution. It is just much
> harder to implement.
> Even when we have 3), having some form of 1) can be beneficial as
> well. (less IO count, no indirect layer of swap offset).
>
>>
>> I haven't thought about them thoroughly, but at least we may think about
>>
>> - promoting low order non-full cluster when we find a free high order
>>   swap entries.
>>
>> - stealing a low order non-full cluster with low usage count for
>>   high-order allocation.
>
> Now we are talking.
> These two above fall well within 2) the buddy allocators
> But the buddy allocator will not be able to address all fragmentation
> issues, due to the allocator not being controlled the life cycle of
> the swap entry.
> It will not help Barry's zsmalloc usage case much because android
> likes to keep the swapfile full. I can already see that.

I think that a buddy-like allocator (not exactly the buddy algorithm) will
help with fragmentation. And it will help more users because it works
automatically.

I don't think they are too hard to implement. We can try to find some
simple solution first. So, I think that we don't need to push them to
the long term. At least, they can be done before introducing the
high-order cluster reservation ABI. Then, we can evaluate the benefit and
overhead of the reservation ABI.

>> - freeing more swap entries when swap devices become fragmented.
>
> That requires a scan page table to free the swap entry, basically 4).

No. You can just scan the page table of the current process in
do_swap_page() and try to swap in and free more swap entries. That
doesn't work well for shared pages. However, I think that it can
help quite a few workloads.

> It is all about investment and return. 1) is relatively easy to
> implement and with good improvement and return.

[snip]

--
Best Regards,
Huang, Ying
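
For readers following the thread, option 1) quoted above ("assign cluster an order on allocation and remember the cluster order") can be illustrated with a toy model. This is only a sketch under stated assumptions, not the patch series itself: the names `Cluster`, `SwapArea`, `alloc`, `release` and the constant `CLUSTER_SLOTS` are hypothetical and do not correspond to kernel symbols, and the cluster size of 512 slots is an assumed value.

```python
CLUSTER_SLOTS = 512  # assumed number of swap slots per cluster (hypothetical)


class Cluster:
    """One swap cluster that remembers the order it was first used for."""

    def __init__(self, base):
        self.base = base          # first swap offset covered by this cluster
        self.order = None         # None means the cluster is unassigned
        self.free_offsets = []    # aligned free offsets, valid once order is set


class SwapArea:
    """Toy swap area: each cluster only serves entries of a single order."""

    def __init__(self, nr_clusters):
        self.clusters = [Cluster(i * CLUSTER_SLOTS) for i in range(nr_clusters)]

    def alloc(self, order):
        size = 1 << order
        # First try a cluster that was already assigned this order.
        for c in self.clusters:
            if c.order == order and c.free_offsets:
                return c.base + c.free_offsets.pop()
        # Otherwise assign the order to an unassigned cluster.
        for c in self.clusters:
            if c.order is None:
                c.order = order
                # Descending list so that pop() hands out the lowest offset.
                c.free_offsets = list(range(CLUSTER_SLOTS - size, -1, -size))
                return c.base + c.free_offsets.pop()
        return None  # no cluster of this order has room and none is unassigned

    def release(self, offset, order):
        c = self.clusters[offset // CLUSTER_SLOTS]
        c.free_offsets.append(offset - c.base)
        # A fully free cluster forgets its order and can serve any order again.
        if len(c.free_offsets) == CLUSTER_SLOTS >> order:
            c.order = None
            c.free_offsets = []
```

In this model, order-0 and order-2 allocations never share a cluster, so high-order entries stay naturally aligned and contiguous. The model also shows the limitation debated in the thread: once every cluster has been assigned an order, an allocation of a new order fails until some cluster becomes completely free, which is the gap that the "promoting" and "stealing" ideas quoted above would try to close.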