On Fri, Feb 2, 2024 at 8:17 PM Lance Yang <ioworker0@xxxxxxxxx> wrote:
>
> Hey Michal, David, Yang,
>
> I sincerely appreciate your time!
>
> I still have two questions that are perplexing me.
>
> First question:
> Given that khugepaged doesn't treat MADV_FREE
> pages as pte_none, why skip the 2M block when all
> the pages within the range are old and unreferenced,
> but won't skip if the partial range is MADV_FREE,
> even if it's not redirtied? Why make this distinction?
> Would it not be more straightforward to maintain
> if either all were skipped or not?

It is just a heuristic in the code and may be a somewhat arbitrary
choice. It could be controlled in a more fine-grained way if we really
see some workloads benefit.

>
> Second question:
> Does copying lazyfree pages (not redirtied) to the
> new huge page during khugepaged collapse
> undermine the semantics of MADV_FREE?
> Users mark pages as lazyfree with MADV_FREE,
> expecting these pages to be eventually reclaimed.
> Even without subsequent writes, these pages will
> no longer be reclaimed, even if memory pressure
> occurs.

Yeah, it just means khugepaged wins the race against page reclaim. I
suppose delayed free is one of the design goals of MADV_FREE, and
the risk is that the pages may never be freed. If you want
immediate free or more deterministic behavior, you should use
MADV_DONTNEED or munmap IIUC.

>
> BR,
> Lance
>
> On Sat, Feb 3, 2024 at 1:42 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
> >
> > On Fri, Feb 2, 2024 at 6:53 AM Lance Yang <ioworker0@xxxxxxxxx> wrote:
> > >
> > > How about blocking khugepaged from
> > > collapsing lazyfree pages? This way,
> > > is it not better to keep the semantics
> > > of MADV_FREE?
> > >
> > > What do you think?
> >
> > First of all, khugepaged doesn't treat MADV_FREE pages as pte_none
> > IIUC.
> > khugepaged does skip the 2M block if all the pages in the range are
> > old and unreferenced in hpage_collapse_scan_pmd(), then repeats the
> > check in collapse_huge_page().
> >
> > And MADV_FREE pages are just old and unreferenced. This is actually
> > what your first test case does. The whole 2M range is a MADV_FREE
> > range, so the pages are skipped by khugepaged.
> >
> > But if only part of the range is MADV_FREE, khugepaged won't skip it.
> > This is what your second test case does.
> >
> > Secondly, I think it depends on the semantics of MADV_FREE,
> > particularly how to treat the redirtied pages. TBH I'm always confused
> > by the semantics. For example, the page contained "abcd", then it was
> > MADV_FREE'ed, then it was written again with "1234" after "abcd".
> > Should the user expect to see "abcd1234" or "00001234"?
> >
> > I suppose it should be "abcd1234" since MADV_FREE pages are still
> > valid and available; if I'm wrong please feel free to correct me. If
> > so, we should always copy MADV_FREE pages in khugepaged regardless of
> > whether they are redirtied or not, otherwise it may incur data
> > corruption. If we don't copy, then a follow-up redirty after collapse
> > to the hugepage may return "00001234", right?
> >
> > The current behavior is copying the page.
> >
> > >
> > > Thanks,
> > > Lance
> > >
> > > On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > > >
> > > > On Fri 02-02-24 21:46:45, Lance Yang wrote:
> > > > > Here is a part from the man page explaining
> > > > > the MADV_FREE semantics:
> > > > >
> > > > > The kernel can thus free these pages, but the
> > > > > freeing could be delayed until memory pressure
> > > > > occurs. For each of the pages that has been
> > > > > marked to be freed but has not yet been freed,
> > > > > the free operation will be canceled if the caller
> > > > > writes into the page. If there is no subsequent
> > > > > write, the kernel can free the pages at any time.
> > > > >
> > > > > IIUC, if there is no subsequent write, lazyfree
> > > > > pages will eventually be reclaimed.
> > > >
> > > > If there is no memory pressure then this might not
> > > > ever happen. User cannot make any assumption about
> > > > their content once madvise call has been done. The
> > > > content has to be considered lost. Sure the userspace
> > > > might have means to tell those pages from zero pages
> > > > and recheck after the write but that is about it.
> > > >
> > > > > khugepaged
> > > > > treats lazyfree pages the same as pte_none,
> > > > > avoiding copying them to the new huge page
> > > > > during collapse. It seems that lazyfree pages
> > > > > are reclaimed before khugepaged collapses them.
> > > > > This aligns with user expectations.
> > > > >
> > > > > However, IMO, if the content of MADV_FREE pages
> > > > > remains valid during collapse, then khugepaged
> > > > > treating lazyfree pages the same as pte_none
> > > > > might not be suitable.
> > > >
> > > > Why?
> > > >
> > > > Unless I am missing something (which is possible of
> > > > course) I do not really see why dropping the content
> > > > of those pages and replacing them with a THP is any
> > > > difference from reclaiming those pages and then faulting
> > > > in a non-THP zero page.
> > > >
> > > > Now, if khugepaged reused the original content of MADV_FREE
> > > > pages that would be a slightly different story. I can
> > > > see why users would expect zero pages to back madvised
> > > > area.
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs
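
The "abcd1234" versus "00001234" question above can be probed from
userspace. What follows is only a minimal C sketch, not code from this
thread: the helper name lazyfree_redirty() is hypothetical, and it
assumes a Linux kernel where MADV_FREE is supported for private
anonymous memory. It writes "abcd", marks the page with MADV_FREE,
redirties it with "1234" at offset 4, and copies out the first 8 bytes.
Without memory pressure the result is normally "abcd1234"; if reclaim
freed the page before the second write, the first four bytes read back
as zeros. Both outcomes are permitted, which is why the old content has
to be considered lost once madvise() returns.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* Linux value, for older libc headers */
#endif

/*
 * Hypothetical helper (illustration only, not from the thread).
 * Writes "abcd" into a fresh anonymous page, marks the page lazyfree,
 * then redirties it by writing "1234" at offset 4.
 * Copies the first 8 bytes of the page into out.
 * Returns 0 on success, -1 if mmap() or madvise() fails.
 */
int lazyfree_redirty(unsigned char out[8])
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page >= 8); /* the demo touches the first 8 bytes */

    unsigned char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, "abcd", 4); /* original content */

    /* From here on the kernel may drop the page at any time. */
    if (madvise(p, (size_t)page, MADV_FREE) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    /* A write cancels the pending free for this page, if the page
     * was not already reclaimed in the meantime. */
    memcpy(p + 4, "1234", 4);

    memcpy(out, p, 8);
    munmap(p, (size_t)page);
    return 0;
}

/* Returns 1 if the first 4 bytes survived ("abcd"), 0 if they were zeroed. */
int old_content_survived(const unsigned char out[8])
{
    return memcmp(out, "abcd", 4) == 0;
}
```

Calling lazyfree_redirty() from a small main() and printing the 8
bytes is enough to try it. Replacing MADV_FREE with MADV_DONTNEED in
the sketch makes the zero-filled outcome deterministic for a private
anonymous mapping, which matches the advice above for anyone who needs
predictable behavior.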