Hey Zach, Yang, Michal, and David,

Please accept my sincerest apologies for the delayed response.

Thanks for the replies; they've been very helpful to me! I also
appreciate the valuable information you've shared!

I agree that it's not a good idea to let khugepaged avoid any pages
marked with MADV_FREE.

Thanks again for your time!

Best,
Lance

On Tue, Feb 6, 2024 at 4:27 AM Zach O'Keefe <zokeefe@xxxxxxxxxx> wrote:
>
> On Mon, Feb 5, 2024 at 11:43 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
> >
> > On Mon, Feb 5, 2024 at 1:45 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> > >
> > > On Fri 02-02-24 09:42:27, Yang Shi wrote:
> > > > But if the partial range is MADV_FREE, khugepaged won't skip them.
> > > > This is what your second test case does.
> > > >
> > > > Secondly, I think it depends on the semantics of MADV_FREE,
> > > > particularly how to treat the redirtied pages. TBH I'm always confused
> > > > by the semantics. For example, the page contained "abcd", then it was
> > > > MADV_FREE'ed, then it was written again with "1234" after "abcd". So
> > > > the user should expect to see "abcd1234" or "00001234".
> > >
> > > Correct. You cannot assume the content of the first page as it could
> > > have been reclaimed at any time.
> > >
> > > > I suppose it should be "abcd1234" since MADV_FREE pages are still
> > > > valid and available; if I'm wrong please feel free to correct me. If
> > > > so we should always copy MADV_FREE pages in khugepaged regardless of
> > > > whether they are redirtied or not, otherwise it may incur data corruption.
> > > > If we don't copy, then the follow-up redirty after collapse to the
> > > > hugepage may return "00001234", right?
> > >
> > > Right. As pointed out above this is a valid outcome if the page has been
> > > dropped. User has means to tell that from /proc/vmstat though. Not with
> > > great precision but I think it would be really surprising to not see any
> > > pglazyfreed yet the content is gone. I think it would be legit to call
> > > it a bug. One could argue the bug would be in the accounting rather than
> > > the khugepaged implementation because madvised pages could be dropped at
> > > any time. But I think it makes more sense to copy the existing content.
>
> +1. I agree that the content should be dropped iff pglazyfreed is
> incremented. Of course, we could do that here, but I don't think there
> is a good reason to, and we should just copy the contents.
>
> > Yeah, as long as khugepaged sees the MADV_FREE pages, it means they
> > have "NOT" been dropped yet. They may be dropped later if memory
> > pressure occurs, but anyway khugepaged wins the race and khugepaged
> > can't assume the pages will be dropped before they get redirtied. So
> > copying the content does make sense.
>
> Per Lance, I kinda get that this "undermines" MADV_FREE, insofar as,
> from the user's perspective, the memory which was intended as a
> buffer against OOM kill scenarios is no longer there to reclaim trivially. I
> don't have a real world example where this is an issue, however. Also,
> not copying the contents doesn't change that fact.
>
> The proper alternative, if you want to make the "undermining"
> argument, is for khugepaged to stay away from hugepage regions with
> any MADV_FREE pages. I think it's fair to assume MADV_FREE'd memory is
> likely cold memory, and therefore not a good hugepage target anyways.
> However, it'd be unfortunate if there were a couple MADV_FREE pages in
> the middle of an otherwise hot / highly-utilized hugepage region that
> would prevent it from being pmd-mapped via khugepaged. But.. this is
> exactly-ish what you get when hugepage-aware system/runtime allocators
> split THPs to free up internal caches.
>
> Best,
> Zach
>
> > > --
> > > Michal Hocko
> > > SUSE Labs
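
For anyone who wants to poke at the "abcd1234" versus "00001234" question
from userspace, here is a minimal sketch. It is not from the thread. It
assumes Linux 4.5 or later (for MADV_FREE and the pglazyfreed counter) and
Python 3.8 or later (for mmap.madvise); the helper names pglazyfreed() and
redirty_after_madv_free() are made up for illustration. It writes "abcd" to
a private anonymous page, marks the page MADV_FREE, writes "1234" right
after it, and reads the first eight bytes back. It exercises plain
MADV_FREE only and does not involve khugepaged or a collapse.

```python
import mmap


def pglazyfreed():
    """Return the pglazyfreed counter from /proc/vmstat, or None if unavailable."""
    try:
        with open("/proc/vmstat") as f:
            for line in f:
                name, _, value = line.partition(" ")
                if name == "pglazyfreed":
                    return int(value)
    except OSError:
        pass
    return None


def redirty_after_madv_free():
    """Write "abcd", MADV_FREE the page, write "1234" after it, read 8 bytes back.

    Returns None when MADV_FREE is not exposed or the kernel rejects it.
    """
    advice = getattr(mmap, "MADV_FREE", None)
    if advice is None:
        return None
    # MADV_FREE only applies to private anonymous memory, so MAP_PRIVATE is
    # requested explicitly (Python's default for mmap is MAP_SHARED).
    m = mmap.mmap(-1, mmap.PAGESIZE,
                  flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        m[0:4] = b"abcd"
        try:
            m.madvise(advice)      # the page may now be reclaimed at any time
        except OSError:
            return None
        m[4:8] = b"1234"           # redirty the page after "abcd"
        return bytes(m[0:8])
    finally:
        m.close()


if __name__ == "__main__":
    before = pglazyfreed()
    data = redirty_after_madv_free()
    after = pglazyfreed()
    print("first 8 bytes:", data)
    print("pglazyfreed before/after:", before, after)
```

Either b"abcd1234" or four zero bytes followed by b"1234" is a legal
result, as Michal points out above. pglazyfreed is a system-wide counter,
so a change between the two reads only hints that lazily freed pages were
reclaimed somewhere, not necessarily this one.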