Re: [PATCH v6 0/6] ksm: support tracking KSM-placed zero-pages

David Hildenbrand <david@xxxxxxxxxx> · Mon, 13 Mar 2023 14:03:33 +0100

On 10.02.23 02:15, yang.yang29@xxxxxxxxxx wrote:
From: xu xin <xu.xin16@xxxxxxxxxx>

Hi,

sorry for the late follow-up. Still wrapping my head around this and 
possible alternatives. I hope we'll get some comments from others as 
well about the basic approach.

The core idea of this patch set is to let users see the number of all
pages merged by KSM, regardless of whether the use_zero_pages switch is turned
on, so that they can tell how much of the increase in free memory is really due
to their madvise(MERGEABLE) calls. The problem is that, with use_zero_pages
enabled, all empty pages are merged with the kernel zero page instead of with
each other (as they are when use_zero_pages is disabled), and these zero pages
are then no longer monitored by KSM.

My motivation for doing this comes down to three points:

1) MADV_UNMERGEABLE and other ways to trigger unsharing will *not*
    unshare the shared zeropage as placed by KSM (which is against the
    MADV_UNMERGEABLE documentation at least); see the link:
    https://lore.kernel.org/lkml/4a3daba6-18f9-d252-697c-197f65578c44@xxxxxxxxxx/

2) We cannot know how many pages are zero pages placed by KSM when
    use_zero_pages is enabled, which hides critical information about
    how much memory is really saved by KSM. Knowing the number of
    KSM-placed zero pages helps users apply madvise(MERGEABLE) more
    effectively, because they can see the actual benefit brought by KSM.

3) The zero pages placed by KSM are different from initial empty pages
    (filled with zeros) that are never touched by applications. The former
    are actively merged by KSM, while the latter have never consumed actual
    memory.

I agree with all of the above, but it's still unclear to me if there is 
a real downside to a simpler approach:

(1) Tracking the shared zeropages. That would be fairly easy: whenever
    we map/unmap a shared zeropage, we simply update the counter.

(2) Unmerging all shared zeropages inside the VMAs during
    MADV_UNMERGEABLE.

(3) Documenting that MADV_UNMERGEABLE will also unmerge the shared
    zeropage when toggle xy is flipped.

It's certainly simpler and doesn't rely on the rmap item. See below.
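
To make (1) and (2) a bit more concrete, here is a rough user-space model, 
not kernel code. All names in it (model_vma, model_ksm_place_zeropage(), 
the ksm_zero_pages counter, ...) are made up for illustration, and it 
sidesteps the question of how a KSM-placed zeropage is recognized by 
simply giving it its own slot state.

```c
/* User-space model only: none of these names are kernel APIs. */
#include <assert.h>
#include <stddef.h>

enum slot_state {
    SLOT_NONE,          /* nothing mapped */
    SLOT_ANON,          /* private anonymous page */
    SLOT_KSM_ZEROPAGE,  /* shared zeropage placed by KSM */
};

struct model_vma {
    enum slot_state slot[8];
    size_t nr_slots;
};

static long ksm_zero_pages; /* hypothetical counter of KSM-placed zeropages */

/* (1) Map: KSM replaces an anonymous page with the shared zeropage. */
static void model_ksm_place_zeropage(struct model_vma *vma, size_t i)
{
    assert(i < vma->nr_slots);
    if (vma->slot[i] != SLOT_ANON)
        return; /* nothing to deduplicate here */
    vma->slot[i] = SLOT_KSM_ZEROPAGE;
    ksm_zero_pages++;
}

/* (1) Unmap: zapping a KSM-placed zeropage drops the counter. */
static void model_zap(struct model_vma *vma, size_t i)
{
    assert(i < vma->nr_slots);
    if (vma->slot[i] == SLOT_KSM_ZEROPAGE)
        ksm_zero_pages--;
    vma->slot[i] = SLOT_NONE;
}

/* (2) MADV_UNMERGEABLE: every KSM-placed zeropage gets its own page again. */
static void model_unmerge(struct model_vma *vma)
{
    for (size_t i = 0; i < vma->nr_slots; i++) {
        if (vma->slot[i] == SLOT_KSM_ZEROPAGE) {
            vma->slot[i] = SLOT_ANON;
            ksm_zero_pages--;
        }
    }
}
```

The point of the model is only that the counter is touched in exactly 
three places (place, zap, unmerge) and that no per-page rmap item is 
involved.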

use_zero_pages is useful not only because of cache colouring, as described
in the documentation, but also because it can accelerate the merging of empty
pages when there are plenty of them (full of zeros), since the time spent on
page-by-page comparisons (unstable_tree_search_insert) is saved. So we hope to
implement KSM zero-page tracking without affecting the use_zero_pages
feature.

Zero pages may be the most commonly merged pages in real environments (not only
VMs but also other applications such as containers). Enabling use_zero_pages in
an environment with plenty of empty pages (full of zeros) is very useful. Users
and application developers can also benefit from knowing the proportion of zero
pages among all merged pages when optimizing applications.

I agree with that point, especially after I read in a paper that KSM 
applied to some applications mainly deduplicates pages filled with 0s. 
So it seems like a reasonable thing to optimize for.

With this patch series, we can both accurately unshare KSM-placed zero pages
and count KSM zero pages while use_zero_pages is enabled.

The problem I see with this approach is that it fundamentally relies on 
the rmap/stable-tree to detect whether a zeropage was placed or not.

I was wondering why we even need an rmap item *at all* anymore. Why 
can't we place the shared zeropage and call it a day (remove the rmap 
item)? Once we have placed a shared zeropage, the next KSM scan should 
simply ignore it; it's already deduplicated.

So if most pages we deduplicate are shared zeropages, it would be quite 
interesting to reduce the memory overhead by avoiding rmap items, instead 
of building new functionality on top of them.

If we really wanted to identify whether a zeropage was deduplicated by 
KSM, we could try storing that information inside the PTE instead of 
inside the rmap item. Then, we could directly adjust the counter when 
zapping the shared zeropage or during MADV_DONTNEED/when unmerging.

Eventually, we could simply say that
* !pte_dirty(): zeropage placed during fault
* pte_dirty(): zeropage placed by KSM

Then it would also be easy to adjust counters and unmerge. We'd limit 
this handling to known-working architectures initially (sparc64 still has 
the issue that pte_mkdirty() will set a pte writable ... and my patch to 
fix that has not been merged yet). We'd have to double-check all 
pte_mkdirty()/pte_mkclean() callsites.
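
As a rough illustration of the dirty-bit idea, here is a user-space 
model, not kernel code. The bit layout, the model_* helpers and the 
ksm_zero_pages counter are all assumptions made up for this sketch; real 
PTE layouts are architecture-specific.

```c
/* User-space model only: bit layout and helper names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t model_pte_t;

#define MODEL_PTE_PRESENT  (1ull << 0)
#define MODEL_PTE_DIRTY    (1ull << 1)
#define MODEL_PTE_ZEROPAGE (1ull << 2) /* stands in for "maps the shared zeropage" */

static long ksm_zero_pages; /* hypothetical counter of KSM-placed zeropages */

static bool model_pte_dirty(model_pte_t pte)
{
    return (pte & MODEL_PTE_DIRTY) != 0;
}

static model_pte_t model_pte_mkdirty(model_pte_t pte)
{
    return pte | MODEL_PTE_DIRTY;
}

/* Zeropage placed during a read fault: left clean. */
static model_pte_t model_zeropage_from_fault(void)
{
    return MODEL_PTE_PRESENT | MODEL_PTE_ZEROPAGE;
}

/* Zeropage placed by KSM: marked dirty and counted. */
static model_pte_t model_zeropage_from_ksm(void)
{
    ksm_zero_pages++;
    return model_pte_mkdirty(MODEL_PTE_PRESENT | MODEL_PTE_ZEROPAGE);
}

/* The PTE alone tells us who placed the zeropage; no rmap item needed. */
static bool model_is_ksm_zeropage(model_pte_t pte)
{
    return (pte & MODEL_PTE_PRESENT) && (pte & MODEL_PTE_ZEROPAGE) &&
           model_pte_dirty(pte);
}

/* Zapping (or unmerging) adjusts the counter directly from the PTE. */
static model_pte_t model_zap_pte(model_pte_t pte)
{
    if (model_is_ksm_zeropage(pte))
        ksm_zero_pages--;
    return 0; /* cleared entry */
}
```

The model also shows why the pte_mkdirty()/pte_mkclean() callsites 
matter: any path that flips the dirty bit on a zeropage PTE would 
silently change how the entry is classified.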

--
Thanks,

David / dhildenb