On Tue, 10 Nov 2015, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> Without a max deduplication limit for each KSM page, the list of the
> rmap_items associated with each stable_node can grow infinitely
> large.
>
> During the rmap walk each entry can take up to ~10usec to process
> because of IPIs for the TLB flushing (both for the primary MMU and the
> secondary MMUs with the MMU notifier). With only 16GB of address space
> shared in the same KSM page, that would amount to dozens of seconds of
> kernel runtime (with 4KB pages, 16GB is ~4 million rmap_items, i.e.
> ~40 seconds at ~10usec each).
>
> A ~256 max deduplication factor will reduce the latencies of the rmap
> walks on KSM pages to the order of a few msec (256 * ~10usec). Just
> doing the cond_resched() during the rmap walks is not enough: the list
> size must have a limit too, otherwise the caller could get blocked in
> (schedule friendly) kernel computations for seconds, unexpectedly.
>
> There's room for optimization to significantly reduce the IPI delivery
> cost during the page_referenced(), but at least for page_migration in
> the KSM case (used by hard NUMA bindings, compaction and NUMA
> balancing) it may be inevitable to send lots of IPIs if each
> rmap_item->mm is active on a different CPU and there are lots of
> CPUs. Even if we ignore the IPI delivery cost, we still have to walk
> the whole KSM rmap list, so we can't allow millions or billions of
> entries (i.e. an unlimited number) in the KSM stable_node rmap_item
> lists.
>
> The limit is enforced efficiently by adding a second dimension to the
> stable rbtree. So there are three types of stable_nodes: the regular
> ones (identical to before, living in the first flat dimension of the
> stable rbtree), the "chains" and the "dups".
>
> Every "chain" and all "dups" linked into a "chain" enforce the
> invariant that they represent the same write protected memory content,
> even if each "dup" will be pointed to by a different KSM page copy of
> that content. This way the stable rbtree lookup computational
> complexity is unaffected compared to an unlimited
> max_sharing_limit. It is still enforced that there cannot be KSM page
> content duplicates in the stable rbtree itself.
>
> Adding the second dimension to the stable rbtree only after the
> max_page_sharing limit hits provides for a zero memory footprint
> increase on 64bit archs. The memory overhead of the per-KSM page
> stable_tree and per virtual mapping rmap_item is unchanged. Only after
> the max_page_sharing limit hits do we need to allocate a stable_tree
> "chain" and rb_replace() the "regular" stable_node with the newly
> allocated stable_node "chain". After that we simply add the "regular"
> stable_node to the chain as a stable_node "dup" by linking hlist_dup
> into the stable_node_chain->hlist. This way the "regular" (flat)
> stable_node is converted to a stable_node "dup" living in the second
> dimension of the stable rbtree.
>
> During stable rbtree lookups the stable_node "chain" is identified as
> stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
> is_stable_node_chain()).
>
> When dropping stable_nodes, the stable_node "dup" is identified as
> stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
>
> The STABLE_NODE_DUP_HEAD must be a unique valid pointer never used
> elsewhere in any stable_node->head/node to avoid clashes with the
> stable_node->node.rb_parent_color pointer, and different from
> &migrate_nodes. So the second field of &migrate_nodes is picked and
> verified as always safe with a BUILD_BUG_ON in case the list_head
> implementation changes in the future.
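>
> (An aside to illustrate the pointer trick above -- a minimal userspace
> sketch, not part of the patch; list_head here is a hypothetical
> stand-in with the same two-pointer layout as the kernel's:
>
>	#include <stddef.h>
>
>	struct list_head { struct list_head *next, *prev; };
>
>	static struct list_head migrate_nodes = {
>		&migrate_nodes, &migrate_nodes
>	};
>
>	/*
>	 * The address of the second field is properly aligned, points
>	 * inside the migrate_nodes object, and can never be equal to
>	 * &migrate_nodes itself, so it's a safe magic value for the
>	 * ->head of stable_node "dups".
>	 */
>	#define STABLE_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev)
>
>	_Static_assert(offsetof(struct list_head, prev) != 0 &&
>		       offsetof(struct list_head, prev) <
>		       sizeof(struct list_head),
>		       "marker inside the object, distinct from its base");
>
> The BUILD_BUG_ONs added to remove_node_from_stable_tree() below check
> the same property at kernel build time.)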
> The STABLE_NODE_CHAIN marker is picked as a random negative value in
> stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative
> when it's a "regular" stable_node or a stable_node "dup".
>
> The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn
> is aliased in a union with a time field used to rate limit the
> stable_node_chain->hlist prunes.
>
> The garbage collection of the stable_node_chain happens lazily during
> stable rbtree lookups (as for all other kinds of stable_nodes), or
> while disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while
> collecting the entire stable rbtree.

Hi Andrea,

I've been running stress tests against this patchset for a couple of
hours and everything was OK. However, I've allocated ~1TB of memory and
got the following lockup while disabling KSM with
'echo 2 > /sys/kernel/mm/ksm/run':

[13201.060601] INFO: task ksmd:351 blocked for more than 120 seconds.
[13201.066812]       Not tainted 4.4.0-rc4+ #5
[13201.070996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13201.078830] ksmd            D ffff883f65eb7dc8     0   351      2 0x00000000
[13201.085903]  ffff883f65eb7dc8 ffff887f66e26400 ffff883f65d5e400 ffff883f65eb8000
[13201.093343]  ffffffff81a65144 ffff883f65d5e400 00000000ffffffff ffffffff81a65148
[13201.100792]  ffff883f65eb7de0 ffffffff816907e5 ffffffff81a65140 ffff883f65eb7df0
[13201.108242] Call Trace:
[13201.110708]  [<ffffffff816907e5>] schedule+0x35/0x80
[13201.115676]  [<ffffffff81690ace>] schedule_preempt_disabled+0xe/0x10
[13201.122044]  [<ffffffff81692524>] __mutex_lock_slowpath+0xb4/0x130
[13201.128237]  [<ffffffff816925bf>] mutex_lock+0x1f/0x2f
[13201.133395]  [<ffffffff811debd2>] ksm_scan_thread+0x62/0x1f0
[13201.139068]  [<ffffffff810c8ac0>] ? wait_woken+0x80/0x80
[13201.144391]  [<ffffffff811deb70>] ? ksm_do_scan+0x1140/0x1140
[13201.150164]  [<ffffffff810a4378>] kthread+0xd8/0xf0
[13201.155056]  [<ffffffff810a42a0>] ? kthread_park+0x60/0x60
[13201.160551]  [<ffffffff8169460f>] ret_from_fork+0x3f/0x70
[13201.165961]  [<ffffffff810a42a0>] ? kthread_park+0x60/0x60

It seems this is not connected with the new code, but it would be nice
to also make unmerge_and_remove_all_rmap_items() more scheduler
friendly.

thanks,
Petr

> While the "regular" stable_nodes and the stable_node "dups" must wait
> for their underlying tree_page to be freed before they can be freed
> themselves, the stable_node "chains" can be freed immediately if the
> stable_node->hlist turns empty. This is because the "chains" are never
> pointed to by any page->mapping and they're effectively self-contained
> stable rbtree KSM metadata.
>
> Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> ---
>  Documentation/vm/ksm.txt |  63 ++++
>  mm/ksm.c                 | 731 ++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 728 insertions(+), 66 deletions(-)
>
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index f34a8ee..a7ef759 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -80,6 +80,50 @@ run - set 0 to stop ksmd from running but keep merged pages,
>             Default: 0 (must be changed to 1 to activate KSM,
>                        except if CONFIG_SYSFS is disabled)
>
> +max_page_sharing - Maximum sharing allowed for each KSM page. This
> +                   enforces a deduplication limit to avoid the virtual
> +                   memory rmap lists growing too large. The minimum
> +                   value is 2 as a newly created KSM page will have at
> +                   least two sharers. The rmap walk has O(N)
> +                   complexity where N is the number of rmap_items
> +                   (i.e. virtual mappings) that are sharing the page,
> +                   which is in turn capped by max_page_sharing. So
> +                   this effectively spreads the linear O(N)
> +                   computational complexity from rmap walk context
> +                   over different KSM pages. The ksmd walk over the
> +                   stable_node "chains" is also O(N), but N is the
> +                   number of stable_node "dups", not the number of
> +                   rmap_items, so it has no significant impact on
> +                   ksmd performance. In practice the best stable_node
> +                   "dup" candidate will be kept and found at the head
> +                   of the "dups" list. The higher this value the
> +                   faster KSM will merge the memory (because there
> +                   will be fewer stable_node dups queued into the
> +                   stable_node chain->hlist to check for pruning) and
> +                   the higher the deduplication factor will be, but
> +                   the slower the worst case rmap walk could be for
> +                   any given KSM page. Slowing down the rmap_walk
> +                   means there will be higher latency for certain
> +                   virtual memory operations happening during
> +                   swapping, compaction, NUMA balancing and page
> +                   migration, in turn decreasing responsiveness for
> +                   the caller of those virtual memory operations. The
> +                   scheduler latency of other tasks not involved with
> +                   the VM operations doing the rmap walk is not
> +                   affected by this parameter as the rmap walks are
> +                   always schedule friendly themselves.
> +
> +stable_node_chains_prune_millisecs - How frequently to walk the whole
> +                   list of stable_node "dups" linked in the
> +                   stable_node "chains" in order to prune stale
> +                   stable_nodes. Smaller millisecs values will free
> +                   up the KSM metadata with lower latency, but they
> +                   will make ksmd use more CPU during the scan. This
> +                   only applies to the stable_node chains so it's a
> +                   no-op if no KSM page has hit the max_page_sharing
> +                   limit yet (there would be no stable_node chains in
> +                   that case).
> +
>  The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
>
>  pages_shared - how many shared pages are being used
> @@ -88,10 +132,29 @@ pages_unshared - how many pages unique but repeatedly checked for merging
>  pages_volatile - how many pages changing too fast to be placed in a tree
>  full_scans - how many times all mergeable areas have been scanned
>
> +stable_node_chains - number of stable node chains allocated; this is
> +                     effectively the number of KSM pages that hit the
> +                     max_page_sharing limit
> +stable_node_dups - number of stable node dups queued into the
> +                   stable_node chains
> +
>  A high ratio of pages_sharing to pages_shared indicates good sharing, but
>  a high ratio of pages_unshared to pages_sharing indicates wasted effort.
>  pages_volatile embraces several different kinds of activity, but a high
>  proportion there would also indicate poor use of madvise MADV_MERGEABLE.
>
> +The maximum possible pages_sharing/pages_shared ratio is limited by the
> +max_page_sharing tunable. To increase the ratio max_page_sharing must
> +be increased accordingly.
> +
> +The stable_node_dups/stable_node_chains ratio is also affected by the
> +max_page_sharing tunable, and a high ratio may indicate fragmentation
> +in the stable_node dups, which could be solved by introducing
> +defragmentation algorithms in ksmd which would refile rmap_items from
> +one stable_node dup to another stable_node dup, in order to free up
> +stable_node "dups" with few rmap_items in them, but that may increase
> +the ksmd CPU usage and possibly slow down the read-only computations on
> +the KSM pages of the applications.
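> +
> +As an illustrative example (the values below are hypothetical, not
> +defaults), the deduplication factor can be raised and its effect
> +observed through the new counters:
> +
> +	echo 1024 > /sys/kernel/mm/ksm/max_page_sharing
> +	cat /sys/kernel/mm/ksm/stable_node_chains
> +	cat /sys/kernel/mm/ksm/stable_node_dups
> +
> +Note that writing max_page_sharing fails with EBUSY while there still
> +are shared KSM pages: the knob can only be changed when nothing is
> +currently merged.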
> +
>  Izik Eidus,
>  Hugh Dickins, 17 Nov 2009
> diff --git a/mm/ksm.c b/mm/ksm.c
> index b5cd647..fd0bf51 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -126,9 +126,12 @@ struct ksm_scan {
>   * struct stable_node - node of the stable rbtree
>   * @node: rb node of this ksm page in the stable tree
>   * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list
> + * @hlist_dup: linked into the stable_node->hlist with a stable_node chain
>   * @list: linked into migrate_nodes, pending placement in the proper node tree
>   * @hlist: hlist head of rmap_items using this ksm page
>   * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid)
> + * @chain_prune_time: time of the last full garbage collection
> + * @rmap_hlist_len: number of rmap_item entries in hlist or STABLE_NODE_CHAIN
>   * @nid: NUMA node id of stable tree in which linked (may not match kpfn)
>   */
>  struct stable_node {
> @@ -136,11 +139,24 @@ struct stable_node {
>  		struct rb_node node;	/* when node of stable tree */
>  		struct {		/* when listed for migration */
>  			struct list_head *head;
> -			struct list_head list;
> +			struct {
> +				struct hlist_node hlist_dup;
> +				struct list_head list;
> +			};
>  		};
>  	};
>  	struct hlist_head hlist;
> -	unsigned long kpfn;
> +	union {
> +		unsigned long kpfn;
> +		unsigned long chain_prune_time;
> +	};
> +	/*
> +	 * STABLE_NODE_CHAIN can be any negative number in
> +	 * rmap_hlist_len negative range, but better not -1 to be able
> +	 * to reliably detect underflows.
> +	 */
> +#define STABLE_NODE_CHAIN -1024
> +	int rmap_hlist_len;
>  #ifdef CONFIG_NUMA
>  	int nid;
>  #endif
> @@ -190,6 +206,7 @@ static struct rb_root *root_unstable_tree = one_unstable_tree;
>
>  /* Recently migrated nodes of stable tree, pending proper placement */
>  static LIST_HEAD(migrate_nodes);
> +#define STABLE_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev)
>
>  #define MM_SLOTS_HASH_BITS 10
>  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> @@ -217,6 +234,18 @@ static unsigned long ksm_pages_unshared;
>  /* The number of rmap_items in use: to calculate pages_volatile */
>  static unsigned long ksm_rmap_items;
>
> +/* The number of stable_node chains */
> +static unsigned long ksm_stable_node_chains;
> +
> +/* The number of stable_node dups linked to the stable_node chains */
> +static unsigned long ksm_stable_node_dups;
> +
> +/* Delay in pruning stale stable_node_dups in the stable_node_chains */
> +static int ksm_stable_node_chains_prune_millisecs = 2000;
> +
> +/* Maximum number of page slots sharing a stable node */
> +static int ksm_max_page_sharing = 256;
> +
>  /* Number of pages ksmd should scan in one batch */
>  static unsigned int ksm_thread_pages_to_scan = 100;
>
> @@ -279,6 +308,44 @@ static void __init ksm_slab_free(void)
>  	mm_slot_cache = NULL;
>  }
>
> +static __always_inline bool is_stable_node_chain(struct stable_node *chain)
> +{
> +	return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
> +}
> +
> +static __always_inline bool is_stable_node_dup(struct stable_node *dup)
> +{
> +	return dup->head == STABLE_NODE_DUP_HEAD;
> +}
> +
> +static inline void stable_node_chain_add_dup(struct stable_node *dup,
> +					     struct stable_node *chain)
> +{
> +	VM_BUG_ON(is_stable_node_dup(dup));
> +	dup->head = STABLE_NODE_DUP_HEAD;
> +	VM_BUG_ON(!is_stable_node_chain(chain));
> +	hlist_add_head(&dup->hlist_dup, &chain->hlist);
> +	ksm_stable_node_dups++;
> +}
> +
> +static inline void __stable_node_dup_del(struct stable_node *dup)
> +{
> +	hlist_del(&dup->hlist_dup);
> +	ksm_stable_node_dups--;
> +}
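>
> (Another aside, not part of the patch: given the two helpers above, the
> three stable_node flavors can be told apart as in this hypothetical
> sketch --
>
>	enum node_kind { NODE_REGULAR, NODE_CHAIN, NODE_DUP };
>
>	static enum node_kind classify(struct stable_node *node)
>	{
>		/* chains store STABLE_NODE_CHAIN in rmap_hlist_len */
>		if (is_stable_node_chain(node))
>			return NODE_CHAIN;
>		/* dups are marked by the magic ->head pointer */
>		if (is_stable_node_dup(node))
>			return NODE_DUP;
>		/* else: flat rbtree node, possibly on migrate_nodes */
>		return NODE_REGULAR;
>	}
>
> Only regular nodes and chains live in the rbtree proper; dups are only
> reachable through a chain's ->hlist.)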
> +
> +static inline void stable_node_dup_del(struct stable_node *dup)
> +{
> +	VM_BUG_ON(is_stable_node_chain(dup));
> +	if (is_stable_node_dup(dup))
> +		__stable_node_dup_del(dup);
> +	else
> +		rb_erase(&dup->node, root_stable_tree + NUMA(dup->nid));
> +#ifdef CONFIG_DEBUG_VM
> +	dup->head = NULL;
> +#endif
> +}
> +
>  static inline struct rmap_item *alloc_rmap_item(void)
>  {
>  	struct rmap_item *rmap_item;
> @@ -303,6 +370,8 @@ static inline struct stable_node *alloc_stable_node(void)
>
>  static inline void free_stable_node(struct stable_node *stable_node)
>  {
> +	VM_BUG_ON(stable_node->rmap_hlist_len &&
> +		  !is_stable_node_chain(stable_node));
>  	kmem_cache_free(stable_node_cache, stable_node);
>  }
>
> @@ -493,25 +562,80 @@ static inline int get_kpfn_nid(unsigned long kpfn)
>  	return ksm_merge_across_nodes ? 0 : NUMA(pfn_to_nid(kpfn));
>  }
>
> +static struct stable_node *alloc_stable_node_chain(struct stable_node *dup,
> +						   struct rb_root *root)
> +{
> +	struct stable_node *chain = alloc_stable_node();
> +	VM_BUG_ON(is_stable_node_chain(dup));
> +	if (likely(chain)) {
> +		INIT_HLIST_HEAD(&chain->hlist);
> +		chain->chain_prune_time = jiffies;
> +		chain->rmap_hlist_len = STABLE_NODE_CHAIN;
> +#ifdef CONFIG_DEBUG_VM
> +		chain->nid = -1; /* debug */
> +#endif
> +		ksm_stable_node_chains++;
> +
> +		/*
> +		 * Put the stable node chain in the first dimension of
> +		 * the stable tree and at the same time remove the old
> +		 * stable node.
> +		 */
> +		rb_replace_node(&dup->node, &chain->node, root);
> +
> +		/*
> +		 * Move the old stable node to the second dimension
> +		 * queued in the hlist_dup. The invariant is that all
> +		 * dup stable_nodes in the chain->hlist point to pages
> +		 * that are wrprotected and have the exact same
> +		 * content.
> +		 */
> +		stable_node_chain_add_dup(dup, chain);
> +	}
> +	return chain;
> +}
> +
> +static inline void free_stable_node_chain(struct stable_node *chain,
> +					  struct rb_root *root)
> +{
> +	rb_erase(&chain->node, root);
> +	free_stable_node(chain);
> +	ksm_stable_node_chains--;
> +}
> +
>  static void remove_node_from_stable_tree(struct stable_node *stable_node)
>  {
>  	struct rmap_item *rmap_item;
>
> +	/* check it's not STABLE_NODE_CHAIN or negative */
> +	BUG_ON(stable_node->rmap_hlist_len < 0);
> +
>  	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
>  		if (rmap_item->hlist.next)
>  			ksm_pages_sharing--;
>  		else
>  			ksm_pages_shared--;
> +		VM_BUG_ON(stable_node->rmap_hlist_len <= 0);
> +		stable_node->rmap_hlist_len--;
>  		put_anon_vma(rmap_item->anon_vma);
>  		rmap_item->address &= PAGE_MASK;
>  		cond_resched();
>  	}
>
> +	/*
> +	 * We need the second aligned pointer of the migrate_nodes
> +	 * list_head to stay clear from the rb_parent_color union
> +	 * (aligned and different than any node) and also different
> +	 * from &migrate_nodes. This will verify that future list.h changes
> +	 * don't break STABLE_NODE_DUP_HEAD.
> +	 */
> +	BUILD_BUG_ON(STABLE_NODE_DUP_HEAD <= &migrate_nodes);
> +	BUILD_BUG_ON(STABLE_NODE_DUP_HEAD >= &migrate_nodes + 1);
> +
>  	if (stable_node->head == &migrate_nodes)
>  		list_del(&stable_node->list);
>  	else
> -		rb_erase(&stable_node->node,
> -			 root_stable_tree + NUMA(stable_node->nid));
> +		stable_node_dup_del(stable_node);
>  	free_stable_node(stable_node);
>  }
>
> @@ -630,6 +754,8 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
>  			ksm_pages_sharing--;
>  		else
>  			ksm_pages_shared--;
> +		VM_BUG_ON(stable_node->rmap_hlist_len <= 0);
> +		stable_node->rmap_hlist_len--;
>
>  		put_anon_vma(rmap_item->anon_vma);
>  		rmap_item->address &= PAGE_MASK;
> @@ -738,6 +864,31 @@ static int remove_stable_node(struct stable_node *stable_node)
>  	return err;
>  }
>
> +static int remove_stable_node_chain(struct stable_node *stable_node,
> +				    struct rb_root *root)
> +{
> +	struct stable_node *dup;
> +	struct hlist_node *hlist_safe;
> +
> +	if (!is_stable_node_chain(stable_node)) {
> +		VM_BUG_ON(is_stable_node_dup(stable_node));
> +		if (remove_stable_node(stable_node))
> +			return true;
> +		else
> +			return false;
> +	}
> +
> +	hlist_for_each_entry_safe(dup, hlist_safe,
> +				  &stable_node->hlist, hlist_dup) {
> +		VM_BUG_ON(!is_stable_node_dup(dup));
> +		if (remove_stable_node(dup))
> +			return true;
> +	}
> +	BUG_ON(!hlist_empty(&stable_node->hlist));
> +	free_stable_node_chain(stable_node, root);
> +	return false;
> +}
> +
>  static int remove_all_stable_nodes(void)
>  {
>  	struct stable_node *stable_node;
> @@ -749,7 +900,8 @@ static int remove_all_stable_nodes(void)
>  		while (root_stable_tree[nid].rb_node) {
>  			stable_node = rb_entry(root_stable_tree[nid].rb_node,
>  					       struct stable_node, node);
> -			if (remove_stable_node(stable_node)) {
> +			if (remove_stable_node_chain(stable_node,
> +						     root_stable_tree + nid)) {
>  				err = -EBUSY;
>  				break;	/* proceed to next nid */
>  			}
> @@ -1136,6 +1288,163 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
>  	return err ? NULL : page;
>  }
>
> +static __always_inline
> +bool __is_page_sharing_candidate(struct stable_node *stable_node, int offset)
> +{
> +	VM_BUG_ON(stable_node->rmap_hlist_len < 0);
> +	/*
> +	 * Check that at least one mapping still exists, otherwise
> +	 * there's not much point in merging and sharing with this
> +	 * stable_node, as the underlying tree_page of the other
> +	 * sharer is going to be freed soon.
> +	 */
> +	return stable_node->rmap_hlist_len &&
> +		stable_node->rmap_hlist_len + offset < ksm_max_page_sharing;
> +}
> +
> +static __always_inline
> +bool is_page_sharing_candidate(struct stable_node *stable_node)
> +{
> +	return __is_page_sharing_candidate(stable_node, 0);
> +}
> +
> +static struct stable_node *stable_node_dup(struct stable_node *stable_node,
> +					   struct page **tree_page,
> +					   struct rb_root *root,
> +					   bool prune_stale_stable_nodes)
> +{
> +	struct stable_node *dup, *found = NULL;
> +	struct hlist_node *hlist_safe;
> +	struct page *_tree_page;
> +	int nr = 0;
> +	int found_rmap_hlist_len;
> +
> +	if (!prune_stale_stable_nodes ||
> +	    time_before(jiffies, stable_node->chain_prune_time +
> +			msecs_to_jiffies(
> +				ksm_stable_node_chains_prune_millisecs)))
> +		prune_stale_stable_nodes = false;
> +	else
> +		stable_node->chain_prune_time = jiffies;
> +
> +	hlist_for_each_entry_safe(dup, hlist_safe,
> +				  &stable_node->hlist, hlist_dup) {
> +		cond_resched();
> +		/*
> +		 * We must walk all stable_node_dups to prune the stale
> +		 * stable nodes during lookup.
> +		 *
> +		 * get_ksm_page can drop the nodes from the
> +		 * stable_node->hlist if they point to freed pages
> +		 * (that's why we do a _safe walk). The "dup"
> +		 * stable_node parameter itself will be freed from
> +		 * under us if it returns NULL.
> +		 */
> +		_tree_page = get_ksm_page(dup, false);
> +		if (!_tree_page)
> +			continue;
> +		nr += 1;
> +		if (is_page_sharing_candidate(dup)) {
> +			if (!found ||
> +			    dup->rmap_hlist_len > found_rmap_hlist_len) {
> +				if (found)
> +					put_page(*tree_page);
> +				found = dup;
> +				found_rmap_hlist_len = found->rmap_hlist_len;
> +				*tree_page = _tree_page;
> +
> +				if (!prune_stale_stable_nodes)
> +					break;
> +				/* skip put_page */
> +				continue;
> +			}
> +		}
> +		put_page(_tree_page);
> +	}
> +
> +	/*
> +	 * nr is relevant only if prune_stale_stable_nodes is true,
> +	 * otherwise we may break the loop at nr == 1 even if there
> +	 * are multiple entries.
> +	 */
> +	if (prune_stale_stable_nodes && found) {
> +		if (nr == 1) {
> +			/*
> +			 * If there's not just one entry it would
> +			 * corrupt memory, better BUG_ON. In KSM
> +			 * context with no lock held it's not even
> +			 * fatal.
> +			 */
> +			BUG_ON(stable_node->hlist.first->next);
> +
> +			/*
> +			 * There's just one entry and it is below the
> +			 * deduplication limit so drop the chain.
> +			 */
> +			rb_replace_node(&stable_node->node, &found->node,
> +					root);
> +			free_stable_node(stable_node);
> +			ksm_stable_node_chains--;
> +			ksm_stable_node_dups--;
> +		} else if (__is_page_sharing_candidate(found, 1)) {
> +			/*
> +			 * Refile our candidate at the head
> +			 * after the prune if our candidate
> +			 * can accept one more future sharing
> +			 * in addition to the one underway.
> +			 */
> +			hlist_del(&found->hlist_dup);
> +			hlist_add_head(&found->hlist_dup,
> +				       &stable_node->hlist);
> +		}
> +	}
> +
> +	return found;
> +}
> +
> +static struct stable_node *stable_node_dup_any(struct stable_node *stable_node,
> +					       struct rb_root *root)
> +{
> +	if (!is_stable_node_chain(stable_node))
> +		return stable_node;
> +	if (hlist_empty(&stable_node->hlist)) {
> +		free_stable_node_chain(stable_node, root);
> +		return NULL;
> +	}
> +	return hlist_entry(stable_node->hlist.first,
> +			   typeof(*stable_node), hlist_dup);
> +}
> +
> +static struct stable_node *__stable_node_chain(struct stable_node *stable_node,
> +					       struct page **tree_page,
> +					       struct rb_root *root,
> +					       bool prune_stale_stable_nodes)
> +{
> +	if (!is_stable_node_chain(stable_node)) {
> +		if (is_page_sharing_candidate(stable_node)) {
> +			*tree_page = get_ksm_page(stable_node, false);
> +			return stable_node;
> +		}
> +		return NULL;
> +	}
> +	return stable_node_dup(stable_node, tree_page, root,
> +			       prune_stale_stable_nodes);
> +}
> +
> +static __always_inline struct stable_node *chain_prune(struct stable_node *s_n,
> +							struct page **t_p,
> +							struct rb_root *root)
> +{
> +	return __stable_node_chain(s_n, t_p, root, true);
> +}
> +
> +static __always_inline struct stable_node *chain(struct stable_node *s_n,
> +						 struct page **t_p,
> +						 struct rb_root *root)
> +{
> +	return __stable_node_chain(s_n, t_p, root, false);
> +}
> +
>  /*
>   * stable_tree_search - search for page inside the stable tree
>   *
> @@ -1151,7 +1460,7 @@ static struct page *stable_tree_search(struct page *page)
>  	struct rb_root *root;
>  	struct rb_node **new;
>  	struct rb_node *parent;
> -	struct stable_node *stable_node;
> +	struct stable_node *stable_node, *stable_node_dup, *stable_node_any;
>  	struct stable_node *page_node;
>
>  	page_node = page_stable_node(page);
> @@ -1173,7 +1482,32 @@ again:
>
>  		cond_resched();
>  		stable_node = rb_entry(*new, struct stable_node, node);
> -		tree_page = get_ksm_page(stable_node, false);
> +		stable_node_any = NULL;
> +		stable_node_dup = chain_prune(stable_node, &tree_page, root);
> +		if (!stable_node_dup) {
> +			/*
> +			 * Either all stable_node dups were full in
> +			 * this stable_node chain, or this chain was
> +			 * empty and should be rb_erased.
> +			 */
> +			stable_node_any = stable_node_dup_any(stable_node,
> +							      root);
> +			if (!stable_node_any) {
> +				/* rb_erase just ran */
> +				goto again;
> +			}
> +			/*
> +			 * Take any of the stable_node dup pages of
> +			 * this stable_node chain to let the tree walk
> +			 * continue. All KSM pages belonging to the
> +			 * stable_node dups in a stable_node chain
> +			 * have the same content and they're
> +			 * wrprotected at all times. Any will work
> +			 * fine to continue the walk.
> +			 */
> +			tree_page = get_ksm_page(stable_node_any, false);
> +		}
> +		VM_BUG_ON(!stable_node_dup ^ !!stable_node_any);
>  		if (!tree_page) {
>  			/*
>  			 * If we walked over a stale stable_node,
> @@ -1196,6 +1530,34 @@ again:
>  		else if (ret > 0)
>  			new = &parent->rb_right;
>  		else {
> +			if (page_node) {
> +				VM_BUG_ON(page_node->head != &migrate_nodes);
> +				/*
> +				 * Test if the migrated page should be merged
> +				 * into a stable node dup. If the mapcount is
> +				 * 1 we can migrate it with another KSM page
> +				 * without adding it to the chain.
> +				 */
> +				if (page_mapcount(page) > 1)
> +					goto chain_append;
> +			}
> +
> +			if (!stable_node_dup) {
> +				/*
> +				 * If the stable_node is a chain and
> +				 * we got a payload match in memcmp
> +				 * but we cannot merge the scanned
> +				 * page in any of the existing
> +				 * stable_node dups because they're
> +				 * all full, we need to wait for the
> +				 * scanned page to find itself a match
> +				 * in the unstable tree to create a
> +				 * brand new KSM page to add later to
> +				 * the dups of this stable_node.
> +				 */
> +				return NULL;
> +			}
> +
>  			/*
>  			 * Lock and unlock the stable_node's page (which
>  			 * might already have been migrated) so that page
> @@ -1203,23 +1565,21 @@ again:
>  			 * It would be more elegant to return stable_node
>  			 * than kpage, but that involves more changes.
>  			 */
> -			tree_page = get_ksm_page(stable_node, true);
> -			if (tree_page) {
> -				unlock_page(tree_page);
> -				if (get_kpfn_nid(stable_node->kpfn) !=
> -				    NUMA(stable_node->nid)) {
> -					put_page(tree_page);
> -					goto replace;
> -				}
> -				return tree_page;
> -			}
> -			/*
> -			 * There is now a place for page_node, but the tree may
> -			 * have been rebalanced, so re-evaluate parent and new.
> -			 */
> -			if (page_node)
> +			tree_page = get_ksm_page(stable_node_dup, true);
> +			if (unlikely(!tree_page))
> +				/*
> +				 * The tree may have been rebalanced,
> +				 * so re-evaluate parent and new.
> +				 */
>  				goto again;
> -			return NULL;
> +			unlock_page(tree_page);
> +
> +			if (get_kpfn_nid(stable_node_dup->kpfn) !=
> +			    NUMA(stable_node_dup->nid)) {
> +				put_page(tree_page);
> +				goto replace;
> +			}
> +			return tree_page;
>  		}
>  	}
>
> @@ -1230,22 +1590,72 @@ again:
>  	DO_NUMA(page_node->nid = nid);
>  	rb_link_node(&page_node->node, parent, new);
>  	rb_insert_color(&page_node->node, root);
> -	get_page(page);
> -	return page;
> +out:
> +	if (is_page_sharing_candidate(page_node)) {
> +		get_page(page);
> +		return page;
> +	} else
> +		return NULL;
>
>  replace:
> -	if (page_node) {
> -		list_del(&page_node->list);
> -		DO_NUMA(page_node->nid = nid);
> -		rb_replace_node(&stable_node->node, &page_node->node, root);
> -		get_page(page);
> +	if (stable_node_dup == stable_node) {
> +		/* there is no chain */
> +		if (page_node) {
> +			VM_BUG_ON(page_node->head != &migrate_nodes);
> +			list_del(&page_node->list);
> +			DO_NUMA(page_node->nid = nid);
> +			rb_replace_node(&stable_node->node, &page_node->node,
> +					root);
> +			if (is_page_sharing_candidate(page_node))
> +				get_page(page);
> +			else
> +				page = NULL;
> +		} else {
> +			rb_erase(&stable_node->node, root);
> +			page = NULL;
> +		}
>  	} else {
> -		rb_erase(&stable_node->node, root);
> -		page = NULL;
> +		VM_BUG_ON(!is_stable_node_chain(stable_node));
> +		__stable_node_dup_del(stable_node_dup);
> +		if (page_node) {
> +			VM_BUG_ON(page_node->head != &migrate_nodes);
> +			list_del(&page_node->list);
> +			DO_NUMA(page_node->nid = nid);
> +			stable_node_chain_add_dup(page_node, stable_node);
> +			if (is_page_sharing_candidate(page_node))
> +				get_page(page);
> +			else
> +				page = NULL;
> +		} else {
> +			page = NULL;
> +		}
>  	}
> -	stable_node->head = &migrate_nodes;
> -	list_add(&stable_node->list, stable_node->head);
> +	stable_node_dup->head = &migrate_nodes;
> +	list_add(&stable_node_dup->list, stable_node_dup->head);
>  	return page;
> +
> +chain_append:
> +	/* stable_node_dup could be null if it reached the limit */
> +	if (!stable_node_dup)
> +		stable_node_dup = stable_node_any;
> +	if (stable_node_dup == stable_node) {
> +		/* chain is missing so create it */
> +		stable_node = alloc_stable_node_chain(stable_node_dup,
> +						      root);
> +		if (!stable_node)
> +			return NULL;
> +	}
> +	/*
> +	 * Add this stable_node dup that was
> +	 * migrated to the stable_node chain
> +	 * of the current nid for this page
> +	 * content.
> +	 */
> +	VM_BUG_ON(page_node->head != &migrate_nodes);
> +	list_del(&page_node->list);
> +	DO_NUMA(page_node->nid = nid);
> +	stable_node_chain_add_dup(page_node, stable_node);
> +	goto out;
>  }
>
>  /*
> @@ -1262,7 +1672,8 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
>  	struct rb_root *root;
>  	struct rb_node **new;
>  	struct rb_node *parent;
> -	struct stable_node *stable_node;
> +	struct stable_node *stable_node, *stable_node_dup, *stable_node_any;
> +	bool need_chain = false;
>
>  	kpfn = page_to_pfn(kpage);
>  	nid = get_kpfn_nid(kpfn);
> @@ -1277,7 +1688,32 @@ again:
>
>  		cond_resched();
>  		stable_node = rb_entry(*new, struct stable_node, node);
> -		tree_page = get_ksm_page(stable_node, false);
> +		stable_node_any = NULL;
> +		stable_node_dup = chain(stable_node, &tree_page, root);
> +		if (!stable_node_dup) {
> +			/*
> +			 * Either all stable_node dups were full in
> +			 * this stable_node chain, or this chain was
> +			 * empty and should be rb_erased.
> +			 */
> +			stable_node_any = stable_node_dup_any(stable_node,
> +							      root);
> +			if (!stable_node_any) {
> +				/* rb_erase just ran */
> +				goto again;
> +			}
> +			/*
> +			 * Take any of the stable_node dup pages of
> +			 * this stable_node chain to let the tree walk
> +			 * continue. All KSM pages belonging to the
> +			 * stable_node dups in a stable_node chain
> +			 * have the same content and they're
> +			 * wrprotected at all times. Any will work
> +			 * fine to continue the walk.
> +			 */
> +			tree_page = get_ksm_page(stable_node_any, false);
> +		}
> +		VM_BUG_ON(!stable_node_dup ^ !!stable_node_any);
>  		if (!tree_page) {
>  			/*
>  			 * If we walked over a stale stable_node,
> @@ -1300,27 +1736,37 @@ again:
>  		else if (ret > 0)
>  			new = &parent->rb_right;
>  		else {
> -			/*
> -			 * It is not a bug that stable_tree_search() didn't
> -			 * find this node: because at that time our page was
> -			 * not yet write-protected, so may have changed since.
> -			 */
> -			return NULL;
> +			need_chain = true;
> +			break;
>  		}
>  	}
>
> -	stable_node = alloc_stable_node();
> -	if (!stable_node)
> +	stable_node_dup = alloc_stable_node();
> +	if (!stable_node_dup)
>  		return NULL;
>
> -	INIT_HLIST_HEAD(&stable_node->hlist);
> -	stable_node->kpfn = kpfn;
> -	set_page_stable_node(kpage, stable_node);
> -	DO_NUMA(stable_node->nid = nid);
> -	rb_link_node(&stable_node->node, parent, new);
> -	rb_insert_color(&stable_node->node, root);
> +	INIT_HLIST_HEAD(&stable_node_dup->hlist);
> +	stable_node_dup->kpfn = kpfn;
> +	set_page_stable_node(kpage, stable_node_dup);
> +	stable_node_dup->rmap_hlist_len = 0;
> +	DO_NUMA(stable_node_dup->nid = nid);
> +	if (!need_chain) {
> +		rb_link_node(&stable_node_dup->node, parent, new);
> +		rb_insert_color(&stable_node_dup->node, root);
> +	} else {
> +		if (!is_stable_node_chain(stable_node)) {
> +			struct stable_node *orig = stable_node;
> +			/* chain is missing so create it */
> +			stable_node = alloc_stable_node_chain(orig, root);
> +			if (!stable_node) {
> +				free_stable_node(stable_node_dup);
> +				return NULL;
> +			}
> +		}
> +		stable_node_chain_add_dup(stable_node_dup, stable_node);
> +	}
>
> -	return stable_node;
> +	return stable_node_dup;
>  }
>
>  /*
> @@ -1410,8 +1856,27 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>   * the same ksm page.
>   */
>  static void stable_tree_append(struct rmap_item *rmap_item,
> -			       struct stable_node *stable_node)
> +			       struct stable_node *stable_node,
> +			       bool max_page_sharing_bypass)
>  {
> +	/*
> +	 * rmap won't find this mapping if we don't insert the
> +	 * rmap_item in the right stable_node
> +	 * duplicate. page_migration could break later if rmap breaks,
> +	 * so we can as well crash here. We really need to check for
> +	 * rmap_hlist_len == STABLE_NODE_CHAIN, but we can as well check
> +	 * for other negative values: an underflow, if detected here
> +	 * for the first time (and not when decreasing rmap_hlist_len),
> +	 * would be a sign of memory corruption in the stable_node.
> +	 */
> +	BUG_ON(stable_node->rmap_hlist_len < 0);
> +
> +	stable_node->rmap_hlist_len++;
> +	if (!max_page_sharing_bypass)
> +		/* possibly non fatal but unexpected overflow, only warn */
> +		WARN_ON_ONCE(stable_node->rmap_hlist_len >
> +			     ksm_max_page_sharing);
> +
>  	rmap_item->head = stable_node;
>  	rmap_item->address |= STABLE_FLAG;
>  	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
> @@ -1439,19 +1904,26 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
>  	struct page *kpage;
>  	unsigned int checksum;
>  	int err;
> +	bool max_page_sharing_bypass = false;
>
>  	stable_node = page_stable_node(page);
>  	if (stable_node) {
>  		if (stable_node->head != &migrate_nodes &&
> -		    get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
> -			rb_erase(&stable_node->node,
> -				 root_stable_tree + NUMA(stable_node->nid));
> +		    get_kpfn_nid(READ_ONCE(stable_node->kpfn)) !=
> +		    NUMA(stable_node->nid)) {
> +			stable_node_dup_del(stable_node);
>  			stable_node->head = &migrate_nodes;
>  			list_add(&stable_node->list, stable_node->head);
>  		}
>  		if (stable_node->head != &migrate_nodes &&
>  		    rmap_item->head == stable_node)
>  			return;
> +		/*
> +		 * If it's a KSM fork, allow it to go over the sharing limit
> +		 * without warnings.
> +		 */
> +		if (!is_page_sharing_candidate(stable_node))
> +			max_page_sharing_bypass = true;
>  	}
>
>  	/* We first start with searching the page inside the stable tree */
> @@ -1471,7 +1943,8 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
>  		 * add its rmap_item to the stable tree.
>  		 */
>  		lock_page(kpage);
> -		stable_tree_append(rmap_item, page_stable_node(kpage));
> +		stable_tree_append(rmap_item, page_stable_node(kpage),
> +				   max_page_sharing_bypass);
>  		unlock_page(kpage);
>  	}
>  	put_page(kpage);
> @@ -1504,8 +1977,10 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
>  			lock_page(kpage);
>  			stable_node = stable_tree_insert(kpage);
>  			if (stable_node) {
> -				stable_tree_append(tree_rmap_item, stable_node);
> -				stable_tree_append(rmap_item, stable_node);
> +				stable_tree_append(tree_rmap_item, stable_node,
> +						   false);
> +				stable_tree_append(rmap_item, stable_node,
> +						   false);
>  			}
>  			unlock_page(kpage);
>
> @@ -2009,6 +2484,48 @@ static void wait_while_offlining(void)
>  	}
>  }
>
> +static bool stable_node_dup_remove_range(struct stable_node *stable_node,
> +					 unsigned long start_pfn,
> +					 unsigned long end_pfn)
> +{
> +	if (stable_node->kpfn >= start_pfn &&
> +	    stable_node->kpfn < end_pfn) {
> +		/*
> +		 * Don't get_ksm_page, page has already gone:
> +		 * which is why we keep kpfn instead of page*
> +		 */
> +		remove_node_from_stable_tree(stable_node);
> +		return true;
> +	}
> +	return false;
> +}
> +
> +static bool stable_node_chain_remove_range(struct stable_node *stable_node,
> +					   unsigned long start_pfn,
> +					   unsigned long end_pfn,
> +					   struct rb_root *root)
> +{
> +	struct stable_node *dup;
> +	struct hlist_node *hlist_safe;
> +
> +	if (!is_stable_node_chain(stable_node)) {
> +		VM_BUG_ON(is_stable_node_dup(stable_node));
> +		return stable_node_dup_remove_range(stable_node, start_pfn,
> +						    end_pfn);
> +	}
> +
> +	hlist_for_each_entry_safe(dup, hlist_safe,
> +				  &stable_node->hlist, hlist_dup) {
> +		VM_BUG_ON(!is_stable_node_dup(dup));
> +		stable_node_dup_remove_range(dup, start_pfn, end_pfn);
> +	}
> +	if (hlist_empty(&stable_node->hlist)) {
> +		free_stable_node_chain(stable_node, root);
> +		return true; /* notify caller that tree was rebalanced */
> +	} else
> +		return false;
> +}
> +
>  static void ksm_check_stable_tree(unsigned long start_pfn,
>  				  unsigned long end_pfn)
>  {
> @@ -2021,15 +2538,12 @@ static void ksm_check_stable_tree(unsigned long start_pfn,
>  		node = rb_first(root_stable_tree + nid);
>  		while (node) {
>  			stable_node = rb_entry(node, struct stable_node, node);
> -			if (stable_node->kpfn >= start_pfn &&
> -			    stable_node->kpfn < end_pfn) {
> -				/*
> -				 * Don't get_ksm_page, page has already gone:
> -				 * which is why we keep kpfn instead of page*
> -				 */
> -				remove_node_from_stable_tree(stable_node);
> +			if (stable_node_chain_remove_range(stable_node,
> +							   start_pfn, end_pfn,
> +							   root_stable_tree +
> +							   nid))
>  				node = rb_first(root_stable_tree + nid);
> -			} else
> +			else
>  				node = rb_next(node);
>  			cond_resched();
>  		}
> @@ -2254,6 +2768,47 @@ static ssize_t merge_across_nodes_store(struct kobject *kobj,
>  KSM_ATTR(merge_across_nodes);
>  #endif
>
> +static ssize_t max_page_sharing_show(struct kobject *kobj,
> +				     struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%u\n", ksm_max_page_sharing);
> +}
> +
> +static ssize_t max_page_sharing_store(struct kobject *kobj,
> +				      struct kobj_attribute *attr,
> +				      const char *buf, size_t count)
> +{
> +	int err;
> +	int knob;
> +
> +	err = kstrtoint(buf, 10, &knob);
> +	if (err)
> +		return err;
> +	/*
> +	 * When a KSM page is created it is shared by 2 mappings. This
> +	 * being a signed comparison, it implicitly verifies it's not
> +	 * negative.
> +	 */
> +	if (knob < 2)
> +		return -EINVAL;
> +
> +	if (READ_ONCE(ksm_max_page_sharing) == knob)
> +		return count;
> +
> +	mutex_lock(&ksm_thread_mutex);
> +	wait_while_offlining();
> +	if (ksm_max_page_sharing != knob) {
> +		if (ksm_pages_shared || remove_all_stable_nodes())
> +			err = -EBUSY;
> +		else
> +			ksm_max_page_sharing = knob;
> +	}
> +	mutex_unlock(&ksm_thread_mutex);
> +
> +	return err ? err : count;
> +}
> +KSM_ATTR(max_page_sharing);
> +
>  static ssize_t pages_shared_show(struct kobject *kobj,
>  				 struct kobj_attribute *attr, char *buf)
>  {
> @@ -2292,6 +2847,46 @@ static ssize_t pages_volatile_show(struct kobject *kobj,
>  }
>  KSM_ATTR_RO(pages_volatile);
>
> +static ssize_t stable_node_dups_show(struct kobject *kobj,
> +				     struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%lu\n", ksm_stable_node_dups);
> +}
> +KSM_ATTR_RO(stable_node_dups);
> +
> +static ssize_t stable_node_chains_show(struct kobject *kobj,
> +				       struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%lu\n", ksm_stable_node_chains);
> +}
> +KSM_ATTR_RO(stable_node_chains);
> +
> +static ssize_t
> +stable_node_chains_prune_millisecs_show(struct kobject *kobj,
> +					struct kobj_attribute *attr,
> +					char *buf)
> +{
> +	return sprintf(buf, "%u\n", ksm_stable_node_chains_prune_millisecs);
> +}
> +
> +static ssize_t
> +stable_node_chains_prune_millisecs_store(struct kobject *kobj,
> +					 struct kobj_attribute *attr,
> +					 const char *buf, size_t count)
> +{
> +	unsigned long msecs;
> +	int err;
> +
> +	err = kstrtoul(buf, 10, &msecs);
> +	if (err || msecs > UINT_MAX)
> +		return -EINVAL;
> +
> +	ksm_stable_node_chains_prune_millisecs = msecs;
> +
> +	return count;
> +}
> +KSM_ATTR(stable_node_chains_prune_millisecs);
> +
>  static ssize_t full_scans_show(struct kobject *kobj,
>  			       struct kobj_attribute *attr, char *buf)
>  {
> @@ -2311,6 +2906,10 @@ static struct attribute *ksm_attrs[] = {
>  #ifdef CONFIG_NUMA
>  	&merge_across_nodes_attr.attr,
>  #endif
> +	&max_page_sharing_attr.attr,
> +	&stable_node_chains_attr.attr,
> +	&stable_node_dups_attr.attr,
> +	&stable_node_chains_prune_millisecs_attr.attr,
>  	NULL,
>  };

-- 
Petr Holasek
pholasek@xxxxxxxxxx