On Wed, Jan 18, 2017 at 04:15:56PM +0100, Claudio Imbrenda wrote:
> I'm not sure it would make sense to have this for archs that don't have
> page coloring. Merging empty pages together instead of with the

It's still good to be able to exercise this code on all archs (if
nothing else for debugging purposes). There's nothing arch-dependent in
it after all.

> ZERO_PAGE() would save exactly one page and it would bring no speed
> advantages (or rather: not using the ZERO_PAGE() would not bring any
> speed penalty).
> That's why I have #ifdef'd it to have it only when page coloring is
> present. Also, for what I could see, only MIPS and s390 have page
> coloring; I don't like the idea of imposing any overhead to all the
> other archs.

With a sysctl disabled by default, the only overhead is 8 bytes and a
branch?

> I agree that this should be toggleable with a sysfs control, since it's
> a change that can potentially negatively affect the performance in some
> cases. I'm adding it in the next iteration.

Yes, the sysctl can be useful for archs doing page coloring too, but I
would add it to all archs.

> Unless the userspace in the guests is creating lots of pages full of
> zeroes :)

One question comes to mind though: why is the app doing memset(0) at
all? If the app limited itself to just reading the memory, it would use
the page-colored zero pages, which don't risk becoming PageKsm. That is
true both for bare metal and for a guest with KSM running in the host.
This looks like a suboptimal app.

> Honestly I don't think this patch will bring any benefits regarding
> metadata -- one page more or less in the metadata won't change much. Our
> issue is just the reading speed of the deduplicated empty pages.

The metadata amount does change: for each shared zero page we'd need to
allocate an rmap_item. The KSMscale work introducing
stable_node_chain/dup is precisely meant to avoid creating too large an
rmap_item chain. Because there can be plenty of those rmap_items, we
have to create multiple zero pages and multiple stable nodes for those
zero pages, to limit the maximum number of rmap_items linked to a
stable_node. That in turn limits the maximum cost of a rmap_walk on a
PageKsm during page migration for compaction or swapping etc.

The KSMscale fix is needed regardless, to avoid KSM hanging a very
large server, because there is no guarantee the most duplicated page
will be a zero page; it could be 0xff or anything. However, if one
knows there's a disproportionate amount of memory that is zero (certain
guest OSes do that, and to me that would be the main motivation for the
patch), one could prefer your sysctl instead of creating the rmap_item
metadata for the zero pages.

> > memcpy for every merge-candidate page to save some metadata, it's very
>
> I'm confused, why memcpy? did you mean memcmp? We are not doing any
> additional memops except in the case when a candidate non-empty page
> happens to have the same checksum as an empty page, in which case we
> have an extra memcmp compared to the normal operation.

I meant memcmp, sorry. So if this is further filtered by the
precomputed zero page cksum, the only remaining concern is how likely
the cksum is to give a false positive, in which case there would be one
more memcmp for every merge.

> Maybe an even less intrusive change could be to check in replace_page if
> is_zero_pfn(page_to_pfn(kpage)). And of course I would #ifdef that too,
> to avoid the overhead for archs without page coloring.

It's just one branch, the costly thing is the memcmp.
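To make it concrete, the special case in replace_page() could look more
or less like the sketch below (untested, just to illustrate the idea;
mm, addr and ptep are the locals replace_page() already has, newpte
would be a new local pte_t):

        if (is_zero_pfn(page_to_pfn(kpage))) {
                /*
                 * The zero page is never freed and has no anon_vma, so
                 * skip the refcounting and rmap bookkeeping and map it
                 * with a special pte, like a read fault on anonymous
                 * memory would do.
                 */
                newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
                                               vma->vm_page_prot));
        } else {
                get_page(kpage);
                page_add_anon_rmap(kpage, vma, addr, false);
                newpte = mk_pte(kpage, vma->vm_page_prot);
        }

        flush_cache_page(vma, addr, pte_pfn(*ptep));
        ptep_clear_flush_notify(vma, addr, ptep);
        set_pte_at_notify(mm, addr, ptep, newpte);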
As long as it's not a memcmp I wouldn't worry about one branch in
replace_page and in its caller. replace_page isn't even a fast path;
the fast path is generally the code that scans the memory. The actual
merging is not as frequent.

So I wouldn't use #ifdefs and I'd use the sysctl instead, potentially
also enabled on all archs (or only on those with page coloring, but
possible to enable manually on all archs).

> So if the replacement page is a ZERO_PAGE() no get_page() and no
> page_add_anon_rmap() would be performed, and the set_pte_at_notify()
> would have pte_mkspecial(pfn_pte(page_to_pfn(kpage))) instead of
> mk_pte().

That should fix your patch, I think, yes. Singling out zero pages was
discussed before; without KSMscale it's an incomplete fix and just a
band-aid. Actually even for your case it's incomplete and only covers a
subset of apps: what if the app initializes all RAM to 0xff instead of
zero and keeps reading from it? (It would make more sense to initialize
the memory if it wasn't already zero, in fact.) You'd get the same
slowdown as you get now with the zero page, I think.

The real fix, also for that case, would be to have a stable tree for
each possible page color, but that is not the same as having a stable
tree for each NUMA node (which is already implemented). There are
likely too many colors (even if the cache is not fully associative), so
you'd penalize the "compression" ratio if you were to implement a
generic fix that doesn't single out the zero page.

I've never been particularly excited about optimizing bad apps that
initialize a lot of RAM to zero when they could depend on the mmap
behavior instead and get zero pages in the first place (or bad guest
OSes).

Overall the main reason why I'm quite positive about adding this as an
optimization is that, after reading it (even if it's not complete), it
seems non-intrusive enough and some corner cases may benefit. But if we
do it, we can as well leave it available on all archs, so it's easier
to test and reproduce any problem too.

Thanks,
Andrea
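P.S. For completeness, the cksum shortcut discussed above could sit
roughly where cmp_and_merge_page() already computes the checksum, along
these lines (again only an untested sketch using the existing mm/ksm.c
helpers as far as I remember them; zero_checksum would be precomputed
once, e.g. from calc_checksum(ZERO_PAGE(0)), and use_zero_pages stands
for whatever the sysctl knob ends up being called):

        checksum = calc_checksum(page);
        if (use_zero_pages && checksum == zero_checksum) {
                struct mm_struct *mm = rmap_item->mm;
                struct vm_area_struct *vma;

                down_read(&mm->mmap_sem);
                vma = find_vma(mm, rmap_item->address);
                if (vma && rmap_item->address >= vma->vm_start &&
                    (vma->vm_flags & VM_MERGEABLE) &&
                    !try_to_merge_one_page(vma, page,
                                           ZERO_PAGE(rmap_item->address))) {
                        /*
                         * Merged with the (colored) zero page: the
                         * rmap_item never gets linked into a stable
                         * node, so no stable tree metadata accumulates
                         * for the empty pages.
                         */
                        up_read(&mm->mmap_sem);
                        return;
                }
                up_read(&mm->mmap_sem);
                /*
                 * A cksum false positive only costs the one extra
                 * memcmp done by pages_identical() inside
                 * try_to_merge_one_page(); we then fall through to the
                 * normal unstable/stable tree processing.
                 */
        }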