Hi.

On Mon, May 13, 2019 at 01:38:43PM +0300, Kirill Tkhai wrote:
> On 10.05.2019 10:21, Oleksandr Natalenko wrote:
> > By default, KSM works only on memory that is marked by madvise(). And the
> > only way to get around that is to either:
> >
> > * use LD_PRELOAD; or
> > * patch the kernel with something like UKSM or PKSM.
> >
> > Instead, lets implement a so-called "always" mode, which allows marking
> > VMAs as mergeable on do_anonymous_page() call automatically.
> >
> > The submission introduces a new sysctl knob as well as kernel cmdline option
> > to control which mode to use. The default mode is to maintain old
> > (madvise-based) behaviour.
> >
> > Due to security concerns, this submission also introduces VM_UNMERGEABLE
> > vmaflag for apps to explicitly opt out of automerging. Because of adding
> > a new vmaflag, the whole work is available for 64-bit architectures only.
> >
> > This patchset is based on earlier Timofey's submission [1], but it doesn't
> > use dedicated kthread to walk through the list of tasks/VMAs.
> >
> > For my laptop it saves up to 300 MiB of RAM for usual workflow (browser,
> > terminal, player, chats etc). Timofey's submission also mentions
> > containerised workload that benefits from automerging too.
>
> This all approach looks complicated for me, and I'm not sure the shown profit
> for desktop is big enough to introduce contradictory vma flags, boot option
> and advance page fault handler. Also, 32/64bit defines do not look good for
> me. I had tried something like this on my laptop some time ago, and
> the result was bad even in absolute (not in memory percentage) meaning.
> Isn't LD_PRELOAD trick enough to desktop? Your workload is same all the time,
> so you may statically insert correct preload to /etc/profile and replace
> your mmap forever.
>
> Speaking about containers, something like this may have a sense, I think.
> The probability of that several containers have the same pages are higher,
> than that desktop applications have the same pages; also LD_PRELOAD for
> containers is not applicable.

Yes, I get your point. But the intention is to avoid another hacky trick
(LD_PRELOAD), thus *something* should *preferably* be done on the kernel
level instead.

> But 1)this could be made for trusted containers only (are there similar
> issues with KSM like with hardware side-channel attacks?!);

Regarding side-channel attacks, yes, I think so. Was it the OpenSSL guys
who complained about it?

> 2) the most
> shared data for containers in my experience is file cache, which is not
> supported by KSM.
>
> There are good results by the link [1], but it's difficult to analyze
> them without knowledge about what happens inside them there.
>
> Some of tests have "VM" prefix. What the reason the hypervisor don't mark
> their VMAs as mergeable? Can't this be fixed in hypervisor? What is the
> generic reason that VMAs are not marked in all the tests?

Timofey, could you please address this?

Also, just for the sake of another data point:

$ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
526

> In case of there is a fundamental problem of calling madvise, can't we
> just implement an easier workaround like a new write-only file:
>
> #echo $task > /sys/kernel/mm/ksm/force_madvise
>
> which will mark all anon VMAs as mergeable for a passed task's mm?
>
> A small userspace daemon may write mergeable tasks there from time to time.
>
> Then we won't need to introduce additional vm flags and to change
> anon pagefault handler, and the changes will be small and only
> related to mm/ksm.c, and good enough for both 32 and 64 bit machines.

Yup, looks appealing. Two concerns, though:

1) we are falling back to scanning through the list of tasks (I guess
this is what we wanted to avoid, although this time it happens in
userspace);

2) what kinds of opt-out should we maintain?
Like, what if force_madvise is called, but the task doesn't want some
VMAs to be merged? This will require a new flag anyway, it seems. And
should there be another write-only file to forcibly unmerge everything
for a specific task?

Thanks.

P.S. Cc'ing Pavel properly this time.

-- 
  Best regards,
    Oleksandr Natalenko (post-factum)
    Senior Software Maintenance Engineer