Hi. On Mon, May 13, 2019 at 03:37:56PM +0300, Kirill Tkhai wrote: > > Yes, I get your point. But the intention is to avoid another hacky trick > > (LD_PRELOAD), thus *something* should *preferably* be done on the > > kernel level instead. > > I don't think so. Does userspace hack introduce some overhead? It does not > look so. Why should we think about mergeable VMAs in page fault handler?! > This is the last thing we want to think in page fault handler. > > Also, there is difficult synchronization in page fault handlers, and it's > easy to make a mistake. So, there is a mistake in [3/4], and you call > ksm_enter() with mmap_sem read locked, while normal way is to call it > with write lock (see madvise_need_mmap_write()). > > So, let's don't touch this path. Small optimization for unlikely case will > introduce problems in optimization for likely case in the future. Yup, you're right, I've missed the fact that write lock is needed there. Re-vamping locking there is not my intention, so lets find another solution. > > Also, just for the sake of another piece of stats here: > > > > $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc > > 526 > > This all requires attentive analysis. The number looks pretty big for me. > What are the pages you get merged there? This may be just zero pages, > you have identical. > > E.g., your browser want to work fast. It introduces smart schemes, > and preallocates many pages in background (mmap + write 1 byte to a page), > so in further it save some time (no page fault + alloc), when page is > really needed. But your change merges these pages and kills this > optimization. Sounds not good, does this? > > I think, we should not think we know and predict better than application > writers, what they want from kernel. Let's people decide themselves > in dependence of their workload. The only exception is some buggy > or old applications, which impossible to change, so force madvise > workaround may help. But only in case there are really such applications... > > I'd researched what pages you have duplicated in these 526 MB. Maybe > you find, no action is required or a report to userspace application > to use madvise is needed. OK, I agree, this is a good argument to move decision to userspace. > > 2) what kinds of opt-out we should maintain? Like, what if force_madvise > > is called, but the task doesn't want some VMAs to be merged? This will > > required new flag anyway, it seems. And should there be another > > write-only file to unmerge everything forcibly for specific task? > > For example, > > Merge: > #echo $task > /sys/kernel/mm/ksm/force_madvise Immediate question: what should be actually done on this? I see 2 options: 1) mark all VMAs as mergeable + set some flag for mmap() to mark all further allocations as mergeable as well; 2) just mark all the VMAs as mergeable; userspace can call this periodically to mark new VMAs. My prediction is that 2) is less destructive, and the decision is preserved predominantly to userspace, thus it would be a desired option. > Unmerge: > #echo -$task > /sys/kernel/mm/ksm/force_madvise Okay. > In case of task don't want to merge some VMA, we just should skip it at all. This way we lose some flexibility, IMO, but I get you point. Thanks. -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer