On Fri, Oct 28, 2016 at 02:26:03PM +0800, Gavin Guo wrote:
> I have tried verifying these patches. However, the default 256
> bytes max_page_sharing still suffers the hung task issue. Then, the
> following sequence has been tried to mitigate the symptom. When the
> value is decreased, it took more time to reproduce the symptom.
> Finally, the value 8 has been tried and I didn't continue with lower
> values.
>
> 128 -> 64 -> 32 -> 16 -> 8
>
> The crashdump has also been investigated.

You should try to get multiple sysrq+l too during the hang.

> stable_node: 0xffff880d36413040 stable_node->hlist->first = 0xffff880e4c9f4cf0
> crash> list hlist_node.next 0xffff880e4c9f4cf0 > rmap_item.lst
>
> $ wc -l rmap_item.lst
> 8 rmap_item.lst
>
> This shows that the list is actually reduced to 8 items. I wondered if
> the loop is still consuming a lot of time and holds the mmap_sem too
> long.

Even the default 256 should be enough (certainly with KVM, which
doesn't have a deep anon_vma interval tree). Perhaps this is an app
with a massively large anon_vma interval tree that uses MADV_MERGEABLE,
and not qemu/kvm? However, then you'd run into similar issues with anon
page rmap walks, so KSM wouldn't be to blame.

The depth of the rmap_item list multiplies the cost of the rbtree walk
512 times, but still it shouldn't freeze for seconds.

The important thing here is that the app is in control of the max depth
of the anon_vma interval tree, while it's not in control of the max
depth of the rmap_item list. This is why it's fundamental that the KSM
rmap_item list is bounded to a max value, while the depth of the
interval tree is a secondary issue, because userland has a chance to
optimize for it: if the app forks deeply and uses MADV_MERGEABLE, that
is possible to optimize in userland. But I guess the app that is using
MADV_MERGEABLE is qemu/kvm for you too, so it can't be an overly long
interval tree.

Furthermore, if you still get a long hang when the symptom triggers
even with an rmap_item depth of 8, and it just takes longer to reach
the hanging point, it may be something else.

I assume this is not an upstream kernel; can you reproduce it on the
upstream kernel? Sorry, but I can't help you any further if this isn't
first verified on the upstream kernel.

Also, if you test on the upstream kernel, you can leave the default
value of 256 and then use sysrq+l to get multiple dumps of what's
running on the CPUs. The crash dump is useful as well, but it's also
interesting to see what's running most frequently during the hang
(which isn't guaranteed to be shown by the exact point in time at
which the crash dump is taken).

perf top -g may also help to see where most CPU time is being burnt,
if this is a computational complexity issue inside the kernel.

Note that the problem was reproduced and verified as fixed. It's quite
easy to reproduce: I used the migrate_pages syscall to do that, and
after the deep KSM merging it takes several seconds in strace -tt,
while with the fix it stays in the order of milliseconds. The point is
that with deeper merging, migrate_pages could take minutes in
unkillable R state (or during swapping), while with the KSMscale fix
it gets capped to milliseconds no matter what.

Thanks,
Andrea
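
P.S. Something along these lines should be enough to reproduce it on
the upstream kernel. It's only a sketch, not the exact program used
above: it assumes a two-node NUMA machine with ksmd running
(/sys/kernel/mm/ksm/run set to 1), and the region size, fill pattern,
node numbers and sleep are arbitrary. The idea is just to let ksmd
merge a lot of identical pages into the same stable_node before timing
migrate_pages(), either with the built-in timing or under strace -tt,
with and without the KSMscale fix.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#define SIZE (1UL << 30)        /* 1G of identical anon pages for ksmd to merge */

int main(void)
{
        char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;
        /* let ksmd build a deep rmap_item list on the stable_node(s) */
        madvise(p, SIZE, MADV_MERGEABLE);
        memset(p, 0xaa, SIZE);

        /* give ksmd time to merge, watch /sys/kernel/mm/ksm/pages_sharing */
        sleep(60);

        /* move everything from node 0 to node 1 and time it */
        unsigned long old_nodes = 1UL << 0;
        unsigned long new_nodes = 1UL << 1;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        long ret = syscall(SYS_migrate_pages, getpid(),
                           8 * sizeof(unsigned long),
                           &old_nodes, &new_nodes);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("migrate_pages() = %ld, took %.3f sec\n", ret,
               (t1.tv_sec - t0.tv_sec) +
               (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
}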