Hello Gavin, On Wed, Sep 21, 2016 at 11:12:19PM +0800, Gavin Guo wrote: > Recently, a similar bug can also be observed under the numad process > with the v4.4 Ubuntu kernel or the latest upstream kernel. However, I > think the patch should be useful to mitigate the symptom. I tried to > search the mailing list and found the patch finally didn't be merged > into the upstream kernel. If there are any problems which drop the > patch? Zero known problems, in fact it's running in production in both RHEL7 and RHEL6 for a while. The RHEL customers are not affected anymore for a while now. It's a critical computational complexity fix, if using KSM in enterprise production. Hugh already Acked it as well. It's included in -mm and Andrew submitted it once upstream, but it bounced probably perhaps it was not the right time in the merge window cycle. Or perhaps because it's complex but I wouldn't know how to simplify it but there's no bug at all in the code. I would suggest Andrew to send it once again when he feels it's a good time to do so. > The numad process tried to migrate a qemu process of 33GB memory. > Finally, it stuck in the csd_lock_wait function which causes the qemu > process hung and the virtual machine has high CPU usage and hung also. > With KSM disabled, the symptom disappeared. Until it's merged upstream you can cherrypick from my aa.git tree these three commits: https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=9384142e4ce830898abcefc4f0479c4533fa5bbc https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=4b293be7e20c8e8731a4fdc3c3bf6047304d0cc8 https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=44c0d79c2c223c54ffe3fabc893963fc5963d611 They're in -mm too. > What happens here is that do_migrate_pages (frame #10) acquires the > mmap_sem semaphore that everything else is waiting for (and that > eventually produce the hang warnings), and it holds that semaphore for > the duration of the page migration. > > crash> bt 2950 > PID: 2950 TASK: ffff885f97745280 CPU: 49 COMMAND: "numad" > [exception RIP: smp_call_function_single+219] > RIP: ffffffff81103a0b RSP: ffff885f8fb4fb28 RFLAGS: 00000202 > RAX: 0000000000000000 RBX: 0000000000000013 RCX: 0000000000000000 > RDX: 0000000000000003 RSI: 0000000000000100 RDI: 0000000000000286 > RBP: ffff885f8fb4fb70 R8: 0000000000000000 R9: 0000000000080000 > R10: 0000000000000000 R11: ffff883faf917c88 R12: ffffffff810725f0 > R13: 0000000000000013 R14: ffffffff810725f0 R15: ffff885f8fb4fbc8 > CS: 0010 SS: 0018 > #0 [ffff885f8fb4fb30] kvm_unmap_rmapp at ffffffffc01f1c3e [kvm] > #1 [ffff885f8fb4fb78] smp_call_function_many at ffffffff81103db3 > #2 [ffff885f8fb4fbc0] native_flush_tlb_others at ffffffff8107279d > #3 [ffff885f8fb4fc08] flush_tlb_page at ffffffff81072a95 > #4 [ffff885f8fb4fc30] ptep_clear_flush at ffffffff811d048e > #5 [ffff885f8fb4fc60] try_to_unmap_one at ffffffff811cb1c7 > #6 [ffff885f8fb4fcd0] rmap_walk_ksm at ffffffff811e6f91 > #7 [ffff885f8fb4fd28] rmap_walk at ffffffff811cc1bf > #8 [ffff885f8fb4fd80] try_to_unmap at ffffffff811cc46b > #9 [ffff885f8fb4fdc8] migrate_pages at ffffffff811f26d8 > #10 [ffff885f8fb4fe80] do_migrate_pages at ffffffff811e15f7 > #11 [ffff885f8fb4fef8] sys_migrate_pages at ffffffff811e187d > #12 [ffff885f8fb4ff50] entry_SYSCALL_64_fastpath at ffffffff818244f2 > > After some investigations, I've tried to disassemble the coredump and > finally find the stable_node->hlist is as long as 2306920 entries. Yep, this is definitely getting fixed by the three commits above and the problem is in rmap_walk_ksm like you found above. With that applied you can't ever run into hangs anymore with KSM enabled, no matter the workload and the amount of memory in guest and host. numad isn't required to reproduce it, some swapping is enough. It limits the de-duplication factor to 256 times, like a x256 times compression, a x256 compression factor is clearly more than enough. So effectively the list you found that was too long, gets hard-limited to 256 entries with my patch applied. The limit is configurable at runtime: /* Maximum number of page slots sharing a stable node */ static int ksm_max_page_sharing = 256; If you want to increase the limit (careful: that will increase the rmap_walk_ksm computation time) you can echo $newsharinglimit > /sys/kernel/mm/ksm/max_page_sharing. Hope this helps, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>