Hi Andrea,

On Wed, Sep 21, 2016 at 11:34 PM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> Hello Gavin,
>
> On Wed, Sep 21, 2016 at 11:12:19PM +0800, Gavin Guo wrote:
>> Recently, a similar bug can also be observed with the numad process
>> on the v4.4 Ubuntu kernel and on the latest upstream kernel. However,
>> I think the patch should be useful to mitigate the symptom. I searched
>> the mailing list and found that the patch was ultimately not merged
>> into the upstream kernel. Were there any problems that caused the
>> patch to be dropped?
>
> Zero known problems; in fact it has been running in production in both
> RHEL7 and RHEL6 for a while, and the RHEL customers have not been
> affected for a while now.
>
> It's a critical computational-complexity fix if you use KSM in
> enterprise production. Hugh already Acked it as well.
>
> It's included in -mm and Andrew submitted it upstream once, but it
> bounced, probably because it was not the right time in the merge
> window cycle.
>
> Or perhaps because it's complex, but I wouldn't know how to simplify
> it, and there's no bug at all in the code.
>
> I would suggest that Andrew send it once again when he feels it's a
> good time to do so.
>
>> The numad process tried to migrate a qemu process with 33GB of
>> memory. It eventually got stuck in the csd_lock_wait function, which
>> caused the qemu process to hang, and the virtual machine showed high
>> CPU usage and hung as well. With KSM disabled, the symptom
>> disappeared.
>
> Until it's merged upstream you can cherry-pick these three commits
> from my aa.git tree:
>
> https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=9384142e4ce830898abcefc4f0479c4533fa5bbc
> https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=4b293be7e20c8e8731a4fdc3c3bf6047304d0cc8
> https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=44c0d79c2c223c54ffe3fabc893963fc5963d611
>
> They're in -mm too.
>
>> What happens here is that do_migrate_pages (frame #10) acquires the
>> mmap_sem semaphore that everything else is waiting for (and that
>> eventually produces the hang warnings), and it holds that semaphore
>> for the duration of the page migration.
>>
>> crash> bt 2950
>> PID: 2950  TASK: ffff885f97745280  CPU: 49  COMMAND: "numad"
>>     [exception RIP: smp_call_function_single+219]
>>     RIP: ffffffff81103a0b  RSP: ffff885f8fb4fb28  RFLAGS: 00000202
>>     RAX: 0000000000000000  RBX: 0000000000000013  RCX: 0000000000000000
>>     RDX: 0000000000000003  RSI: 0000000000000100  RDI: 0000000000000286
>>     RBP: ffff885f8fb4fb70   R8: 0000000000000000   R9: 0000000000080000
>>     R10: 0000000000000000  R11: ffff883faf917c88  R12: ffffffff810725f0
>>     R13: 0000000000000013  R14: ffffffff810725f0  R15: ffff885f8fb4fbc8
>>     CS: 0010  SS: 0018
>>  #0 [ffff885f8fb4fb30] kvm_unmap_rmapp at ffffffffc01f1c3e [kvm]
>>  #1 [ffff885f8fb4fb78] smp_call_function_many at ffffffff81103db3
>>  #2 [ffff885f8fb4fbc0] native_flush_tlb_others at ffffffff8107279d
>>  #3 [ffff885f8fb4fc08] flush_tlb_page at ffffffff81072a95
>>  #4 [ffff885f8fb4fc30] ptep_clear_flush at ffffffff811d048e
>>  #5 [ffff885f8fb4fc60] try_to_unmap_one at ffffffff811cb1c7
>>  #6 [ffff885f8fb4fcd0] rmap_walk_ksm at ffffffff811e6f91
>>  #7 [ffff885f8fb4fd28] rmap_walk at ffffffff811cc1bf
>>  #8 [ffff885f8fb4fd80] try_to_unmap at ffffffff811cc46b
>>  #9 [ffff885f8fb4fdc8] migrate_pages at ffffffff811f26d8
>> #10 [ffff885f8fb4fe80] do_migrate_pages at ffffffff811e15f7
>> #11 [ffff885f8fb4fef8] sys_migrate_pages at ffffffff811e187d
>> #12 [ffff885f8fb4ff50] entry_SYSCALL_64_fastpath at ffffffff818244f2
>>
>> After some investigation, I disassembled the coredump and found that
>> the stable_node->hlist is 2306920 entries long.
>
> Yep, this is definitely fixed by the three commits above, and the
> problem is in rmap_walk_ksm, as you found. With them applied you can't
> run into these hangs anymore with KSM enabled, no matter the workload
> and the amount of memory in the guest and host.
>
> numad isn't required to reproduce it; some swapping is enough.
>
> The fix limits the de-duplication factor to 256, i.e. a x256
> compression factor, which is clearly more than enough. So the overly
> long list you found effectively gets hard-limited to 256 entries with
> my patch applied. The limit is configurable at runtime:
>
> /* Maximum number of page slots sharing a stable node */
> static int ksm_max_page_sharing = 256;
>
> If you want to increase the limit (careful: that will increase the
> rmap_walk_ksm computation time) you can
> echo $newsharinglimit > /sys/kernel/mm/ksm/max_page_sharing.
>
> Hope this helps,
> Andrea

Thank you for the detailed explanation. I've cherry-picked these
patches (roughly as sketched below) and am now doing the verification;
as a sanity check I also plan to keep an eye on the KSM sysfs counters
(see the second sketch). I'll get back to you if there is any problem.
Thanks!
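For reference, here is roughly how I applied the three commits on top
of our v4.4-based tree. The clone URL is only my inference from the
cgit links above, and the commits may need conflict resolution against
v4.4, so treat this as a sketch rather than an exact recipe:

  # Add Andrea's aa.git tree and fetch it (URL inferred from the cgit
  # links above; adjust if the actual remote path differs).
  git remote add aa git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
  git fetch aa

  # Pick the three KSM commits in the order listed above; resolve any
  # conflicts that show up against the v4.4 base.
  git cherry-pick 9384142e4ce830898abcefc4f0479c4533fa5bbc \
                  4b293be7e20c8e8731a4fdc3c3bf6047304d0cc8 \
                  44c0d79c2c223c54ffe3fabc893963fc5963d611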
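For the runtime side, this is the kind of quick check I have in mind,
assuming the cherry-picked patches expose
/sys/kernel/mm/ksm/max_page_sharing as you describe; pages_shared and
pages_sharing are the long-standing KSM counters:

  #!/bin/sh
  # Rough sanity check of KSM sharing on the patched kernel.
  cd /sys/kernel/mm/ksm || exit 1

  # max_page_sharing only exists once the cherry-picked patches are applied.
  echo "max_page_sharing: $(cat max_page_sharing 2>/dev/null || echo not-present)"
  echo "pages_shared:     $(cat pages_shared)"
  echo "pages_sharing:    $(cat pages_sharing)"

  # pages_sharing / pages_shared gives the average de-duplication factor;
  # with the patches the per-stable-node worst case is capped at
  # max_page_sharing (256 by default).
  awk -v shared="$(cat pages_shared)" -v sharing="$(cat pages_sharing)" \
      'BEGIN { if (shared > 0) printf "avg sharing ratio: %.1f\n", sharing / shared }'

If anything looks off after the patches are in, I'll include these
numbers in my follow-up.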