Adding linux-mm too,

On Thu, Apr 14, 2016 at 01:34:41PM +0100, Dr. David Alan Gilbert wrote:
> * Andrea Arcangeli (aarcange@xxxxxxxxxx) wrote:
>
> > The next suspect is the massive THP refcounting change that went
> > upstream recently:
>
> > As further debug hint, can you try to disable THP and see if that
> > makes the problem go away?
>
> Yep, this seems to be the problem (cc'ing in Kirill).
>
> 122afea9626ab3f717b250a8dd3d5ebf57cdb56c - works (just before Kirill disables THP)
> 61f5d698cc97600e813ca5cf8e449b1ea1c11492 - breaks (when THP is reenabled)
>
> It's pretty reliable; as you say disabling THP makes it work again
> and putting it back to THP/madvise mode makes it break. And you need
> to test on a machine with some free ram to make sure THP has a chance
> to have happened.
>
> I'm not sure of all of the rework that happened in that series,
> but my reading of it is that splitting of THP pages gets deferred;
> so I wonder if when I do the madvise to turn THP off, if it's actually
> still got THP pages and thus we end up with a whole THP mapped
> when I'm expecting to be userfaulting those pages.

Good thing at least I didn't make UFFDIO_COPY THP aware yet, so there
are fewer variables. No user was interested in handling userfaults at
THP granularity yet, and from userland such an improvement would be
completely invisible in terms of API, so if a user starts doing that we
can just optimize the kernel for it. criu restore could do that, as the
faults will come from disk I/O; when the network is involved, THP
userfaults wouldn't be a great tradeoff because of the increased fault
latency.

I suspect there is a handle_userfault missing somewhere in connection
with the trans_huge_pmd splits (no longer THP splits) that you're doing
with MADV_DONTNEED to zap those pages in the destination that got
redirtied in the source during the last precopy stage. Or more simply,
MADV_DONTNEED isn't zapping all the right ptes after the trans huge pmd
got split.
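
To make the zapping half of that concrete, here is a minimal userland
sketch, assuming Linux, Python 3.8+ and a 2 MiB THP size. It does not
register a userfaultfd, so it cannot show handle_userfault firing; it
only checks that MADV_DONTNEED on a single 4k page inside a
THP-eligible anonymous range makes that page read back as zero while
its neighbours keep their data. The names zap_one_page and HPAGE are
illustrative, not taken from the kernel selftest or from qemu.

```python
import mmap

PAGE = mmap.PAGESIZE
HPAGE = 2 * 1024 * 1024  # assumption: 2 MiB THP size (e.g. x86-64)


def zap_one_page(offset=HPAGE):
    """Dirty a THP-eligible anonymous region, zap one small page with
    MADV_DONTNEED, and report (zapped, neighbours_intact)."""
    # 4 MiB guarantees the page at offset HPAGE lies inside an aligned
    # 2 MiB range, wherever the kernel places the mapping.
    length = 2 * HPAGE
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        try:
            buf.madvise(mmap.MADV_HUGEPAGE)  # ask for THP (madvise mode)
        except (OSError, AttributeError):
            pass  # THP not available here; the check still runs on 4k pages
        # Touch every small page so the range is populated and dirty.
        for off in range(0, length, PAGE):
            buf[off] = 0xAB
        try:
            buf.madvise(mmap.MADV_NOHUGEPAGE)  # "turn THP off" before zapping
        except (OSError, AttributeError):
            pass
        # Zap exactly one small page in the middle of the range.
        buf.madvise(mmap.MADV_DONTNEED, offset, PAGE)
        zapped = buf[offset] == 0
        neighbours_intact = (buf[offset - PAGE] == 0xAB
                             and buf[offset + PAGE] == 0xAB)
        return zapped, neighbours_intact
    finally:
        buf.close()


if __name__ == "__main__":
    print(zap_one_page())
```

The sketch does not verify that the range was really backed by a huge
page; checking AnonHugePages in /proc/self/smaps before the zap would
be one way to confirm that. A real reproducer would additionally
register the range with userfaultfd and expect a fault on the zapped
page.
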
The fact the page isn't split shouldn't matter too much; all we care
about is that the pte triggers handle_userfault after MADV_DONTNEED.

The userfaultfd testcase in the kernel isn't exercising this case
unfortunately. That should probably be improved too, so there is a
simpler way to reproduce than running precopy before postcopy in qemu.

Thanks,
Andrea