at 4:08 PM, Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> wrote:

> On 6/19/18 3:17 PM, Nadav Amit wrote:
>> at 4:34 PM, Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> wrote:
>>
>>> When running some mmap/munmap scalability tests with large memory
>>> (i.e. > 300GB), the below hung task issue may happen occasionally.
>>>
>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>> message.
>>> ps D 0 14018 1 0x00000004
>>>
>> (snip)
>>
>>> Zapping pages is the most time consuming part, according to the
>>> suggestion from Michal Hock [1], zapping pages can be done with holding
>>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
>>> mmap_sem to manipulate vmas.
>>
>> Does munmap() == MADV_DONTNEED + munmap() ?
>
> Not exactly the same. So, I basically copied the page zapping used by
> munmap instead of calling MADV_DONTNEED.
>
>> For example, what happens with userfaultfd in this case? Can you get an
>> extra #PF, which would be visible to userspace, before the munmap is
>> finished?
>
> userfaultfd is handled by regular munmap path. So, no change to
> userfaultfd part.

Right. I see it now.

>> In addition, would it be ok for the user to potentially get a zeroed page in
>> the time window after the MADV_DONTNEED finished removing a PTE and before
>> the munmap() is done?
>
> This should be undefined behavior according to Michal. This has been
> discussed in https://lwn.net/Articles/753269/.

Thanks for the reference.

Reading the man page I see: "All pages containing a part of the indicated range
are unmapped, and subsequent references to these pages will generate
SIGSEGV."

To me it sounds pretty well-defined, and this implementation does not follow
this definition. I would expect the man page to be updated and indicate that
the behavior has changed.

Regards,
Nadav
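
For reference, the man page wording quoted above can be checked from
userspace with something like the sketch below. It is not part of the patch
or of this thread: the function name access_after_munmap_faults() is
hypothetical, and the program is single-threaded, so it only observes the
state after munmap() has returned. It does not exercise the window in
question, where another thread touches the range while munmap() is still in
progress; that would need a second thread racing with the unmap.

```c
#define _GNU_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

/* SIGSEGV handler: jump back to the probe rather than retrying the access. */
static void on_segv(int sig)
{
	(void)sig;
	siglongjmp(fault_env, 1);
}

/*
 * Hypothetical helper (illustration only): map one anonymous page, dirty
 * it, munmap() it, then read it again.
 *
 * Returns 1 if the read raised SIGSEGV, 0 if the read succeeded,
 * and -1 if the setup failed.
 */
int access_after_munmap_faults(void)
{
	struct sigaction sa, old;
	long page = sysconf(_SC_PAGESIZE);
	volatile char *p;
	void *addr;
	int faulted;

	if (page <= 0)
		return -1;

	addr = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;

	p = addr;
	p[0] = 42;	/* make sure the page is actually populated */

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_segv;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old) != 0) {
		munmap(addr, (size_t)page);
		return -1;
	}

	if (munmap(addr, (size_t)page) != 0) {
		sigaction(SIGSEGV, &old, NULL);
		return -1;
	}

	/* savemask=1 so SIGSEGV is unblocked again after siglongjmp(). */
	if (sigsetjmp(fault_env, 1) == 0) {
		(void)p[0];	/* volatile read of the unmapped page */
		faulted = 0;
	} else {
		faulted = 1;
	}

	sigaction(SIGSEGV, &old, NULL);
	return faulted;
}
```

Calling access_after_munmap_faults() from a small main() and printing the
result is enough to use it. Under the man page semantics quoted above, the
read after munmap() is expected to fault, so the helper should return 1; a
return of 0 would mean the access succeeded after the unmap.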