On 10/3/19 5:21 PM, Peter Zijlstra wrote:
On Thu, Oct 03, 2019 at 09:11:45AM +0200, Peter Zijlstra wrote:
On Wed, Oct 02, 2019 at 10:33:15PM -0300, Leonardo Bras wrote:
....
And I still think all that wrong, you really shouldn't need to wait on
munmap().
I do have a patch that does something like that.
+#define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR_FULL
+static inline pmd_t pmdp_huge_get_and_clear_full(struct mm_struct *mm,
+ unsigned long address, pmd_t *pmdp,
+ int full)
+{
+ bool serialize = true;
+ /*
+ * We don't need to serialize against a lockless page table walk if
+ * we are clearing the pmd due to task exit. For regular munmap, we
+ * still need to serialize due to the possibility of MADV_DONTNEED running
+ * in parallel with a page fault, which can convert a THP pte entry to a
+ * pointer to level 4 table.
+ * Here MADV_DONTNEED is removing the THP entry and the fault is filling
+ * a level 4 pte.
+ */
+ if (full == 1)
+ serialize = false;
+ return __pmdp_huge_get_and_clear(mm, address, pmdp, serialize);
+}
If it is a fullmm flush we can skip the serialization, but for everything
else we need to serialize. MADV_DONTNEED is another case that needs it. I
haven't sent this yet, because I was trying to look at what it takes to
switch that MADV variant to take mmap_sem in write mode.
MADV_DONTNEED has caused us multiple issues due to the fact that it can
run in parallel with a page fault. I am not sure whether we have a
known/noticeable performance gain from allowing that with mmap_sem held in
read mode.
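
For anyone following along, here is a minimal user-space sketch of the
MADV_DONTNEED behaviour in question: on a private anonymous mapping the
pages are dropped, and the next access takes a fresh page fault that
fills in a zero page. The function name is made up for illustration and
is not part of the patch. The sketch is single-threaded, so it does not
exercise the race itself; it only shows the two operations (the madvise
call and the refault) that can overlap when mmap_sem is held in read mode.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if a private anonymous page reads back as zero after
 * MADV_DONTNEED, 0 if it does not, and -1 on a setup failure.
 */
int dontneed_refaults_zero_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t len, i;
	unsigned char *p;
	int zeroed = 1;

	assert(page_size > 0);
	len = (size_t)page_size;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* First touch: faults the page in and dirties it. */
	memset(p, 0xaa, len);

	/* Drop the page; the mapping itself stays in place. */
	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	/* These reads refault the range and see a zero-filled page. */
	for (i = 0; i < len; i++) {
		if (p[i] != 0) {
			zeroed = 0;
			break;
		}
	}

	munmap(p, len);
	return zeroed;
}
```

Calling dontneed_refaults_zero_page() from a small main() should return 1
on Linux. With a second thread touching the range while the first calls
madvise(), the clear and the refault can overlap, which is the case the
serialize is there for.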
-aneesh