On Wed, Jan 12, 2022 at 7:37 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
> > Hi Minchan Kim,
> >
> > Thanks for handling the hard questions! :)
> >
> > On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> > > > Yu Zhao <yuzhao@xxxxxxxxxx> writes:
> > > >
> > > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> > > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > > > >> index 163ac4e6bcee..8671de473c25 100644
> > > > >> --- a/mm/rmap.c
> > > > >> +++ b/mm/rmap.c
> > > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > > >>
> > > > >>  		/* MADV_FREE page check */
> > > > >>  		if (!PageSwapBacked(page)) {
> > > > >> -			if (!PageDirty(page)) {
> > > > >> +			int ref_count = page_ref_count(page);
> > > > >> +			int map_count = page_mapcount(page);
> > > > >> +
> > > > >> +			/*
> > > > >> +			 * The only page refs must be from the isolation
> > > > >> +			 * (checked by the caller shrink_page_list() too)
> > > > >> +			 * and one or more rmap's (dropped by discard:).
> > > > >> +			 *
> > > > >> +			 * Check the reference count before dirty flag
> > > > >> +			 * with memory barrier; see __remove_mapping().
> > > > >> +			 */
> > > > >> +			smp_rmb();
> > > > >> +			if ((ref_count - 1 == map_count) &&
> > > > >> +			    !PageDirty(page)) {
> > > > >>  				/* Invalidate as we cleared the pte */
> > > > >>  				mmu_notifier_invalidate_range(mm,
> > > > >>  					address, address + PAGE_SIZE);
> > > > >
> > > > > Out of curiosity, how does it work with COW in terms of reordering?
> > > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> > > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> > > > > is seen first, and direct io is holding a refcnt, this check can still
> > > > > pass?
> > > >
> > > > I think that you are correct.
> > > >
> > > > After more thoughts, it appears very tricky to compare page count and
> > > > map count. Even if we have added smp_rmb() between page_ref_count() and
> > > > page_mapcount(), an interrupt may happen between them. During the
> > > > interrupt, the page count and map count may be changed, for example,
> > > > unmapped, or do_swap_page().
> > >
> > > Yeah, it happens but what specific problem are you concerning from the
> > > count change under race? The fork case Yu pointed out was already known
> > > for breaking DIO so user should take care not to fork under DIO(Please
> > > look at O_DIRECT section in man 2 open). If you could give a specific
> > > example, it would be great to think over the issue.
> > >
> > > I agree it's little tricky but it seems to be way other place has used
> > > for a long time(Please look at write_protect_page in ksm.c).
> >
> > Ah, that's great to see it's being used elsewhere, for DIO particularly!
> >
> > > So, here what we missing is tlb flush before the checking.
> >
> > That shouldn't be required for this particular issue/case, IIUIC.
> > One of the things we checked early on was disabling deferred TLB flush
> > (similarly to what you've done), and it didn't help with the issue; also, the
> > issue happens on uniprocessor mode too (thus no remote CPU involved.)
>
> I guess you didn't try it with page_mapcount + 1 == page_count at that
> time? Anyway, I agree we don't need TLB flush here like KSM.

Sorry, I fail to understand how the page (map) count and TLB flush
would be related.

(I realize you and Yu Zhao already confirmed the TLB flush is not
needed/expected to fix the issue too; but just for my own education,
if you have a chance.)

> I think the reason KSM is doing TLB flush before the check is to
> make sure trap trigger on the write from userprocess in other core.
> However, this MADV_FREE case, HW already guarantees the trap.

Understood.

> Please see below.
>
> >
> > >
> >
> > > Something like this.
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index b0fd9dc19eba..b4ad9faa17b2 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >
> > >  		/* MADV_FREE page check */
> > >  		if (!PageSwapBacked(page)) {
> > > -			int refcount = page_ref_count(page);
> > > -
> > > -			/*
> > > -			 * The only page refs must be from the isolation
> > > -			 * (checked by the caller shrink_page_list() too)
> > > -			 * and the (single) rmap (dropped by discard:).
> > > -			 *
> > > -			 * Check the reference count before dirty flag
> > > -			 * with memory barrier; see __remove_mapping().
> > > -			 */
> > > -			smp_rmb();
> > > -			if (refcount == 2 && !PageDirty(page)) {
> > > +			if (!PageDirty(page) &&
> > > +			    page_mapcount(page) + 1 == page_count(page)) {
> >
> > In the interest of avoiding a different race/bug, it seemed worth following the
> > suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
> > after the page's reference count, with a memory barrier in between.
>
> True so it means your patch as-is is good for me.

That's good news! Thanks for all your help, review, and discussion so
far; it's been very educational.

I see Yu Zhao mentioned a possible concern/suggestion with additional
memory barriers elsewhere. I'll try and dig to understand/check that in
more detail and follow up.

>
> >
> > I'm not familiar with the details of the original issue behind that code change,
> > but it seemed to be possible here too, particularly as writes from user-space
> > can happen asynchronously / after try_to_unmap_one() checked PTE clean
> > and didn't set PageDirty, and if the page's PTE is present, there's no fault?
>
> Yeah, it was discussed.
>
> For clean pte, CPU has to fetch and update the actual pte entry, not TLB
> so trap triggers for MADV_FREE page.
>
> https://lkml.org/lkml/2015/4/15/565
> https://lkml.org/lkml/2015/4/16/136

Thanks for the pointers; great reading.
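
For reference, the ordering discussed above (read the counts first, then
a read barrier, then the dirty flag) can be modeled in userspace roughly
as below. This is only a sketch under stated assumptions, not kernel
code: struct model_page, model_page_init(), model_can_discard() and
model_demo() are made-up names, C11 atomics stand in for
page_ref_count(), page_mapcount() and PageDirty(), and an acquire fence
approximates smp_rmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page; not the kernel type. */
struct model_page {
	atomic_int ref_count;	/* models page_ref_count() */
	atomic_int map_count;	/* models page_mapcount() */
	atomic_bool dirty;	/* models PageDirty() */
};

static void model_page_init(struct model_page *p, int refs, int maps,
			    bool dirty)
{
	atomic_init(&p->ref_count, refs);
	atomic_init(&p->map_count, maps);
	atomic_init(&p->dirty, dirty);
}

/*
 * Returns true if the modeled lazy-free page may be discarded: the only
 * references are the isolation reference plus one per mapping, and the
 * page is clean. The counts are read before the dirty flag, with a fence
 * in between (assumption: acquire fence as an approximation of smp_rmb()).
 */
static bool model_can_discard(struct model_page *p)
{
	int ref_count = atomic_load_explicit(&p->ref_count,
					     memory_order_relaxed);
	int map_count = atomic_load_explicit(&p->map_count,
					     memory_order_relaxed);

	atomic_thread_fence(memory_order_acquire);

	return (ref_count - 1 == map_count) &&
	       !atomic_load_explicit(&p->dirty, memory_order_relaxed);
}

/* Small usage example covering the three cases from the thread. */
static void model_demo(void)
{
	struct model_page clean, pinned, redirtied;

	model_page_init(&clean, 2, 1, false);	/* isolation + one rmap */
	assert(model_can_discard(&clean));

	model_page_init(&pinned, 3, 1, false);	/* extra ref, e.g. DIO */
	assert(!model_can_discard(&pinned));

	model_page_init(&redirtied, 2, 1, true); /* written after MADV_FREE */
	assert(!model_can_discard(&redirtied));
}
```

The model only shows the order of the checks; it does not reproduce the
races discussed in this thread (fork/COW, or an interrupt between the
two count reads).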

cheers,

-- 
Mauricio Faria de Oliveira