On Thu, Jan 13, 2022 at 9:30 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> "Huang, Ying" <ying.huang@xxxxxxxxx> writes:
>
> > Minchan Kim <minchan@xxxxxxxxxx> writes:
> >
> >> On Wed, Jan 12, 2022 at 06:53:07PM -0300, Mauricio Faria de Oliveira wrote:
> >>> Hi Minchan Kim,
> >>>
> >>> Thanks for handling the hard questions! :)
> >>>
> >>> On Wed, Jan 12, 2022 at 2:33 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> >>> >
> >>> > On Wed, Jan 12, 2022 at 09:46:23AM +0800, Huang, Ying wrote:
> >>> > > Yu Zhao <yuzhao@xxxxxxxxxx> writes:
> >>> > >
> >>> > > > On Wed, Jan 05, 2022 at 08:34:40PM -0300, Mauricio Faria de Oliveira wrote:
> >>> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> >>> > > >> index 163ac4e6bcee..8671de473c25 100644
> >>> > > >> --- a/mm/rmap.c
> >>> > > >> +++ b/mm/rmap.c
> >>> > > >> @@ -1570,7 +1570,20 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>> > > >>
> >>> > > >>  		/* MADV_FREE page check */
> >>> > > >>  		if (!PageSwapBacked(page)) {
> >>> > > >> -			if (!PageDirty(page)) {
> >>> > > >> +			int ref_count = page_ref_count(page);
> >>> > > >> +			int map_count = page_mapcount(page);
> >>> > > >> +
> >>> > > >> +			/*
> >>> > > >> +			 * The only page refs must be from the isolation
> >>> > > >> +			 * (checked by the caller shrink_page_list() too)
> >>> > > >> +			 * and one or more rmap's (dropped by discard:).
> >>> > > >> +			 *
> >>> > > >> +			 * Check the reference count before dirty flag
> >>> > > >> +			 * with memory barrier; see __remove_mapping().
> >>> > > >> +			 */
> >>> > > >> +			smp_rmb();
> >>> > > >> +			if ((ref_count - 1 == map_count) &&
> >>> > > >> +			    !PageDirty(page)) {
> >>> > > >>  				/* Invalidate as we cleared the pte */
> >>> > > >>  				mmu_notifier_invalidate_range(mm,
> >>> > > >>  					address, address + PAGE_SIZE);
> >>> > > >
> >>> > > > Out of curiosity, how does it work with COW in terms of reordering?
> >>> > > > Specifically, it seems to me get_page() and page_dup_rmap() in
> >>> > > > copy_present_pte() can happen in any order, and if page_dup_rmap()
> >>> > > > is seen first, and direct io is holding a refcnt, this check can still
> >>> > > > pass?
> >>> > >
> >>> > > I think that you are correct.
> >>> > >
> >>> > > After more thought, it appears very tricky to compare the page count and
> >>> > > map count. Even if we have added smp_rmb() between page_ref_count() and
> >>> > > page_mapcount(), an interrupt may happen between them. During the
> >>> > > interrupt, the page count and map count may be changed, for example,
> >>> > > by unmapping or by do_swap_page().
> >>> >
> >>> > Yeah, that can happen, but what specific problem are you concerned about
> >>> > from the counts changing under the race? The fork case Yu pointed out was
> >>> > already known to break DIO, so users should take care not to fork under
> >>> > DIO (please look at the O_DIRECT section in man 2 open). If you could give
> >>> > a specific example, it would be great to think over the issue.
> >>> >
> >>> > I agree it's a little tricky, but it seems to be the way other places have
> >>> > used for a long time (please look at write_protect_page in ksm.c).
> >>>
> >>> Ah, it's great to see that it's being used elsewhere, for DIO particularly!
> >>>
> >>> > So, what we are missing here is a TLB flush before the check.
> >>>
> >>> That shouldn't be required for this particular issue/case, IIUIC.
> >>> One of the things we checked early on was disabling the deferred TLB flush
> >>> (similarly to what you've done), and it didn't help with the issue; also,
> >>> the issue happens in uniprocessor mode too (thus no remote CPU is involved).
> >>
> >> I guess you didn't try it with page_mapcount + 1 == page_count at that
> >> time? Anyway, I agree we don't need the TLB flush here that KSM uses.
> >> I think the reason KSM does a TLB flush before the check is to make sure
> >> a trap triggers on a write from a user process on another core.
> >> However, in this MADV_FREE case, HW already guarantees the trap.
> >> Please see below.
> >>
> >>> >
> >>> > Something like this.
> >>> >
> >>> > diff --git a/mm/rmap.c b/mm/rmap.c
> >>> > index b0fd9dc19eba..b4ad9faa17b2 100644
> >>> > --- a/mm/rmap.c
> >>> > +++ b/mm/rmap.c
> >>> > @@ -1599,18 +1599,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>> >
> >>> >                 /* MADV_FREE page check */
> >>> >                 if (!PageSwapBacked(page)) {
> >>> > -                       int refcount = page_ref_count(page);
> >>> > -
> >>> > -                       /*
> >>> > -                        * The only page refs must be from the isolation
> >>> > -                        * (checked by the caller shrink_page_list() too)
> >>> > -                        * and the (single) rmap (dropped by discard:).
> >>> > -                        *
> >>> > -                        * Check the reference count before dirty flag
> >>> > -                        * with memory barrier; see __remove_mapping().
> >>> > -                        */
> >>> > -                       smp_rmb();
> >>> > -                       if (refcount == 2 && !PageDirty(page)) {
> >>> > +                       if (!PageDirty(page) &&
> >>> > +                           page_mapcount(page) + 1 == page_count(page)) {
> >>>
> >>> In the interest of avoiding a different race/bug, it seemed worth following
> >>> the suggestion outlined in __remove_mapping(), i.e., checking PageDirty()
> >>> after the page's reference count, with a memory barrier in between.
> >>
> >> True, so it means your patch as-is is good for me.
> >
> > If my understanding is correct, a shared anonymous page will be mapped
> > read-only. If so, could SetPageDirty() be called concurrently on a private
> > anonymous page after the direct IO case has been dealt with by comparing
> > page_count()/page_mapcount()?
>
> Sorry, I found that I was not quite right here. When a direct IO read
> completes, it will call SetPageDirty() and finally put_page(). And
> MADV_FREE in try_to_unmap_one() needs to deal with that too.
>
> Checking the direct IO code, it appears that set_page_dirty_lock() is used
> to set the page dirty, which will take lock_page().
>
> dio_bio_complete
>   bio_check_pages_dirty
>     bio_dirty_fn /* through workqueue */
>       bio_release_pages
>         set_page_dirty_lock
>   bio_release_pages
>     set_page_dirty_lock
>
> So in theory, for direct IO, the memory barrier may be unnecessary. But
> I don't think it's a good idea to depend on this specific behavior of
> direct IO. The original code with the memory barrier looks more generic,
> and I don't think it will introduce visible overhead.
>

Thanks for all the considerations/thought process with the potential corner
cases!

Regarding the overhead, agreed; and this is in memory reclaim, which isn't a
fast path (and even under direct reclaim, things have slowed down already),
so that would seem to be fine.

cheers,

> Best Regards,
> Huang, Ying

--
Mauricio Faria de Oliveira
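
A note on the barrier pairing discussed above: the ordering contract is the
one documented in __remove_mapping(). The completion side must make the dirty
flag visible before its page reference can be observed as dropped, and the
reclaim side must read the reference count before the dirty flag. The sketch
below is a minimal userspace model of that contract, using C11 acquire/release
atomics as a stand-in for the ordering the kernel gets around put_page() and
smp_rmb(). The file, function, and variable names are invented for
illustration; this is a model of the reasoning, not kernel code.

/* model.c - build with: cc -std=c11 -pthread model.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int refcount = 3;    /* isolation + one rmap + one DIO ref */
static atomic_bool dirty = false;  /* models PageDirty() */
static const int mapcount = 1;     /* held constant in this model */

/* Models DIO read completion: set the dirty flag, then drop the DIO ref.
 * The release fetch_sub makes the dirty store visible to any thread that
 * observes the dropped reference. */
static void *dio_completion(void *arg)
{
	(void)arg;
	atomic_store_explicit(&dirty, true, memory_order_relaxed);
	atomic_fetch_sub_explicit(&refcount, 1, memory_order_release);
	return NULL;
}

/* Models the MADV_FREE check in try_to_unmap_one(): read the refcount
 * first, then the dirty flag.  The acquire load plays the role of the
 * smp_rmb() between page_ref_count() and PageDirty(). */
static void *reclaim_check(void *arg)
{
	(void)arg;
	int ref = atomic_load_explicit(&refcount, memory_order_acquire);
	bool d = atomic_load_explicit(&dirty, memory_order_relaxed);

	if (ref - 1 == mapcount && !d)
		puts("discard: no extra ref, page clean");
	else
		printf("keep: ref=%d dirty=%d\n", ref, (int)d);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, dio_completion, NULL);
	pthread_create(&b, NULL, reclaim_check, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

In this model the "discard" branch is unreachable: reclaim_check either sees
ref == 3 (the DIO reference is still held) and keeps the page, or it sees
ref == 2, in which case the release/acquire pairing guarantees it also sees
dirty == true. Demote the acquire load to memory_order_relaxed and the model
admits exactly the stale-dirty-flag outcome that the smp_rmb() in the patch
is there to close.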