Re: [PATCH v4] mm/page_vma_mapped.c: Explicitly compare pfn for normal, hugetlbfs and THP page

Yang Shi <shy828301@xxxxxxxxx> · Thu, 3 Sep 2020 09:08:15 -0700

On Thu, Sep 3, 2020 at 1:02 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Wed 02-09-20 15:26:39, Yang Shi wrote:
> > On Thu, Jan 16, 2020 at 2:29 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > >
> > > On Wed 15-01-20 16:40:26, Mike Kravetz wrote:
> > > > On 1/15/20 1:31 AM, Michal Hocko wrote:
> > > > > On Tue 14-01-20 21:47:51, Li Xinhai wrote:
> > > > >> On 2020-01-14 at 17:25 Michal Hocko wrote:
> > > > >>> On Sat 11-01-20 10:18:05, Li Xinhai wrote:
> > > > >>>> When check_pte, pfn of normal, hugetlbfs and THP page need be compared.
> > > > >>>> The current implementation apply comparison as
> > > > >>>> - normal 4K page: page_pfn <= pfn < page_pfn + 1
> > > > >>>> - hugetlbfs page:  page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > > > >>>> - THP page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > > > >>>> in pfn_in_hpage. For hugetlbfs page, it should be
> > > > >>>> page_pfn == pfn
> > > > >>>>
> > > > >>>> Now, change pfn_in_hpage to pfn_is_match to highlight that comparison
> > > > >>>> is not only for THP and explicitly compare for these cases.
> > > > >>>
> > > > >>> Why is this important to do. I have asked and Mike had the same feeling
> > > > >>> that the patch is missing any real justification. Why do we need this
> > > > >>> change? It is great that you have dropped VM_BUG_ON btw.
> > > > >>>
> > > > >> I think it is important to make the code clear, as said, comparing hugetlbfs page
> > > > >> in range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR is confusion.
> > > > >
> > > > > I am sorry but I do not really see a big confusion here. It is probably
> > > > > a matter of taste. If others consider this an improvement then I will
> > > > > not stand in the way but I consider the justification insuficient for
> > > > > merging.
> > > >
> > > > Perhaps I am confused, but the patch does update the check done for
> > > > hugetlb pages.  IIUC, previously there was no distinction between THP
> > > > pages and hugetlb pages in the check.  It is valid to pass in a sub-
> > > > page of a THP page, but not that of a hugetlb page.
> > > >
> > > > I do not believe it is possible for existing code to pass in a sub-page
> > > > of a hugetlb page.  And, no one has ever seen this as an issue.  This
> > > > is why I was questioning the need for the patch.
> > >
> > > Exactly and that is the reason I fail the see a point.
> > >
> > > > With all that said, the new code/check is better (more complete) than
> > > > the original.  It may not do anything for the current code base, but
> > > > it 'could' catch potential errors in future code.  Because of this, I
> > > > do consider the code an improvement.
> > >
> > > It adds a branch for something that doesn't happen and also to a code
> > > path which is quite internal to the MM AFAICS. That being said, if you
> > > believe this is an improvement then I will not stand in the way. But
> > > there are so many other places which could add checks that are not
> > > exercised...
> >
> > I just saw a bad page bug on one our host with 4.14 kernel:
> >
> > backtrace:
> > :BUG: Bad page map in process python3.6  pte:1036e24025 pmd:1dd743d067
> > :page:fffff62840db8900 count:105 mapcount:-35 mapping:ffff97d432e97ad0 index:0x1
> > :flags: 0xbfffe000001006c(referenced|uptodate|lru|active|mappedtodisk)
> > :raw: 0bfffe000001006c ffff97d432e97ad0 0000000000000001 00000069ffffffdc
> > :raw: dead000000000100 dead000000000200 0000000000000000 ffff97c58fc0e000
> > :page dumped because: bad pte
> > :page->mem_cgroup:ffff97c58fc0e000
> > :addr:00007f2ddffcc000 vm_flags:00000075 anon_vma:          (null)
> > mapping:ffff97d432e97ad0 index:7f
> > :file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap
> > [xfs] readpage:xfs_vm_readpage [xfs]
> > :do_exit+0x563/0xbb0
> > :? vfs_write+0x162/0x1a0
> > :do_group_exit+0x3a/0xa0
> > :file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap
> > [xfs] readpage:xfs_vm_readpage [xfs]
> > :SyS_exit_group+0x10/0x10
> > :do_syscall_64+0x60/0x110
> > :entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > :RIP: 0033:0x7fb2504091d9
> > :RSP: 002b:00007ffcf035db88 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> > :RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb2504091d9
> > :RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
> > :RBP: 00007fb250706838 R08: 000000000000003c R09: 00000000000000e7
> > :R10: ffffffffffffff70 R11: 0000000000000246 R12: 00007fb250706838
> > :R13: 00007fb25070be80 R14: 0000000000000000 R15: 0000000000000000
> > :CPU: 42 PID: 454567 Comm: python3.6 Tainted: G    B   W
> > 4.14.98-t5.el7.twitter.x86_64 #1
> >
> > I just saw it once and it didn't happen on the newer kernel (maybe
> > just not happen yet). I can't tell why the mapcount could reach -35
> > since all page_remove_rmap() is protected by page table lock. Then I
> > looked into pvmw code, and suspected the PTE might be changed after
> > acquiring ptl.
> >
> > With the old check it was fine as long as "page_pfn <= pfn < page_pfn
> > + 1", but for the base page case, it means we might be dec'ing
> > mapcount for another page.
> >
> > Then I came across this commit. I guess this should be able to solve
> > the problem, but unfortunately the problem is very rare so basically I
> > can't prove this commit could solve it.
> >
> > If you agree my analysis, we may consider to backport this to stable tree.
>
> I am not sure I follow. Your page looks like a normal page cache and the
> patch you are referring to is only clarifying hugetlb pages. I do not
> remember details but it shouldn't have any functional effect. Are those
> even used in your environemnt?

Not only does the code touch hugetlbfs page, please see the below code snippet:

+       /* normal page and hugetlbfs page */
+       if (!PageTransCompound(page) || PageHuge(page))
+               return page_pfn == pfn;

The potential problem per my understanding is pvmw does:

1. read pte
2. lock ptl
3. check pfn

During step #1 and #2 the PTE might be changed, it is not surprising
we typically have pte_same in page fault path to check this after
acquiring ptl, but for pvmw path a full pte_same check might be
unnecessary, since we just care if the pfn is intact or not.

But before the patch as long as "old_pfn < new_pfn < old_pfn + 512" is
true the check would pass for both normal base page and hugetlbfs page
even though the new pfn is changed, that commit tightened the check.
For the normal base page it must be "old_pfn == new_pfn".

> --
> Michal Hocko
> SUSE Labs