On Thu, Jan 16, 2020 at 2:29 AM Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>
> On Wed 15-01-20 16:40:26, Mike Kravetz wrote:
> > On 1/15/20 1:31 AM, Michal Hocko wrote:
> > > On Tue 14-01-20 21:47:51, Li Xinhai wrote:
> > >> On 2020-01-14 at 17:25 Michal Hocko wrote:
> > >>> On Sat 11-01-20 10:18:05, Li Xinhai wrote:
> > >>>> In check_pte(), the pfn of a normal, hugetlbfs or THP page needs
> > >>>> to be compared. The current implementation applies the comparison
> > >>>> in pfn_in_hpage as
> > >>>> - normal 4K page: page_pfn <= pfn < page_pfn + 1
> > >>>> - hugetlbfs page: page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > >>>> - THP page:       page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > >>>> For a hugetlbfs page, it should be
> > >>>>     page_pfn == pfn
> > >>>>
> > >>>> Now, rename pfn_in_hpage to pfn_is_match to highlight that the
> > >>>> comparison is not only for THP, and compare these cases explicitly.
> > >>>
> > >>> Why is this important to do? I have asked, and Mike had the same
> > >>> feeling, that the patch is missing any real justification. Why do
> > >>> we need this change? It is great that you have dropped the
> > >>> VM_BUG_ON, btw.
> > >>>
> > >> I think it is important to make the code clear. As said, comparing a
> > >> hugetlbfs page in the range page_pfn <= pfn < page_pfn + HPAGE_PMD_NR
> > >> is confusing.
> > >
> > > I am sorry, but I do not really see a big confusion here. It is
> > > probably a matter of taste. If others consider this an improvement
> > > then I will not stand in the way, but I consider the justification
> > > insufficient for merging.
> >
> > Perhaps I am confused, but the patch does update the check done for
> > hugetlb pages. IIUC, previously there was no distinction between THP
> > pages and hugetlb pages in the check. It is valid to pass in a
> > sub-page of a THP page, but not that of a hugetlb page.
> >
> > I do not believe it is possible for existing code to pass in a
> > sub-page of a hugetlb page. And no one has ever seen this as an
> > issue. This is why I was questioning the need for the patch.
>
> Exactly, and that is the reason I fail to see the point.
>
> > With all that said, the new code/check is better (more complete) than
> > the original. It may not do anything for the current code base, but
> > it 'could' catch potential errors in future code. Because of this, I
> > do consider the code an improvement.
>
> It adds a branch for something that doesn't happen, and also to a code
> path which is quite internal to MM AFAICS. That being said, if you
> believe this is an improvement then I will not stand in the way. But
> there are so many other places which could add checks that are not
> exercised...

I just saw a bad page bug on one of our hosts running a 4.14 kernel:

BUG: Bad page map in process python3.6  pte:1036e24025 pmd:1dd743d067
page:fffff62840db8900 count:105 mapcount:-35 mapping:ffff97d432e97ad0 index:0x1
flags: 0xbfffe000001006c(referenced|uptodate|lru|active|mappedtodisk)
raw: 0bfffe000001006c ffff97d432e97ad0 0000000000000001 00000069ffffffdc
raw: dead000000000100 dead000000000200 0000000000000000 ffff97c58fc0e000
page dumped because: bad pte
page->mem_cgroup:ffff97c58fc0e000
addr:00007f2ddffcc000 vm_flags:00000075 anon_vma: (null) mapping:ffff97d432e97ad0 index:7f
file:libc-2.17.so fault:xfs_filemap_fault [xfs] mmap:xfs_file_mmap [xfs] readpage:xfs_vm_readpage [xfs]
do_exit+0x563/0xbb0
? vfs_write+0x162/0x1a0
do_group_exit+0x3a/0xa0
SyS_exit_group+0x10/0x10
do_syscall_64+0x60/0x110
entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fb2504091d9
RSP: 002b:00007ffcf035db88 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb2504091d9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 00007fb250706838 R08: 000000000000003c R09: 00000000000000e7
R10: ffffffffffffff70 R11: 0000000000000246 R12: 00007fb250706838
R13: 00007fb25070be80 R14: 0000000000000000 R15: 0000000000000000
CPU: 42 PID: 454567 Comm: python3.6 Tainted: G B W 4.14.98-t5.el7.twitter.x86_64 #1

I have only seen it once, and it hasn't happened on newer kernels (or
maybe it just hasn't happened yet). I can't tell why the mapcount could
reach -35, since every page_remove_rmap() is protected by the page table
lock. Then I looked into the pvmw code and suspected the PTE might be
changed after acquiring the ptl. With the old check it was fine as long
as "page_pfn <= pfn < page_pfn + 1" held, but for the base page case it
means we might be dec'ing the mapcount of another page.
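
For reference, here is roughly what the old and new checks look like (a
sketch pieced together from the patch description above, not a verbatim
copy of mm/page_vma_mapped.c; in particular, the use of hpage_nr_pages()
in the old variant is my reading of the pre-patch code):

	/* Old check: one range comparison for every page type. */
	static inline bool pfn_in_hpage(struct page *hpage, unsigned long pfn)
	{
		unsigned long hpage_pfn = page_to_pfn(hpage);

		/* THP can be referenced by any subpage */
		return pfn >= hpage_pfn &&
		       pfn - hpage_pfn < hpage_nr_pages(hpage);
	}

	/* New check: exact pfn match for normal and hugetlbfs pages. */
	static inline bool pfn_is_match(struct page *page, unsigned long pfn)
	{
		unsigned long page_pfn = page_to_pfn(page);

		/* normal page and hugetlbfs page: must match exactly */
		if (!PageTransCompound(page) || PageHuge(page))
			return page_pfn == pfn;

		/* THP can be referenced by any subpage */
		return pfn >= page_pfn &&
		       pfn - page_pfn < hpage_nr_pages(page);
	}

The difference only matters when the pfn under the pte does not exactly
match the page we expect, which is precisely what a racing pte change
would produce.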

Then I came across this commit. I guess it should be able to solve the
problem, but unfortunately the problem is so rare that I basically
can't prove this commit fixes it. If you agree with my analysis, we may
consider backporting this to the stable tree.

> > Acked-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>
> --
> Michal Hocko
> SUSE Labs
>