On Tue, May 22, 2018 at 9:19 PM, Naoya Horiguchi
<n-horiguchi@xxxxxxxxxxxxx> wrote:
> On Tue, May 22, 2018 at 07:40:09AM -0700, Dan Williams wrote:
>> The madvise_inject_error() routine uses get_user_pages() to lookup the
>> pfn and other information for injected error, but it fails to release
>> that pin.
>>
>> The dax-dma-vs-truncate warning catches this failure with the following
>> signature:
>>
>>  Injecting memory failure for pfn 0x208900 at process virtual address 0x7f3908d00000
>>  Memory failure: 0x208900: reserved kernel page still referenced by 1 users
>>  Memory failure: 0x208900: recovery action for reserved kernel page: Failed
>>  WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
>>  CPU: 37 PID: 9566 Comm: umount Tainted: G W OE 4.17.0-rc6+ #1900
>>  [..]
>>  RIP: 0010:dax_disassociate_entry+0x4e/0x90
>>  RSP: 0018:ffffc9000a9b3b30 EFLAGS: 00010002
>>  RAX: ffffea0008224000 RBX: 0000000000208a00 RCX: 0000000000208900
>>  RDX: 0000000000000001 RSI: ffff8804058c6160 RDI: 0000000000000008
>>  RBP: 000000000822000a R08: 0000000000000002 R09: 0000000000208800
>>  R10: 0000000000000000 R11: 0000000000208801 R12: ffff8804058c6168
>>  R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000001
>>  FS: 00007f4548027fc0(0000) GS:ffff880431d40000(0000) knlGS:0000000000000000
>>  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 000056316d5f8988 CR3: 00000004298cc000 CR4: 00000000000406e0
>>  Call Trace:
>>   __dax_invalidate_mapping_entry+0xab/0xe0
>>   dax_delete_mapping_entry+0xf/0x20
>>   truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
>>   truncate_inode_pages_range+0x291/0x920
>>   ? kmem_cache_free+0x1f8/0x300
>>   ? lock_acquire+0x9f/0x200
>>   ? truncate_inode_pages_final+0x31/0x50
>>   ext4_evict_inode+0x69/0x740
>>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
>> Cc: Michal Hocko <mhocko@xxxxxxxx>
>> Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
>> Cc: Wu Fengguang <fengguang.wu@xxxxxxxxx>
>> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
>> ---
>>  mm/madvise.c | 11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 4d3c922ea1a1..246fa4d4eee2 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>>
>>
>>  	for (; start < end; start += PAGE_SIZE << order) {
>> +		unsigned long pfn;
>>  		int ret;
>>
>>  		ret = get_user_pages_fast(start, 1, 0, &page);
>>  		if (ret != 1)
>>  			return ret;
>> +		pfn = page_to_pfn(page);
>>
>>  		/*
>>  		 * When soft offlining hugepages, after migrating the page
>> @@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
>>
>>  		if (behavior == MADV_SOFT_OFFLINE) {
>>  			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
>> -						page_to_pfn(page), start);
>> +						pfn, start);
>>
>>  			ret = soft_offline_page(page, MF_COUNT_INCREASED);
>> +			put_page(page);
>>  			if (ret)
>>  				return ret;
>>  			continue;
>>  		}
>> +		put_page(page);
>
> We keep the page count pinned after the isolation of the error page
> in order to make sure that the error page is disabled and never reused.
> This seems not explicit enough, so some comment should be helpful.

As far as I can see, this extra reference count to keep the page from
being reused should be taken internally by memory_failure(), not assumed
from the inject error path. I might be overlooking something, but I do
not see who is responsible for taking this extra reference in the case
where memory_failure() is called by the machine check code rather than
madvise_inject_error().

>
> BTW, looking at the kernel message like "Memory failure: 0x208900:
> reserved kernel page still referenced by 1 users", memory_failure()
> considers dev_pagemap pages as "reserved kernel pages" (MF_MSG_KERNEL).
> If memory error handler recovers a dev_pagemap page in its special way,
> we can define a new action_page_types entry like MF_MSG_DAX.
> Reporting like "Memory failure: 0xXXXXX: recovery action for dax page:
> Failed" might be helpful for end user's perspective.

Sounds good, I'll take a look at this.
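
For reference, a rough userspace sketch of the reporting change suggested above. This is only an illustration, not the eventual patch: the enums are cut down to the entries mentioned in this thread, the "dax page" string is taken from the proposed message, and format_action() is a made-up helper standing in for whatever memory_failure() uses to print the result.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Userspace model only. The real kernel enums have many more entries;
 * MF_MSG_DAX is the hypothetical new one proposed in this thread.
 */
enum mf_msg_type {
	MF_MSG_KERNEL,
	MF_MSG_DAX,
};

enum mf_result {
	MF_FAILED,
	MF_RECOVERED,
};

/* Human-readable page type, indexed by enum mf_msg_type. */
static const char *const action_page_types[] = {
	[MF_MSG_KERNEL]	= "reserved kernel page",
	[MF_MSG_DAX]	= "dax page",
};

/* Human-readable outcome, indexed by enum mf_result. */
static const char *const action_name[] = {
	[MF_FAILED]	= "Failed",
	[MF_RECOVERED]	= "Recovered",
};

/*
 * Hypothetical helper: build the report line in the same shape as the
 * "recovery action for ..." message quoted above. Returns what
 * snprintf() returns.
 */
static int format_action(char *buf, size_t len, unsigned long pfn,
			 enum mf_msg_type type, enum mf_result result)
{
	return snprintf(buf, len,
			"Memory failure: %#lx: recovery action for %s: %s",
			pfn, action_page_types[type], action_name[result]);
}
```

With a table like this, a failed recovery of pfn 0x208900 tagged as MF_MSG_DAX would be reported as a dax page rather than a reserved kernel page. How memory_failure() would decide that a page is a dax page is not covered here.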