Re: [PATCH 07/11] mm, madvise_inject_error: fix page count leak

Dan Williams <dan.j.williams@xxxxxxxxx> · Thu, 24 May 2018 13:55:04 -0700

On Tue, May 22, 2018 at 9:19 PM, Naoya Horiguchi
<n-horiguchi@xxxxxxxxxxxxx> wrote:
> On Tue, May 22, 2018 at 07:40:09AM -0700, Dan Williams wrote:
>> The madvise_inject_error() routine uses get_user_pages() to look up the
>> pfn and other information for the injected error, but it fails to release
>> that pin.
>>
>> The dax-dma-vs-truncate warning catches this failure with the following
>> signature:
>>
>>  Injecting memory failure for pfn 0x208900 at process virtual address 0x7f3908d00000
>>  Memory failure: 0x208900: reserved kernel page still referenced by 1 users
>>  Memory failure: 0x208900: recovery action for reserved kernel page: Failed
>>  WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
>>  CPU: 37 PID: 9566 Comm: umount Tainted: G        W  OE     4.17.0-rc6+ #1900
>>  [..]
>>  RIP: 0010:dax_disassociate_entry+0x4e/0x90
>>  RSP: 0018:ffffc9000a9b3b30 EFLAGS: 00010002
>>  RAX: ffffea0008224000 RBX: 0000000000208a00 RCX: 0000000000208900
>>  RDX: 0000000000000001 RSI: ffff8804058c6160 RDI: 0000000000000008
>>  RBP: 000000000822000a R08: 0000000000000002 R09: 0000000000208800
>>  R10: 0000000000000000 R11: 0000000000208801 R12: ffff8804058c6168
>>  R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000001
>>  FS:  00007f4548027fc0(0000) GS:ffff880431d40000(0000) knlGS:0000000000000000
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 000056316d5f8988 CR3: 00000004298cc000 CR4: 00000000000406e0
>>  Call Trace:
>>   __dax_invalidate_mapping_entry+0xab/0xe0
>>   dax_delete_mapping_entry+0xf/0x20
>>   truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
>>   truncate_inode_pages_range+0x291/0x920
>>   ? kmem_cache_free+0x1f8/0x300
>>   ? lock_acquire+0x9f/0x200
>>   ? truncate_inode_pages_final+0x31/0x50
>>   ext4_evict_inode+0x69/0x740
>>
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
>> Cc: Michal Hocko <mhocko@xxxxxxxx>
>> Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
>> Cc: Wu Fengguang <fengguang.wu@xxxxxxxxx>
>> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
>> ---
>>  mm/madvise.c |   11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 4d3c922ea1a1..246fa4d4eee2 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>>
>>
>>       for (; start < end; start += PAGE_SIZE << order) {
>> +             unsigned long pfn;
>>               int ret;
>>
>>               ret = get_user_pages_fast(start, 1, 0, &page);
>>               if (ret != 1)
>>                       return ret;
>> +             pfn = page_to_pfn(page);
>>
>>               /*
>>                * When soft offlining hugepages, after migrating the page
>> @@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
>>
>>               if (behavior == MADV_SOFT_OFFLINE) {
>>                       pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
>> -                                             page_to_pfn(page), start);
>> +                                     pfn, start);
>>
>>                       ret = soft_offline_page(page, MF_COUNT_INCREASED);
>> +                     put_page(page);
>>                       if (ret)
>>                               return ret;
>>                       continue;
>>               }
>> +             put_page(page);
>
> We keep the page count pinned after the isolation of the error page
> in order to make sure that the error page is disabled and never reused.
> This is not explicit enough in the code, so a comment would be helpful.

As far as I can see, this extra reference count to keep the page from
being reused should be taken inside memory_failure(), not assumed from
the inject error path. I might be overlooking something, but I do not
see who is responsible for taking this extra reference in the case
where memory_failure() is called by the machine check code rather than
madvise_inject_error().
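
To make the concern concrete, here is a rough userspace sketch of the
ownership model I have in mind. It is not kernel code: struct mock_page
and every mock_* function are hypothetical stand-ins, and the only
assumption is that the handler itself takes the reference that keeps a
poisoned page from being reused. With that arrangement the injection
path can drop its lookup pin, and the machine check path ends up with
the same count without any caller-side pin.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct page; not the kernel type. */
struct mock_page {
	int refcount;
	bool poisoned;
};

static void mock_get_page(struct mock_page *p)
{
	p->refcount++;
}

static void mock_put_page(struct mock_page *p)
{
	assert(p->refcount > 0);	/* dropping a reference nobody holds is a bug */
	p->refcount--;
}

/*
 * Hypothetical handler. The assumption being illustrated: the handler
 * takes its own reference to keep the poisoned page from being reused,
 * so no caller has to leave a pin behind on its behalf.
 */
static int mock_memory_failure(struct mock_page *p)
{
	mock_get_page(p);		/* held for as long as the page stays poisoned */
	p->poisoned = true;
	return 0;
}

/* Injection path: pin for the lookup, hand off, then drop the lookup pin. */
static int mock_inject_error(struct mock_page *p)
{
	int ret;

	mock_get_page(p);		/* models the get_user_pages_fast() pin */
	ret = mock_memory_failure(p);
	mock_put_page(p);		/* models the put_page() added by the patch */
	return ret;
}

/* Machine check path: calls the handler with no caller-side pin at all. */
static int mock_machine_check(struct mock_page *p)
{
	return mock_memory_failure(p);
}
```

Starting both pages from the same count, the two entry points leave
the page in the same state, which is the property I would expect the
real paths to have.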

>
> BTW, looking at the kernel message like "Memory failure: 0x208900:
> reserved kernel page still referenced by 1 users", memory_failure()
> considers dev_pagemap pages as "reserved kernel pages" (MF_MSG_KERNEL).
> If the memory error handler recovers a dev_pagemap page in its special
> way, we can define a new action_page_types entry like MF_MSG_DAX.
> Reporting like "Memory failure: 0xXXXXX: recovery action for dax page:
> Failed" might be helpful from the end user's perspective.

Sounds good, I'll take a look at this.
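
As a first approximation, the reporting change could look something
like the userspace sketch below. All mock_* names are hypothetical and
only mirror the shape of an enum plus a string table; the "dax page"
string comes from your example message, and the format follows the
"recovery action for ..." line in the log above. The real change would
go into the memory failure handler, which this does not attempt to
reproduce.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of the page-type enum; only two entries shown. */
enum mock_mf_action_page_type {
	MOCK_MF_MSG_KERNEL,
	MOCK_MF_MSG_DAX,	/* the proposed new entry */
	MOCK_MF_MSG_COUNT,
};

/* Human-readable name for each page type, indexed by the enum. */
static const char * const mock_action_page_types[] = {
	[MOCK_MF_MSG_KERNEL]	= "reserved kernel page",
	[MOCK_MF_MSG_DAX]	= "dax page",
};

static const char *mock_action_name(enum mock_mf_action_page_type type)
{
	assert(type < MOCK_MF_MSG_COUNT);
	return mock_action_page_types[type];
}

/* Format a report line in the style of the log quoted in the patch. */
static int mock_action_result(char *buf, size_t len, unsigned long pfn,
			      enum mock_mf_action_page_type type,
			      const char *result)
{
	return snprintf(buf, len,
			"Memory failure: %#lx: recovery action for %s: %s",
			pfn, mock_action_name(type), result);
}
```

With a dedicated entry, a failure on a dax mapping would no longer be
reported as a reserved kernel page.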