On 7/28/21 7:51 PM, Dan Williams wrote:
> On Wed, Jul 28, 2021 at 2:36 AM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>>
>> On 7/28/21 12:51 AM, Dan Williams wrote:
>>> On Thu, Jul 15, 2021 at 5:01 AM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>>>> On 7/15/21 12:36 AM, Dan Williams wrote:
>>>>> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>>>> This patch is not the culprit; the flaw is earlier in the series, specifically in the
>>>> fourth patch.
>>>>
>>>> The fourth patch needs the change below, owing to the elevated page refcount
>>>> established at zone device memmap init. The put_page() called here in memunmap_pages():
>>>>
>>>>         for (i = 0; i < pgmap->nr_ranges; i++)
>>>>                 for_each_device_pfn(pfn, pgmap, i)
>>>>                         put_page(pfn_to_page(pfn));
>>>>
>>>> ... would otherwise, on a zone_device compound memmap, always decrease the head page
>>>> refcount by the @geometry pfn amount (leading to the splat you reported).
>>>>
>>>> diff --git a/mm/memremap.c b/mm/memremap.c
>>>> index b0e7b8cf3047..79a883af788e 100644
>>>> --- a/mm/memremap.c
>>>> +++ b/mm/memremap.c
>>>> @@ -102,15 +102,15 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
>>>>         return (range->start + range_len(range)) >> PAGE_SHIFT;
>>>>  }
>>>>
>>>> -static unsigned long pfn_next(unsigned long pfn)
>>>> +static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
>>>>  {
>>>>         if (pfn % 1024 == 0)
>>>>                 cond_resched();
>>>> -       return pfn + 1;
>>>> +       return pfn + pgmap_pfn_geometry(pgmap);
>>>
>>> The cond_resched() would need to be fixed up too, to something like:
>>>
>>>         if (pfn % (1024 << pgmap_geometry_order(pgmap)) == 0)
>>>                 cond_resched();
>>>
>>> ...because the goal is to take a break every 1024 iterations, not
>>> every 1024 pfns.
>>>
>>
>> Ah, good point.
>>
>>>>  }
>>>>
>>>>  #define for_each_device_pfn(pfn, map, i) \
>>>> -       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
>>>> +       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))
>>>>
>>>>  static void dev_pagemap_kill(struct dev_pagemap *pgmap)
>>>>  {
>>>>
>>>> It could also get the hunk below, but that is somewhat redundant provided we won't touch
>>>> tail page refcounts throughout the devmap pages' lifetime. Setting the tail page
>>>> refcounts to zero was in the pre-v5.14 series, but it got removed under the assumption
>>>> that the pages come from the page allocator (where tail pages already have a zeroed
>>>> refcount).
>>>
>>> Wait, devmap pages never see the page allocator?
>>>
>> By "where tail pages already have a zeroed refcount" I actually meant 'freshly allocated
>> pages', and I was referring to commit 7118fc2906e2 ("hugetlb: address ref count racing in
>> prep_compound_gigantic_page"), which removed set_page_count() because setting the page
>> refcount to zero was redundant.
>
> Ah, maybe include that reference in the changelog?
>

Yeap, will do.

>>
>> Albeit devmap pages don't come from the page allocator: they live in a separate zone, and
>> these pages aren't part of the regular page pools (e.g. accessible via alloc_pages()), as
>> you are aware. Unless of course we reassign them via dax_kmem, but then the way we map
>> the struct pages would be regular, without any devmap stuff.
>
> Got it. I think with the back reference to that commit (7118fc2906e2)
> it resolves my confusion.
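
For the record, folding your cond_resched() suggestion into the fixup above, pfn_next()
would end up looking roughly like the below (an untested sketch, assuming the
pgmap_geometry_order() / pgmap_pfn_geometry() helpers from earlier in the series):

static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
{
        /* Take a break every 1024 iterations, not every 1024 pfns. */
        if (pfn % (1024 << pgmap_geometry_order(pgmap)) == 0)
                cond_resched();

        /* Advance a whole compound page, i.e. @geometry pfns, at a time. */
        return pfn + pgmap_pfn_geometry(pgmap);
}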
>
>>
>>>>
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index 96975edac0a8..469a7aa5cf38 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6623,6 +6623,7 @@ static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
>>>>                 __init_zone_device_page(page + i, pfn + i, zone_idx,
>>>>                                         nid, pgmap);
>>>>                 prep_compound_tail(page, i);
>>>> +               set_page_count(page + i, 0);
>>>
>>> Looks good to me, and perhaps add a check for an elevated tail page refcount at
>>> teardown, as a sanity check that the tail pages were never pinned
>>> directly?
>>>
>> Sorry, I didn't completely follow.
>>
>> Did you mean to set the tail page refcount back to 1 at teardown if it was kept at 0
>> (e.g. in memunmap_pages() after put_page()), or to check that the refcount is indeed
>> kept at zero after the put_page() in memunmap_pages()?
>
> The latter, i.e. would it be worth it to check that a tail page did
> not get accidentally pinned instead of a head page? I'm also ok to
> leave out that sanity checking for now.
>

What makes me not worry too much about the sanity checking is that this put_page() is
supposed to disappear here:

https://lore.kernel.org/linux-mm/20210717192135.9030-3-alex.sierra@xxxxxxx/

... and, in fact, none of the hunks here would matter either:

https://lore.kernel.org/linux-mm/f7217b61-c845-eaed-501e-c9e7067a6b87@xxxxxxxxxx/

... as there would no longer exist an elevated page refcount to deal with.
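
That said, were we to keep it for now, the check could look something like the below (a
hypothetical, untested sketch against the memunmap_pages() teardown loop quoted above):

        for (i = 0; i < pgmap->nr_ranges; i++)
                for_each_device_pfn(pfn, pgmap, i) {
                        struct page *page = pfn_to_page(pfn);
                        unsigned long j;

                        put_page(page);

                        /* Tail pages should never have been pinned directly. */
                        for (j = 1; j < pgmap_pfn_geometry(pgmap); j++)
                                VM_WARN_ON_ONCE(page_ref_count(page + j) != 0);
                }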