Re: [PATCH v3 12/14] device-dax: compound pagemap support

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 7/15/21 12:36 AM, Dan Williams wrote:
> On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>>
>> Use the newly added compound pagemap facility which maps the assigned dax
>> ranges as compound pages at a page size of @align. Currently, this means,
>> that region/namespace bootstrap would take considerably less, given that
>> you would initialize considerably less pages.
>>
>> On setups with 128G NVDIMMs the initialization with DRAM stored struct
>> pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
>> than a 1msec with 1G pages.
>>
>> dax devices are created with a fixed @align (huge page size) which is
>> enforced through as well at mmap() of the device. Faults, consequently
>> happen too at the specified @align specified at the creation, and those
>> don't change through out dax device lifetime. MCEs poisons a whole dax
>> huge page, as well as splits occurring at the configured page size.
>>
> 
> Hi Joao,
> 
> With this patch I'm hitting the following with the 'device-dax' test [1].
> 
Ugh, I can reproduce it too -- apologies for the oversight.

This patch is not the culprit, the flaw is early in the series, specifically the fourth patch.

It needs this chunk below change on the fourth patch due to the existing elevated page ref
count at zone device memmap init. put_page() called here in memunmap_pages():

for (i = 0; i < pgmap->nr_ranges; i++)
	for_each_device_pfn(pfn, pgmap, i)
		put_page(pfn_to_page(pfn));

... on a zone_device compound memmap would otherwise always decrease head page refcount by
@geometry pfn amount (leading to the aforementioned splat you reported).

diff --git a/mm/memremap.c b/mm/memremap.c
index b0e7b8cf3047..79a883af788e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -102,15 +102,15 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
        return (range->start + range_len(range)) >> PAGE_SHIFT;
 }

-static unsigned long pfn_next(unsigned long pfn)
+static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
 {
        if (pfn % 1024 == 0)
                cond_resched();
-       return pfn + 1;
+       return pfn + pgmap_pfn_geometry(pgmap);
 }

 #define for_each_device_pfn(pfn, map, i) \
-       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
+       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))

 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {

It could also get this hunk below, but it is sort of redundant provided we won't touch
tail page refcount through out the devmap pages lifetime. This setting of tail pages
refcount to zero was in pre-v5.14 series, but it got removed under the assumption it comes
from the page allocator (where tail pages are already zeroed in refcount).

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 96975edac0a8..469a7aa5cf38 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6623,6 +6623,7 @@ static void __ref memmap_init_compound(struct page *page, unsigned
long pfn,
                __init_zone_device_page(page + i, pfn + i, zone_idx,
                                        nid, pgmap);
                prep_compound_tail(page, i);
+               set_page_count(page + i, 0);

                /*
                 * The first and second tail pages need to



[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux