On 6/2/21 1:36 AM, Dan Williams wrote: > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@xxxxxxxxxx> wrote: >> >> dax devices are created with a fixed @align (huge page size) which >> is enforced through as well at mmap() of the device. Faults, >> consequently happen too at the specified @align specified at the >> creation, and those don't change through out dax device lifetime. >> MCEs poisons a whole dax huge page, as well as splits occurring at >> at the configured page size. > > This paragraph last... > /me nods >> >> Use the newly added compound pagemap facility which maps the >> assigned dax ranges as compound pages at a page size of @align. >> Currently, this means, that region/namespace bootstrap would take >> considerably less, given that you would initialize considerably less >> pages. > > This paragraph should go first... > /me nods >> >> On setups with 128G NVDIMMs the initialization with DRAM stored struct pages >> improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less than >> a 1msec with 1G pages. > > This paragraph second... > /me nods > > The reason for this ordering is to have increasingly more detail as > the changelog is read so that people that don't care about the details > can get the main theme immediately, and others that wonder why > device-dax is able to support this can read deeper. > >> >> Signed-off-by: Joao Martins <joao.m.martins@xxxxxxxxxx> >> --- >> drivers/dax/device.c | 58 ++++++++++++++++++++++++++++++++++---------- >> 1 file changed, 45 insertions(+), 13 deletions(-) >> >> diff --git a/drivers/dax/device.c b/drivers/dax/device.c >> index db92573c94e8..e3dcc4ad1727 100644 >> --- a/drivers/dax/device.c >> +++ b/drivers/dax/device.c >> @@ -192,6 +192,43 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax, >> } >> #endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ >> >> +static void set_page_mapping(struct vm_fault *vmf, pfn_t pfn, >> + unsigned long fault_size, >> + struct address_space *f_mapping) >> +{ >> + unsigned long i; >> + pgoff_t pgoff; >> + >> + pgoff = linear_page_index(vmf->vma, vmf->address >> + & ~(fault_size - 1)); > > I know you are just copying this style from whomever wrote it this way > originally, but that person (me) was wrong this should be: > > pgoff = linear_page_index(vmf->vma, ALIGN(vmf->address, fault_size)); > > ...you might do a lead-in cleanup patch before this one. > Yeap, will do. > >> + >> + for (i = 0; i < fault_size / PAGE_SIZE; i++) { >> + struct page *page; >> + >> + page = pfn_to_page(pfn_t_to_pfn(pfn) + i); >> + if (page->mapping) >> + continue; >> + page->mapping = f_mapping; >> + page->index = pgoff + i; >> + } >> +} >> + >> +static void set_compound_mapping(struct vm_fault *vmf, pfn_t pfn, >> + unsigned long fault_size, >> + struct address_space *f_mapping) >> +{ >> + struct page *head; >> + >> + head = pfn_to_page(pfn_t_to_pfn(pfn)); >> + head = compound_head(head); >> + if (head->mapping) >> + return; >> + >> + head->mapping = f_mapping; >> + head->index = linear_page_index(vmf->vma, vmf->address >> + & ~(fault_size - 1)); >> +} >> + >> static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, >> enum page_entry_size pe_size) >> { >> @@ -225,8 +262,7 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, >> } >> >> if (rc == VM_FAULT_NOPAGE) { >> - unsigned long i; >> - pgoff_t pgoff; >> + struct dev_pagemap *pgmap = pfn_t_to_page(pfn)->pgmap; > > The device should already know its pagemap... > > There is a distinction in dev_dax_probe() for "static" vs "dynamic" > pgmap, but once the pgmap is allocated it should be fine to assign it > back to dev_dax->pgmap in the "dynamic" case. That could be a lead-in > patch to make dev_dax->pgmap always valid. > I suppose you mean to always set dev_dax->pgmap at the end of the 'if (!pgmap)' in dev_dax_probe() after we allocate the pgmap. I will make this a separate cleanup patch as you suggested. >> >> /* >> * In the device-dax case the only possibility for a >> @@ -234,17 +270,10 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, >> * mapped. No need to consider the zero page, or racing >> * conflicting mappings. >> */ >> - pgoff = linear_page_index(vmf->vma, vmf->address >> - & ~(fault_size - 1)); >> - for (i = 0; i < fault_size / PAGE_SIZE; i++) { >> - struct page *page; >> - >> - page = pfn_to_page(pfn_t_to_pfn(pfn) + i); >> - if (page->mapping) >> - continue; >> - page->mapping = filp->f_mapping; >> - page->index = pgoff + i; >> - } >> + if (pgmap->align > PAGE_SIZE) >> + set_compound_mapping(vmf, pfn, fault_size, filp->f_mapping); >> + else >> + set_page_mapping(vmf, pfn, fault_size, filp->f_mapping); >> } >> dax_read_unlock(id); >> >> @@ -426,6 +455,9 @@ int dev_dax_probe(struct dev_dax *dev_dax) >> } >> >> pgmap->type = MEMORY_DEVICE_GENERIC; >> + if (dev_dax->align > PAGE_SIZE) >> + pgmap->align = dev_dax->align; > > Just needs updates for whatever renames you do for the "compound > geometry" terminology rather than subtle side effects of "align". > > Other than that, looks good to me. > OK, will do. Thanks!