Re: [HMM v16 04/15] mm/ZONE_DEVICE/unaddressable: add support for un-addressable device memory v2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Jan 16, 2017 at 7:17 AM, Jerome Glisse <jglisse@xxxxxxxxxx> wrote:
> On Sun, Jan 15, 2017 at 11:05:43PM -0800, Dan Williams wrote:
>> On Thu, Jan 12, 2017 at 8:30 AM, Jérôme Glisse <jglisse@xxxxxxxxxx> wrote:
>> > This add support for un-addressable device memory. Such memory is hotpluged
>> > only so we can have struct page but we should never map them as such memory
>> > can not be accessed by CPU. For that reason it uses a special swap entry for
>> > CPU page table entry.
>> >
>> > This patch implement all the logic from special swap type to handling CPU
>> > page fault through a callback specified in the ZONE_DEVICE pgmap struct.
>> >
>> > Architecture that wish to support un-addressable device memory should make
>> > sure to never populate the kernel linar mapping for the physical range.
>> >
>> > This feature potentially breaks memory hotplug unless every driver using it
>> > magically predicts the future addresses of where memory will be hotplugged.
>> >
>> > Changed since v1:
>> >   - Add unaddressable memory resource descriptor enum
>> >   - Explain why memory hotplug can fail because of un-addressable memory
>>
>> How can we merge this with a known potential regression?
>
> This only affect people using HMM it does not affect people that do not
> use it. Like i said for time being people that want HMM are ok with that.
> I haven't look yet on how to allow struct page about MAX_PHYSMEM_BITS
>
>>
>> >
>> > Signed-off-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
>> > Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
>> > Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx>
>> > ---
>> >  drivers/dax/pmem.c                |  4 +--
>> >  drivers/nvdimm/pmem.c             |  6 ++--
>> >  fs/proc/task_mmu.c                | 10 +++++-
>> >  include/linux/ioport.h            |  1 +
>> >  include/linux/memory_hotplug.h    |  7 ++++
>> >  include/linux/memremap.h          | 29 +++++++++++++++--
>> >  include/linux/swap.h              | 18 +++++++++--
>> >  include/linux/swapops.h           | 67 +++++++++++++++++++++++++++++++++++++++
>> >  kernel/memremap.c                 | 43 +++++++++++++++++++++++--
>> >  mm/Kconfig                        | 12 +++++++
>> >  mm/memory.c                       | 64 ++++++++++++++++++++++++++++++++++++-
>> >  mm/memory_hotplug.c               | 10 ++++--
>> >  mm/mprotect.c                     | 12 +++++++
>> >  tools/testing/nvdimm/test/iomap.c |  3 +-
>> >  14 files changed, 269 insertions(+), 17 deletions(-)
>> >
>> > diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
>> > index 66af7b1..c50b58d 100644
>> > --- a/drivers/dax/pmem.c
>> > +++ b/drivers/dax/pmem.c
>> > @@ -111,8 +111,8 @@ static int dax_pmem_probe(struct device *dev)
>> >         if (rc)
>> >                 return rc;
>> >
>> > -       addr = devm_memremap_pages(dev, &res, &dax_pmem->ref,
>> > -                                  altmap, NULL, NULL);
>> > +       addr = devm_memremap_pages(dev, &res, &dax_pmem->ref, altmap,
>> > +                                  NULL, NULL, NULL, NULL, MEMORY_DEVICE);
>> >         if (IS_ERR(addr))
>> >                 return PTR_ERR(addr);
>> >
>> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
>> > index f2f1904..8166a56 100644
>> > --- a/drivers/nvdimm/pmem.c
>> > +++ b/drivers/nvdimm/pmem.c
>> > @@ -282,7 +282,8 @@ static int pmem_attach_disk(struct device *dev,
>> >         pmem->pfn_flags = PFN_DEV;
>> >         if (is_nd_pfn(dev)) {
>> >                 addr = devm_memremap_pages(dev, &pfn_res, &q->q_usage_counter,
>> > -                                          altmap, NULL, NULL);
>> > +                                          altmap, NULL, NULL, NULL,
>> > +                                          NULL, MEMORY_DEVICE);
>> >                 pfn_sb = nd_pfn->pfn_sb;
>> >                 pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
>> >                 pmem->pfn_pad = resource_size(res) - resource_size(&pfn_res);
>> > @@ -292,7 +293,8 @@ static int pmem_attach_disk(struct device *dev,
>> >         } else if (pmem_should_map_pages(dev)) {
>> >                 addr = devm_memremap_pages(dev, &nsio->res,
>> >                                            &q->q_usage_counter,
>> > -                                          NULL, NULL, NULL);
>> > +                                          NULL, NULL, NULL, NULL,
>> > +                                          NULL, MEMORY_DEVICE);
>> >                 pmem->pfn_flags |= PFN_MAP;
>> >         } else
>> >                 addr = devm_memremap(dev, pmem->phys_addr,
>> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> > index 958f325..9a6ab71 100644
>> > --- a/fs/proc/task_mmu.c
>> > +++ b/fs/proc/task_mmu.c
>> > @@ -535,8 +535,11 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
>> >                         } else {
>> >                                 mss->swap_pss += (u64)PAGE_SIZE << PSS_SHIFT;
>> >                         }
>> > -               } else if (is_migration_entry(swpent))
>> > +               } else if (is_migration_entry(swpent)) {
>> >                         page = migration_entry_to_page(swpent);
>> > +               } else if (is_device_entry(swpent)) {
>> > +                       page = device_entry_to_page(swpent);
>> > +               }
>> >         } else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
>> >                                                         && pte_none(*pte))) {
>> >                 page = find_get_entry(vma->vm_file->f_mapping,
>> > @@ -699,6 +702,8 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
>> >
>> >                 if (is_migration_entry(swpent))
>> >                         page = migration_entry_to_page(swpent);
>> > +               if (is_device_entry(swpent))
>> > +                       page = device_entry_to_page(swpent);
>> >         }
>> >         if (page) {
>> >                 int mapcount = page_mapcount(page);
>> > @@ -1182,6 +1187,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>> >                 flags |= PM_SWAP;
>> >                 if (is_migration_entry(entry))
>> >                         page = migration_entry_to_page(entry);
>> > +
>> > +               if (is_device_entry(entry))
>> > +                       page = device_entry_to_page(entry);
>> >         }
>> >
>> >         if (page && !PageAnon(page))
>> > diff --git a/include/linux/ioport.h b/include/linux/ioport.h
>> > index 6230064..d154a18 100644
>> > --- a/include/linux/ioport.h
>> > +++ b/include/linux/ioport.h
>> > @@ -130,6 +130,7 @@ enum {
>> >         IORES_DESC_ACPI_NV_STORAGE              = 3,
>> >         IORES_DESC_PERSISTENT_MEMORY            = 4,
>> >         IORES_DESC_PERSISTENT_MEMORY_LEGACY     = 5,
>> > +       IORES_DESC_UNADDRESSABLE_MEMORY         = 6,
>> >  };
>> >
>> >  /* helpers to define resources */
>> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>> > index 3f50eb8..e7c5dc6 100644
>> > --- a/include/linux/memory_hotplug.h
>> > +++ b/include/linux/memory_hotplug.h
>> > @@ -285,15 +285,22 @@ extern int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
>> >   * never relied on struct page migration so far and new user of might also
>> >   * prefer avoiding struct page migration.
>> >   *
>> > + * For device memory (which use ZONE_DEVICE) we want differentiate between CPU
>> > + * accessible memory (persitent memory, device memory on an architecture with a
>> > + * system bus that allow transparent access to device memory) and unaddressable
>> > + * memory (device memory that can not be accessed by CPU directly).
>> > + *
>> >   * New non device memory specific flags can be added if ever needed.
>> >   *
>> >   * MEMORY_REGULAR: regular system memory
>> >   * DEVICE_MEMORY: device memory create a ZONE_DEVICE zone for it
>> >   * DEVICE_MEMORY_ALLOW_MIGRATE: page in that device memory ca be migrated
>> > + * MEMORY_DEVICE_UNADDRESSABLE: un-addressable memory (CPU can not access it)
>> >   */
>> >  #define MEMORY_NORMAL 0
>> >  #define MEMORY_DEVICE (1 << 0)
>> >  #define MEMORY_DEVICE_ALLOW_MIGRATE (1 << 1)
>> > +#define MEMORY_DEVICE_UNADDRESSABLE (1 << 2)
>> >
>> >  extern int arch_add_memory(int nid, u64 start, u64 size, int flags);
>> >  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
>> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
>> > index 582561f..4b9f02c 100644
>> > --- a/include/linux/memremap.h
>> > +++ b/include/linux/memremap.h
>> > @@ -35,31 +35,42 @@ static inline struct vmem_altmap *to_vmem_altmap(unsigned long memmap_start)
>> >  }
>> >  #endif
>> >
>> > +typedef int (*dev_page_fault_t)(struct vm_area_struct *vma,
>> > +                               unsigned long addr,
>> > +                               struct page *page,
>> > +                               unsigned flags,
>> > +                               pmd_t *pmdp);
>> >  typedef void (*dev_page_free_t)(struct page *page, void *data);
>> >
>> >  /**
>> >   * struct dev_pagemap - metadata for ZONE_DEVICE mappings
>> > + * @page_fault: callback when CPU fault on an un-addressable device page
>> >   * @page_free: free page callback when page refcount reach 1
>> >   * @altmap: pre-allocated/reserved memory for vmemmap allocations
>> >   * @res: physical address range covered by @ref
>> >   * @ref: reference count that pins the devm_memremap_pages() mapping
>> >   * @dev: host device of the mapping for debug
>> >   * @data: privata data pointer for page_free
>> > + * @flags: device memory flags (look for MEMORY_DEVICE_* memory_hotplug.h)
>> >   */
>> >  struct dev_pagemap {
>> > +       dev_page_fault_t page_fault;
>> >         dev_page_free_t page_free;
>> >         struct vmem_altmap *altmap;
>> >         const struct resource *res;
>> >         struct percpu_ref *ref;
>> >         struct device *dev;
>> >         void *data;
>> > +       int flags;
>> >  };
>>
>> dev_pagemap is only meant for get_user_pages() to do lookups of ptes
>> with _PAGE_DEVMAP and take a reference against the hosting device..
>
> And i want to build on top of that to extend _PAGE_DEVMAP to support
> a new usecase for unaddressable device memory.
>
>>
>> Why can't HMM use the typical vm_operations_struct fault path and push
>> more of these details to a driver rather than the core?
>
> Because the vm_operations_struct has nothing to do with the device.
> We are talking about regular vma here. Think malloc, mmap, share
> memory, ...  not about mmap(/dev/thedevice,...)
>
> So the vm_operations_struct is never under device control and we can
> not, nor want to, rely on that.

Can you explain more what's behind that "can not, nor want to"
statement? It seems to me that any awkwardness of moving to a
standalone device file interface is less than a maintaining a new /
parallel mm fault path through dev_pagemap.

>
> So what we looking for here is struct page that can behave mostly
> like anyother except that we do not want to allow GUP to take a
> reference almost exactly what ZONE_DEVICE already provide.
>
> So do you have any fundamental objections to this patchset ? And if
> so, how do you propose i solve the problem i am trying to address ?
> Because hardware exist today and without something like HMM we will
> not be able to support such hardware.

My pushback stems from it being a completely different use case for
devm_memremap_pages(), as evidenced by it growing from 4 arguments to
9, and the ongoing maintenance overhead of understanding HMM
requirements when updating the pmem usage of ZONE_DEVICE.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href



[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]