Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

Gregory Price <gregory.price@xxxxxxxxxxxx> · Tue, 19 Sep 2023 11:57:34 -0400

On Tue, Sep 19, 2023 at 02:17:15AM +0200, Thomas Gleixner wrote:
> On Thu, Sep 07 2023 at 03:54, Gregory Price wrote:
> > Similar to the move_pages system call, instead of taking a pid and
> > list of virtual addresses, this system call takes a list of physical
> > addresses.
> 
> Silly question. Where are these physical addresses coming from?
> 
> In my naive understanding user space deals with virtual addresses for a
> reason.
> 
> Exposing access to physical addresses is definitely helpful to write
> more powerful exploits, so what are the restriction applied to this?
> 

There are a variety of sources from which userspace can acquire physical
addresses, most require SYS_CAP_ADMIN.

I should probably add this list explicitly to the RFC for clarification:

1) /proc/pid/pagemap: can be used to do page table translation.
     This is only really useful for testing, and it requires
     CAP_SYS_ADMIN.

2) X86:  IBS (AMD) and PEBS (Intel) can be configured to return
     physical, virtual, or both physical and virtual adressing.
     This requires CAP_SYS_ADMIN.

3) zoneinfo:  /proc/zoneinfo exposes the start PFN of zones. Not largely
     useful in this context, but noteably it does expose a PFN does not
     require privilege.

4) /sys/kernel/mm/page_idle:  A way to query whether a PFN is idle.
     One tiering strategy can be to use idle information to move
     less-used pages to "lower tiers" of memory.

     With page_idle, you can query page_idle migrate based on PFN,
     without the need to know which process it belongs to.  Obviously
     migrations must still respect memcg restrictions, which is why
     iterating through each mapping VMA is required.

     https://docs.kernel.org/admin-guide/mm/idle_page_tracking.html

5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
     hot/cold information about its memory.  If a heatmap is provided,
     for example, it would have to be a device address (0-based) or a
     physical address (some base defined by the kernel and programmed
     into device decoders such that it can convert them to 0-based).

     This is presently being proposed but has not been agreed upon yet.

> > +	if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
> > +		return -EINVAL;
> > +
> > +	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
> > +		return -EPERM;
> 
> According to this logic here MPOL_MF_MOVE is unrestricted, right?
> 
> But how is an unpriviledged process knowing which physical address the
> pages have? Confused....
> 

Someone else pointed this out as well, I should have a SYS_CAP_ADMIN
check here that i neglected to add, I have a v2 of the RFC with a test
and fixes, and I am working on a sample man page.

This entire syscall should require SYS_CAP_ADMIN, though there may be an
argument for CAP_SYS_NICE.  I personally am of the opinion that ADMIN is
the right choice, because you're right that operations on physical
addresses are a major security issue.

> > +	/* All tasks mapping each page is checked in phys_page_migratable */
> > +	nodes_setall(target_nodes);
> 
> How is the comment related to nodes_setall() and why is nodes_setall()
> unconditional when target_nodes is only used in the @nodes != NULL case?
> 

Yeah this comment is confusing.  This is a bit of a wart that comes from
trying to re-use the existing do_pages_move to ensure correctness, and
I really should have added this information directly in the comments.

The relevant code is here:

static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
                         unsigned long nr_pages,
                         const void __user * __user *pages,
                         const int __user *nodes,
                         int __user *status, int flags)
{
...
	err = -EACCES;
	if (!node_isset(node, task_nodes))
		goto out_flush;
...
	if (mm)
		err = add_virt_page_for_migration(mm, p, current_node, &pagelist,
					     flags & MPOL_MF_MOVE_ALL);
	else
		err = add_phys_page_for_migration(p, current_node, &pagelist,
				flags & MPOL_MF_MOVE_ALL);

do_pages_move was originally written to check that the memcg contained
the node as a valid destination for the entire request. This nodemask
is aquired from find_mm_struct in a call to cpuset_mems_allowed(task).

e.g. if memcg.valid_nodes=0,1 and you attempt a move_pages(task, 2), then
the entire syscall will error out with -EACCES.

The first chunk above checks this condition.

The second chunk (add_virt/phys...), these calls check that individual
pages are actually migratable, and if so removes them from the LRU and
places them onto the list of pages to migrate in-bulk at a later time.

In the physical addressing path, there is no pid/task provided by the
user, because physical addresses are not associated directly with task.
We must look up each VMA, and each owner of that VMA, and validate
the intersection of the allowed nodes defined by that set contain the
requested node.

This is implemented here:

static int add_phys_page_for_migration(const void __user *p, int node,
                                       struct list_head *pagelist,
                                       bool migrate_all)
...
        pfn = ((unsigned long)p) >> PAGE_SHIFT;
        page = pfn_to_online_page(pfn);
        if (!page || PageTail(page))
                return -ENOENT;

        folio = phys_migrate_get_folio(page);
        if (folio) {
                rmap_walk(folio, &rwc); <---- per-vma walk
                folio_put(folio);
        }

static bool phys_page_migratable(struct folio *folio,
                                 struct vm_area_struct *vma,
                                 unsigned long address,
                                 void *arg)
{
        struct rmap_page_ctxt *ctxt = (struct rmap_page_ctxt *)arg;
        struct task_struct *owner = vma->vm_mm->owner;
        nodemask_t task_nodes = cpuset_mems_allowed(owner);
... the actual node check here ...
        ctxt->node_allowed &= node_isset(ctxt->node, task_nodes);

I will add some better comments that lay this out.

~Gregory