Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]


On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:
> On Thu, Sep 7, 2023, at 09:54, Gregory Price wrote:
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> > b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 2d0b1bd866ea..25db6d71af0c 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -457,3 +457,4 @@
> >  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
> >  451	i386	cachestat		sys_cachestat
> >  452	i386	fchmodat2		sys_fchmodat2
> > +454	i386	move_phys_pages		sys_move_phys_pages
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> > b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 1d6eee30eceb..9676f2e7698c 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -375,6 +375,7 @@
> >  451	common	cachestat		sys_cachestat
> >  452	common	fchmodat2		sys_fchmodat2
> >  453	64	map_shadow_stack	sys_map_shadow_stack
> > +454	common	move_phys_pages		sys_move_phys_pages
> Doing only x86 is fine for review purposes, but note that
> once there is consensus on actually merging it and the syscall
> number, you should do the same for all architectures. I think
> most commonly we do one patch to add the code and another
> patch to hook it up to all the syscall.tbl files and the
> include/uapi/asm-generic/unistd.h file.

I'd looked through a few prior patches and that's what I observed so I
tried to follow that.  For the RFC i think it only made sense to
integrate it with the system i'm actually testing on, but I'll
eventually need to test it on ARM and such.


> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 22bc6bc147f8..6860675a942f 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned 
> > long nr_pages,
> >  				const int __user *nodes,
> >  				int __user *status,
> >  				int flags);
> > +asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
> > +				    const void __user * __user *pages,
> > +				    const int __user *nodes,
> > +				    int __user *status,
> > +				    int flags);
> The prototype looks good from a portability point of view,
> i.e. I made sure this is portable across CONFIG_COMPAT
> architectures etc.
> What I'm not sure about is whether the representation of physical
> memory pages as a 'void *' is a good choice, I can see potential
> problems with this:
> - it's not really a pointer, but instead a shifted PFN number
>   in the implementation
> - physical addresses may be wider than pointers on 32-bit
>   architectures with CONFIG_PHYS_ADDR_T_64BIT

Hm, good points.

I tried to keep the code close to move_pages for the sake of
familiarity and ease of review, but the physical address length
is not something i'd considered, and I'm not sure how exactly
we would handle CONFIG_PHYS_ADDR_T_64BIT.  If you have an idea,
I'm happy to run with it.

on address vs pfn:

Would using PFNs cause issues with portability of userland code? User
code that gets access to a physical address would have to convert
that to a PFN, which would be kernel-defined.  That could be easy
enough if the kernel exposed the shift size somewhere.

I can see arguments for PFN as opposed to physical address for
portability sake.  This doesn't handle the 64-bit phys address
configuration issue, of course.

In the case where a user already has a PFN (e.g. via mm/page_idle),
defining it as a PFN interface would certainly make more sense.

Mayhaps handling both cases would be useful, depending on the source
information (see below for more details).

> I'm also not sure where the user space caller gets the
> physical addresses from, are those not intentionally kept
> hidden from userspace?
>      Arnd

There are presently 4 places (that I know of), and 1 that is being
proposed here in the near future

1) Generally: /proc/pid/pagemap can be used to do page table
     translations.  I think this is only really useful for testing, 
     since if you have the virtual address and pid you would use
     move_pages, but it's certainly useful for testing this.

2) X86:  IBS (AMD) and PEBS (Intel) can be configured to return physical
     address information instead of virtual address information. In fact
     you can configure it to give you both the virtual and physical
     address for a process.

3) zoneinfo:  /proc/zoneinfo exposes the start PFN of zones

4) /sys/kernel/mm/page_idle:  A way to query whether a PFN is idle.
     While itself seemingly not useful, if the goal is to migrate
     less-used pages to "lower tiers" of memory, then you can query
     the bitmap, directly recover idle PFNs, and attempt to migrate
     them (which may fail, for a variety of reasons).

5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
     hot/cold information about its memory.  If a heatmap is provided,
     for example, it would have to be a device address (0-based) or a
     physical address (some base defined by the kernel and programmed
     into device decoders such that it can convert them to 0-based).

     This is presently being proposed but has not been agreed upon yet.


[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux