Re: [RFC PATCH 3/3] mm/migrate: Create move_phys_pages syscall

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]


On Fri, Sep 8, 2023, at 09:46, Gregory Price wrote:
> On Sat, Sep 09, 2023 at 10:03:33AM +0200, Arnd Bergmann wrote:
>> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> > index 22bc6bc147f8..6860675a942f 100644
>> > --- a/include/linux/syscalls.h
>> > +++ b/include/linux/syscalls.h
>> > @@ -821,6 +821,11 @@ asmlinkage long sys_move_pages(pid_t pid, unsigned 
>> > long nr_pages,
>> >  				const int __user *nodes,
>> >  				int __user *status,
>> >  				int flags);
>> > +asmlinkage long sys_move_phys_pages(unsigned long nr_pages,
>> > +				    const void __user * __user *pages,
>> > +				    const int __user *nodes,
>> > +				    int __user *status,
>> > +				    int flags);
>> The prototype looks good from a portability point of view,
>> i.e. I made sure this is portable across CONFIG_COMPAT
>> architectures etc.
>> What I'm not sure about is whether the representation of physical
>> memory pages as a 'void *' is a good choice, I can see potential
>> problems with this:
>> - it's not really a pointer, but instead a shifted PFN number
>>   in the implementation
>> - physical addresses may be wider than pointers on 32-bit
>>   architectures with CONFIG_PHYS_ADDR_T_64BIT
> Hm, good points.
> I tried to keep the code close to move_pages for the sake of
> familiarity and ease of review, but the physical address length
> is not something i'd considered, and I'm not sure how exactly
> we would handle CONFIG_PHYS_ADDR_T_64BIT.  If you have an idea,
> I'm happy to run with it.

I think a pointer to '__u64' is the most appropriate here,
that is compatible between 32-bit and 64-bit architectures
and covers all addresses until we get architectures with
128-bit addressing.

Thinking about it more, I noticed an existing bug in
both sys_move_pages and your current version of
sys_move_phys_pages: the 'pages' array is in fact not
suitable for compat tasks and needs an is_compat_task
check to load a 32-bit address from compat userspace on
the "if (get_user(p, pages + i))" line.

> on address vs pfn:
> Would using PFNs cause issues with portability of userland code? User
> code that gets access to a physical address would have to convert
> that to a PFN, which would be kernel-defined.  That could be easy
> enough if the kernel exposed the shift size somewhere.

I don't think we currently use PFN anywhere in the syscall
ABI, so this introduces a new basic type into uapi headers that
is currently kernel internal.

A 32-bit PFN is always sufficient to represent all physical
addresses on native 32-bit kernel, but not necessarily for
compat user space on 64-bit kernels with an address space wider
than 44 bits (16 terabyte address range).

For the interface it would also require the same quirk
for compat tasks that I pointed out above that is missing
in sys_move_pages today.

A minor quirk of PFN values is that they require knowing
the page size to convert addresses, but I suppose you need
that anyway.

>> I'm also not sure where the user space caller gets the
>> physical addresses from, are those not intentionally kept
>> hidden from userspace?
> There are presently 4 places (that I know of), and 1 that is being
> proposed here in the near future
> 1) Generally: /proc/pid/pagemap can be used to do page table
>      translations.  I think this is only really useful for testing, 
>      since if you have the virtual address and pid you would use
>      move_pages, but it's certainly useful for testing this.
> 2) X86:  IBS (AMD) and PEBS (Intel) can be configured to return physical
>      address information instead of virtual address information. In fact
>      you can configure it to give you both the virtual and physical
>      address for a process.

Ah right. I see these already require CAP_SYS_ADMIN permissions
because of the risk of rowhammer or speculative execution attacks,
so I suppose users of move_phys_pages() would need additional
permissions compared to move_pages() to actually use that information.

> 3) zoneinfo:  /proc/zoneinfo exposes the start PFN of zones
> 4) /sys/kernel/mm/page_idle:  A way to query whether a PFN is idle.
>      While itself seemingly not useful, if the goal is to migrate
>      less-used pages to "lower tiers" of memory, then you can query
>      the bitmap, directly recover idle PFNs, and attempt to migrate
>      them (which may fail, for a variety of reasons).
> 5) CXL (Proposed): a CXL memory device on the PCIe bus may provide
>      hot/cold information about its memory.  If a heatmap is provided,
>      for example, it would have to be a device address (0-based) or a
>      physical address (some base defined by the kernel and programmed
>      into device decoders such that it can convert them to 0-based).
>      This is presently being proposed but has not been agreed upon yet.

These do not seem to be problematic from the ASLR perspective, so
I guess it may still be useful without CAP_SYS_ADMIN.


[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux