Add DavidH and OscarS for memory hot-remove questions.
IIUC, struct page could be freed if a chunk of memory is hot-removed.
Right, but only after there are no users anymore (IOW, memory was freed
back to the buddy). PFN walkers might still stumble over them, but I
would not expect (or recommend) Rust to do that.
The physaddr to page function does look up pages by pfn, but it's
intended to be used by drivers that know what they're doing. There are
two variants of the API, one that is completely unchecked (a fast path
for cases where the driver definitely allocated these pages itself, for
example just grabbing the `struct page` back from a decoded PTE it
wrote), and one that has this check:
pfn_valid(pfn) && page_is_ram(pfn)
Which is intended as a safety net to allow drivers to look up
firmware-reserved pages too, and fail gracefully if the kernel doesn't
know about them (because they weren't declared in the
bootloader/firmware memory map correctly) or doesn't have them mapped in
the direct map (because they were declared no-map).
> > Is there anything else that can reasonably be done here to make the API
> > safer to call on an arbitrary pfn?
In PFN walkers we use pfn_to_online_page() to make sure that (1) the
memmap really exists; and (2) that it has meaning (e.g., was actually
initialized).
It can still race with memory offlining, and it refuses ZONE_DEVICE
pages. For the latter, we have a different way to check validity: see
memory_failure(), which first calls pfn_to_online_page() and then falls
back to checking get_dev_pagemap().
If the answer is "no" then that's fine. It's still an unsafe function
and we need to document in the safety section that it should only be
used for memory that is either known to be allocated and pinned and will
not be freed while the `struct page` is borrowed, or memory that is
reserved and not owned by the buddy allocator, so in practice correct
use would not be racy with memory hot-remove anyway.
This is already the case for the drm/asahi use case, where the pfns
looked up will only ever be one of:
- GEM objects that are mapped to the GPU and whose physical pages are
therefore pinned (and the VM is locked while this happens so the objects
cannot become unpinned out from under the running code),
How exactly are these pages pinned/obtained?
- Raw pages allocated from the page allocator for use as GPU page tables,
That makes sense.
- System memory that is marked reserved by the firmware/bootloader,
E.g., in arch/x86/mm/ioremap.c:__ioremap_check_ram() we refuse anything
that has a valid memmap and is *not* marked as PageReserved, to prevent
remapping arbitrary *real* RAM.
Is that case here similar?
- (Potentially) invalid PFNs that aren't part of the System RAM region
at all and don't have a struct page to begin with, which we check for,
so the API returns an error. This would only happen if the bootloader
didn't declare some used firmware ranges at all, so Linux doesn't know
about them.
Another case where struct page can be freed is when the hugetlb vmemmap
optimization is used. Muchun (cc'd) is the maintainer of hugetlbfs.
Here, the "struct page" remains valid though; it can still be accessed,
although we disallow writes (which would be wrong).
If you only allocate a page and free it later, there is no need to worry
about either on the Rust side.
This is what the safe API does. (Also the unsafe physaddr APIs if all
you ever do is convert an allocated page to a physaddr and back, which
is the only thing the GPU page table code does during normal use. The
walking leaf PFNs story is only for GPU device coredumps when the
firmware crashes.)
I would hope that we can lock down this interface as much as possible.
Ideally, we would never go from pfn->page, unless
(a) we remember somehow that we came from page->pfn. E.g., we allocated
these pages or someone else provided us with these pages. The memmap
cannot go away. I know it's hard.
(b) the pages are flagged as being special, similar to
__ioremap_check_ram().
--
Cheers,
David / dhildenb