Re: [LSF/MM/BPF TOPIC] Non-lru page migration in a memdesc world

On Tue, Jan 07, 2025 at 12:27:57PM -0500, Zi Yan wrote:
> On 7 Jan 2025, at 11:55, David Hildenbrand wrote:
> 
> > On 07.01.25 17:48, Zi Yan wrote:
> >> On 7 Jan 2025, at 11:11, David Hildenbrand wrote:
> >>
> >>> Hi,
> >>>
> >>> one item on my todo list is making PageOffline pages stop using "struct page" members except page->type and 1/2 flags, to prepare them for the memdesc future, to avoid unnecessary atomics, and to resolve some (so-far) theoretical issues with temporary speculative references.
> >>>
> >>> For example, the page->_refcount will always be 0 (frozen) for PageOffline pages, and they will get allocated/freed similar to how we allocate/free frozen pages for slab already. Once we move the refcount into "struct folio", they will not have a refcount at all anymore.
> >>>
> >>> One complication is balloon compaction: we allow for migrating PageOffline pages allocated in some memory ballooning implementations such as virtio-balloon.
> >>>
> >>> For that, we use the "non-lru page migration" framework and in that process we make use of ... way too many members of "struct page"/"struct folio" and rely on the refcount not being 0. For example, we certainly don't want to allocate memdescs for PageOffline pages just so some of them can be migrated.
> >>
> >> Then the first thing is to make all get_new_folio functions aware of PageOffline
> >> pages and able to allocate a PageOffline page. IIUC, the current process
> >> is: 1) allocate a page from the buddy allocator, 2) offline the new page during
> >> mops->migrate_page() and online the old page. The inflation and deflation
> >> in step 2 look redundant if migrate_pages() can get PageOffline pages to
> >> begin with and put_page() can handle PageOffline pages too.
> >
> > That might be one hacky way of handling offline pages, yes :)
> >
> > (the isolation step is tricky: for example, with page->lru gone we cannot even put these things into a list! Also, there is page isolation ...)
> >
> > I recall that the isolation step is required because we could have multiple parties trying to migrate the same page at the same time. So that must be handled as well.
> 
> OK, since page->lru is gone, migrate_pages() might not be suitable for these
> pages, unless we want to rewrite migrate_pages(), which might be desirable. :)
> Then, we could record PFNs instead, like what migrate_vma*() does, but I have
> not checked migrate_vma*() in enough detail to judge the feasibility yet.

migrate_vma_*() (and migrate_device_range) require folios, but not page->lru as
they are designed to work with both normal LRU pages and ZONE_DEVICE pages which
don't have page->lru because it is used for something else.

To me the primary difference between migrate_vma_*() and migrate_pages() is
that the latter requires LRU pages. It has long annoyed me that much of the
migrate_pages() logic is duplicated in migrate_vma_*() simply because the former
requires list_heads whilst the latter needs an array of PFNs to deal with a lack
of page->lru. It seems like it should be possible to converge these two code
paths.
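
As a rough illustration of the PFN-array style of bookkeeping described above,
here is a minimal userspace sketch. It assumes each array slot holds a PFN
shifted up to make room for a few state bits. All SKETCH_* and sketch_* names,
and the bit layout, are hypothetical stand-ins chosen for this example; they are
not the kernel's definitions. The point is only that no list_head inside the
page is needed to track which pages are candidates for migration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: state bits in the low bits, PFN above them. */
#define SKETCH_PFN_VALID   (UINT64_C(1) << 0)
#define SKETCH_PFN_MIGRATE (UINT64_C(1) << 1)
#define SKETCH_PFN_SHIFT   6

/* Pack a PFN into an array slot and mark the slot as valid. */
static uint64_t sketch_pfn_encode(uint64_t pfn)
{
	/* The PFN must still fit after being shifted up. */
	assert(pfn < (UINT64_C(1) << (64 - SKETCH_PFN_SHIFT)));
	return (pfn << SKETCH_PFN_SHIFT) | SKETCH_PFN_VALID;
}

/* Recover the PFN from an array slot. */
static uint64_t sketch_pfn_decode(uint64_t entry)
{
	return entry >> SKETCH_PFN_SHIFT;
}

/*
 * Mark every valid slot as selected for migration and return how many
 * were selected. Empty slots (no VALID bit) are skipped, so holes in the
 * range cost nothing beyond the slot itself.
 */
static size_t sketch_select_all(uint64_t *src, size_t npages)
{
	size_t selected = 0;

	for (size_t i = 0; i < npages; i++) {
		if (!(src[i] & SKETCH_PFN_VALID))
			continue;
		src[i] |= SKETCH_PFN_MIGRATE;
		selected++;
	}
	return selected;
}
```

A caller would fill one slot per base page in the range, call
sketch_select_all(), and then walk the array again to copy and remap only the
slots that still carry the MIGRATE bit.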

> In terms of isolation, we can use PageIsolated flag and make sure it is
> in the remaining 1/2 flags. This flag can be used for other non-folio things
> too.

My understanding of the isolation step was that it was required to ensure
page->lru could be reused by the caller of folio_isolate_lru(), not specifically
to deal directly with multiple parties trying to migrate the same page. Although
perhaps we are saying the same thing if migration is the only time the page->lru
list_heads are used for putting pages on non-LRU lists.

Multiple parties migrating the same page is dealt with by reference count
checks, i.e. if the reference count doesn't match the "expected" value we assume
some other party is migrating it, pinning it, etc. and fail the migration.
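
The reference count check described above can be sketched in userspace C with
C11 atomics. This is a simplified model, not kernel code: struct sketch_page
and the sketch_* functions are hypothetical names, and the "freeze" is modelled
as a single compare-and-swap from the expected count to 0, similar in spirit to
how the kernel freezes a reference count before migrating.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page with a reference count. */
struct sketch_page {
	atomic_int refcount;
};

/*
 * Atomically replace the reference count with 0, but only if it currently
 * equals the expected value. A mismatch means someone else holds an extra
 * reference (another migrator, a pin, ...), so the caller must back off.
 */
static bool sketch_ref_freeze(struct sketch_page *page, int expected)
{
	assert(expected > 0);
	return atomic_compare_exchange_strong(&page->refcount, &expected, 0);
}

/*
 * Try to migrate src to dst. Returns 0 on success, -1 if the reference
 * count did not match and the migration was abandoned.
 */
static int sketch_try_migrate(struct sketch_page *src,
			      struct sketch_page *dst, int expected)
{
	if (!sketch_ref_freeze(src, expected))
		return -1;

	/* ... copy contents and metadata from src to dst here ... */

	/* The destination takes over the references; src stays frozen at 0. */
	atomic_store(&dst->refcount, expected);
	return 0;
}
```

With two concurrent callers, only one compare-and-swap can succeed, so the
loser sees a mismatch and fails its migration attempt. Note that this scheme
depends on the source having a non-zero reference count to begin with.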

> >
> >>
> >>>
> >>> While we converted non-lru page migration to work on folios (i.e., folio_movable_ops()) these things are not actually "folios" in the future, they can have different memdescs.
> >>>
> >>> So, how can we migrate non-lru things that are not folios while not relying on "struct folio" members, with minimal/no metadata overhead?
> >>
> >> Like I said above, if migrate_pages() is aware of PageOffline pages by allocating
> >> and putting them like normal folios, that could work.
> >>
> >> Or you can do what hugetlb migration does, adding a separate migrate_offlinepages()
> >> function to handle PageOffline pages. This probably can save you a lot of
> >> LRU page checks like mapping and locks, but it adds a special function. So
> >> tradeoffs.
> >>
> >>>
> >>> I have some ideas, but no complete solution yet; input about the requirements of other non-lru page migration use cases besides PageOffline will be interesting.
> >>>
> >>> ... and maybe, we have other non-folio things we'd want to migrate, and want to be prepared to handle them as well? (hint: leaf page tables?)
> >>
> >> If we have a dedicated allocator for non-folio things and make migrate_pages()
> >> aware of them, it should be doable.
> >
> > Note that I thought about similar things as you describe above, but part of the exercise will not be focusing on PageOffline pages, but having something more generic that can handle pages with actual page content, and that have to be properly isolated :)
> 
> Sure. IMHO, we will need dedicated allocation and free functions for
> these non-folio things, a PageIsolated flag for isolation, and a dedicated
> code path in migrate_pages() or migrate_vma*().

If you don't have page->lru I think it would make more sense to extend
migrate_vma_*() as that already allows migration of non-LRU pages and allows
page content to be copied.

I'm hoping to extend that in the near(ish) future to support large non-LRU
folios (i.e. for (m)THP and THP). Part of the difficulty there is figuring out
what the API should look like, as an array of PAGE_SIZE PFNs does not really
scale.
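
To put rough numbers on the scaling concern, here is a small sketch of the
arithmetic. It assumes 4 KiB base pages and one 8-byte slot per base page;
both are assumptions for illustration, and the sketch_* helpers are
hypothetical. Under those assumptions a 2 MiB folio needs 512 slots and a
1 GiB range needs 262144 slots, which is 2 MiB of memory per array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of per-base-page slots needed to describe a region. */
static uint64_t sketch_slots_needed(uint64_t region_bytes, uint64_t page_bytes)
{
	/* Only whole base pages are modelled here. */
	assert(page_bytes != 0 && region_bytes % page_bytes == 0);
	return region_bytes / page_bytes;
}

/* Memory consumed by one array of 8-byte slots covering the region. */
static uint64_t sketch_array_bytes(uint64_t region_bytes, uint64_t page_bytes)
{
	return sketch_slots_needed(region_bytes, page_bytes) * sizeof(uint64_t);
}
```

An interface that described each large folio with a single entry would need
one slot instead of 512 for the 2 MiB case, which is the kind of saving a
different API shape would be after.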

 - Alistair

> Best Regards,
> Yan, Zi
> 



