Re: [LSF/MM/BPF TOPIC] Single Owner Memory

Pasha Tatashin <pasha.tatashin@xxxxxxxxxx> · Tue, 21 Feb 2023 12:20:38 -0500

On Tue, Feb 21, 2023 at 10:55 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Feb 20, 2023 at 02:10:24PM -0500, Pasha Tatashin wrote:
> > The discussion should include the following topics:
> > - Interaction with folio and the proposed struct page {memdesc}.
> > - Handling for migrate_pages() and friends.
> > - Handling for FOLL_PIN and FOLL_LONGTERM.
> > - What types of madvise() properties SOM memory should handle.
>
> Something I didn't see covered was how you'd want to handle memory
> pressure.  The answer for memdescs is that we'd treat each userspace

Indeed, this is something that should be covered. I have a few thoughts
about it, but they need more work.
Some possibilities:

1. Under memory pressure, we can migrate SOM pages to normal memory,
which would make that memory swappable, etc.
2. Teach in-memory compression backends such as zswap/zram to work
directly with /dev/som.

> allocation as a single object; if you allocate a 256kB folio, that has
> one accessed bit (set every time any of the PTEs which reference that
> folio is accessed), one dirty bit, is aged on the LRU as a single unit
> and will be written to swap as a single unit.
>
> Assuming we're dealing with objects smaller than PMDs, we have a number
> of PTEs each of which has its own A and D bits, so we can determine
> at each revolution of the LRU clock whether it still makes sense to be
> treating the folio as a single unit, or whether pages in the first half
> of the folio are no longer being accessed and we should split the folio
> in half and age the two halves separately.

Interesting.
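
To check my reading of the heuristic above, here is a toy userspace
model in C. It is only a sketch: every toy_* name is hypothetical and
nothing here is kernel API. It assumes one accessed flag per PTE, an
even number of PTEs per folio (e.g. 64 for a 256kB folio with 4kB
pages), and that a split is considered whenever exactly one half was
touched since the previous revolution of the LRU clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names exist in the kernel. */
enum toy_age_action {
	TOY_KEEP_WHOLE,	/* both halves touched: keep aging as one unit */
	TOY_SPLIT,	/* only one half touched: age the halves separately */
	TOY_AGE_WHOLE,	/* nothing touched: the whole folio is cold */
};

/* Return true if any per-PTE accessed flag in [start, end) is set. */
static bool toy_any_accessed(const bool *accessed, size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		if (accessed[i])
			return true;
	return false;
}

/* Model one revolution of the LRU clock for a folio of nr_pages PTEs. */
static enum toy_age_action toy_age_folio(bool *accessed, size_t nr_pages)
{
	/* Assumption: the folio can be cut into two equal halves. */
	assert(nr_pages >= 2 && nr_pages % 2 == 0);

	size_t half = nr_pages / 2;
	bool lo = toy_any_accessed(accessed, 0, half);
	bool hi = toy_any_accessed(accessed, half, nr_pages);

	/* Clear the flags so the next revolution sees only new accesses. */
	for (size_t i = 0; i < nr_pages; i++)
		accessed[i] = false;

	if (lo && hi)
		return TOY_KEEP_WHOLE;
	if (lo || hi)
		return TOY_SPLIT;
	return TOY_AGE_WHOLE;
}
```

Calling toy_age_folio() once per clock revolution on the same array
models the repeated decision: a folio touched in both halves stays
whole, and one touched in only a single half becomes a split candidate.
A real implementation would presumably need hysteresis so that a single
quiet interval does not trigger a split.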

>
> All of that is still theoretical; we don't allocate anon memory in sizes
> other than PAGE_SIZE and PMD size.  And we don't track page cache A and
> D bits to see whether the decision to allocate a particular page size was
> the right one (most page cache memory is never mapped into userspace, so
> it might be of limited value, but I'm sure we could track the equivalent
> information with read() and write()).

Thanks,
Pasha