Sorry for the late reply, it's been a couple of crazy weeks, and I'm
trying to give at least some feedback on stuff in my inbox before even
more will pile up over Christmas :). Let me summarize my thoughts:
My turn for the lateness - back from a break.
I should also preface that Mike is off for at least a month more, but
he will return to continue working on this. In the meantime, I've had
a chat with him about this work to keep the discussion alive on the
lists.
So now it's my turn to be late again ;) As promised during the last
call, a few points from my side.
THPs in Linux rely on the following principle:
(1) We try allocating a THP; if that fails, we rely on khugepaged to fix
    it up later (shmem+anon). So if we cannot grab a free THP, we defer
    it to a later point.
(2) We try to be as transparent as possible: punching a hole will
    usually destroy the THP (either immediately for shmem/pagecache or
    deferred for anon memory) to free up the now-free pages. That's
    different to hugetlb, where partial hole-punching will only zero out
    the memory; the partial memory will not get freed up and will get
    reused later. (See the user-space sketch right after this list.)

    Destroying a THP for shmem/pagecache only works if there are no
    unexpected page references, so there can be cases where we fail to
    free up memory. For the pagecache that's not really an issue,
    because memory reclaim will fix that up at some point. For shmem,
    there were discussions about scanning for zeroed pages and freeing
    them up during memory reclaim, just like we now do for anon memory
    as well.
(3) Memory compaction is vital for guaranteeing that we will be able to
    create THPs the longer the system has been running.
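
As an aside, the shmem vs. hugetlb behavior described in (2) is easy to
observe from user space. A minimal, untested sketch (error handling
omitted; assumes memfd_create() and fallocate() are available):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t size = 4UL << 20;                /* 4 MiB */
        int fd = memfd_create("thp-demo", 0);   /* shmem-backed, THP possible */
        char *p;

        ftruncate(fd, size);
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        memset(p, 0xaa, size);                  /* populate, possibly with THPs */

        /*
         * Punch a 1 MiB hole in the middle: on shmem this frees the pages
         * (splitting a covering THP if needed); on a hugetlb-backed fd the
         * partially punched huge page would only be zeroed, not freed.
         */
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  1UL << 20, 1UL << 20);

        munmap(p, size);
        close(fd);
        return 0;
}

With shmem THPs enabled, watching ShmemHugePages in /proc/meminfo before
and after the punch should show the THP getting split up and freed.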
With guest_memfd we cannot rely on any daemon to fix it up for us later
as in (1) (that would require page migration support).
True. And not having a huge page when requested to begin with (as in (1)
above) defeats the purpose entirely -- the point is to speed up SEV-SNP
setup and guests by having fewer pages to work with.
Right.
We use truncate_inode_pages_range(), which will split a THP into small
pages if you partially punch-hole it, so (2) would apply; splitting
might fail as well in some cases if there are unexpected references.

I wonder what would happen if user space punched a hole in private
memory, making truncate_inode_pages_range() overwrite it with 0s if
splitting the THP failed (a memory write to private pages under TDX?).
Maybe something similar would happen if a private page got zeroed out
when freeing+reallocating it; not sure how that is handled.
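
To make that concern concrete, this is roughly the partial hole-punch
flow as I understand it -- a simplified sketch, not the literal
mm/truncate.c code: the punched sub-range gets zeroed first, and the
split is only attempted afterwards, so a failed split leaves a
zeroed-but-still-allocated huge folio behind.

#include <linux/highmem.h>
#include <linux/huge_mm.h>
#include <linux/mm.h>

/* Illustrative only; assumes the caller holds the folio lock. */
static bool partial_punch_folio(struct folio *folio, size_t offset,
                                size_t length)
{
        /*
         * The punched sub-range is zeroed unconditionally -- this is the
         * memory write to (potentially private) pages mentioned above.
         */
        folio_zero_range(folio, offset, length);

        if (!folio_test_large(folio))
                return true;

        /*
         * Try to split so the now-zeroed pages can actually be freed.
         * With unexpected references the split fails and the whole huge
         * folio stays allocated, merely zeroed.
         */
        return split_folio(folio) == 0;
}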
guest_memfd currently actively works against (3) as soon as we (A) fall
back to allocating small pages or (B) split a THP due to hole punching,
as the remaining fragments cannot get reassembled anymore.

I assume there is some truth to "hole-punching is a userspace policy",
but this mechanism will actively work against itself as soon as you
start falling back to small pages in any way.
So I'm wondering if a better start would be to (A) always allocate huge
pages from the buddy (no fallback) and
that sounds fine..
(B) partial punches are either
disallowed or only zero-out the memory. But even a sequence of
partial
punches that cover the whole huge page will not end up freeing all
parts
if splitting failed at some point, which I quite dislike ...
... this basically just looks like hugetlb support (i.e. without the
"transparent" part), isn't it?
Yes, just using a different allocator until we have a predictable
allocator with reserves.
Note that I am not sure how much "transparent" here really applies,
given the differences to THPs ...
But then we'd need memory preallocation, and I suspect to make this
really useful -- just like with 2M/1G "hugetlb" support -- in-place
shared<->private conversion will be a requirement. ... at which point
we'd have reached the state where it's almost the 2M hugetlb support.
Right, exactly.
This is not a very strong pushback, more a "this does not quite sound
right to me", and I have the feeling that this might get in the way of
in-place shared<->private conversion; I might be wrong about the latter
though.
As discussed in the last bi-weekly MM meeting (and in contrast to what I
assumed), Vishal was right: we should be able to support in-place
shared<->private conversion as long as we can split a large folio when
any page of it is getting converted to shared.
(split is possible if there are no unexpected folio references; private
pages cannot be GUP'ed, so it is feasible)
So similar to the hugetlb work, that split would happen and would be a
bit "easier", because ordinary folios (in contrast to hugetlb) are
prepared to be split.
So supporting larger folios for private memory might not make in-place
conversion significantly harder; the important part is that shared
folios may only be small.
The split would just mean that we start exposing individual small folios
to the core-mm, not that we would allow page migration for the shared
parts etc. So the "whole 2M chunk" will remain allocated to guest_memfd.
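
A minimal sketch of what that conversion-time split could look like
(gmem_split_for_shared_conversion() is a made-up helper, not existing
guest_memfd code; assumes the caller holds the folio lock):

#include <linux/huge_mm.h>
#include <linux/mm.h>

/*
 * Hypothetical: called when any page of a large private folio is being
 * converted to shared, so that only small folios ever become shared.
 */
static int gmem_split_for_shared_conversion(struct folio *folio)
{
        if (!folio_test_large(folio))
                return 0;

        /*
         * Private pages cannot be GUP'ed, so unexpected references should
         * be rare; if any exist, split_folio() fails and the conversion
         * has to be retried or rejected.
         */
        return split_folio(folio);
}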
TBH my 2c is that getting hugepage support in place, and disabling THP
for SEV-SNP guests, will work fine.
Likely it will not be that easy as soon as hugetlb reserves etc. will
come into play.
But as Mike mentioned above, this series is to add a user on top of
Paolo's work - and that seems more straightforward to experiment with
and figure out hugepage support in general while getting all the other
hugepage details done in parallel.
I would suggest to not call this "THP". Maybe we can call it "2M folio
support" for gmem.
Similar to other FSes, we could just not limit ourselves to 2M folios,
and simply allocate any large folios. But sticking to 2M might be
beneficial in regards to memory fragmentation (below).
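
For illustration, the allocation side of "2M folio support" could boil
down to something like this (gmem_grab_2m_folio() is a made-up name;
a sketch built on the generic filemap helpers, not taken from the actual
series):

#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Allocate an order-9 (2M with 4K pages) folio for a gmem offset and
 * insert it into the page cache -- with no fallback to smaller orders.
 */
static struct folio *gmem_grab_2m_folio(struct address_space *mapping,
                                        pgoff_t index)
{
        const unsigned int order = 9;   /* HPAGE_PMD_ORDER on x86-64 */
        struct folio *folio;
        int err;

        folio = filemap_alloc_folio(GFP_HIGHUSER, order);
        if (!folio)
                return ERR_PTR(-ENOMEM);

        err = filemap_add_folio(mapping, folio,
                                round_down(index, 1UL << order),
                                GFP_HIGHUSER);
        if (err) {
                folio_put(folio);
                return ERR_PTR(err);
        }
        return folio;
}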
With memory compaction working for guest_memfd, it would all be easier.
... btw do you know how well this is coming along?
People have been talking about that, but I suspect this is very
long-term material.
Note that I'm not quite sure about the "2MB" interface; should it be a
"PMD-size" interface?
I think Mike and I touched upon this aspect too - and I may be
misremembering - Mike suggested getting 1M, 2M, and bigger page sizes
in increments -- and then fitting in PMD sizes when we've had enough of
those. That is to say he didn't want to preclude it, or gate the PMD
work on enabling all sizes first.
Starting with 2M is reasonable for now. The real question is how we want
to deal with
(a) Not being able to allocate a 2M folio reliably
(b) Partial discarding
Using only (unmovable) 2M folios would effectively not cause any real
memory fragmentation in the system, because memory compaction operates
on 2M pageblocks on x86. So that feels quite compelling.
Ideally we'd have a 2M pagepool from which guest_memfd would allocate
pages and to which it would put back pages. Yes, this sounds similar to
hugetlb, but it might be much easier to implement, because we are not
limited by some of the hugetlb design decisions (HVO, not being able to
partially map them, etc.).
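
Such a pool could start out very simple; a rough sketch (all names made
up, refill/reclaim policy completely hand-waved):

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

struct gmem_2m_pool {
        spinlock_t lock;
        struct list_head free;          /* free order-9 (2M) folios */
};

/* Hand out a 2M folio, preferring the pool over the buddy. */
static struct folio *gmem_pool_get(struct gmem_2m_pool *pool)
{
        struct folio *folio = NULL;

        spin_lock(&pool->lock);
        if (!list_empty(&pool->free)) {
                folio = list_first_entry(&pool->free, struct folio, lru);
                list_del(&folio->lru);
        }
        spin_unlock(&pool->lock);

        /*
         * Refill from the buddy on demand; under fragmentation this can
         * fail, at which point we'd return NULL rather than falling back
         * to small pages.
         */
        if (!folio)
                folio = folio_alloc(GFP_HIGHUSER, 9);
        return folio;
}

/*
 * Return a fully freed 2M folio to the pool instead of the buddy, so it
 * stays intact for the next guest_memfd allocation.
 */
static void gmem_pool_put(struct gmem_2m_pool *pool, struct folio *folio)
{
        spin_lock(&pool->lock);
        list_add(&folio->lru, &pool->free);
        spin_unlock(&pool->lock);
}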
--
Cheers,
David / dhildenb