Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory

David Hildenbrand <david@xxxxxxxxxx> · Mon, 22 Feb 2021 16:30:47 +0100

On 22.02.21 15:02, Michal Hocko wrote:
> On Mon 22-02-21 14:22:37, David Hildenbrand wrote:
>> Exactly. But for hugetlbfs/shmem ("!RAM-backed files") this is not what we
>> want.
>
> OK, then I must have misread your requirements. Maybe I just got lost in
> all the combinations you have listed.
>
>> Another special case could be dax/pmem I think. You might want to fault it
>> in readable/writable but not perform an actual read/write unless really
>> required.
>>
>> QEMU phrases this as "don't cause wear on the storage backing".
>
> Sorry for being dense here but I still do not follow. If you do not want
> to read then what do you want to populate from? Only map if it is in the

In the context of VMs it's usually rather a means to preallocate backend
storage - which would also happen on read access. See below on case 4).

> page cache?

Let's try to untangle my thoughts regarding VMs. We could have as 
backend storage for our VM:

1) Anonymous memory
2) hugetlbfs (private/shared)
3) tmpfs/shmem (private/shared)
4) Ordinary files (shared)
5) DAX/PMEM (shared)

Excluding special cases (hypervisor upgrades with 2) and 3) ), we expect 
to have pre-existing content in files only in 4) and 5). 4) and 5) might 
be used as NVDIMM backend for a guest, or as DIMM backend.

The first access of our VM to memory could be
a) Write: the usual case when exposed as RAM/DIMM to our guest.
b) Read: possible case when exposed as an NVDIMM to our guest (we don't
   know). But eventually, we might write to (parts of) NVDIMMs later.

We "preallocate"/"populate" memory of our VM so that
- We know we have sufficient backend storage (esp. hugetlbfs, shmem,
  files) - so we don't randomly crash the VM. My most important use
  case.
- We avoid page faults (including page zeroing!) at runtime. Especially
  relevant for RT workloads.

With 1), 2), and 3) we want to have pages faulted in writable - we 
expect that our guest will write to that memory. MADV_POPULATE would do 
that only for 1), and MAP_PRIVATE of 2). For the shared parts, we would 
want MADV_POPULATE_WRITE semantics.
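
To make the "faulted in writable" case concrete, here is a minimal
userspace sketch, not a definitive implementation. It assumes a
MADV_POPULATE_WRITE advice value exists; the numeric fallback value and
the helper name populate_write() are assumptions for illustration only.
If the kernel rejects the advice with EINVAL, the sketch falls back to
touching one byte per base page, which is roughly what has to be done by
hand today.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: placeholder value if the uapi headers do not define it. */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: prefault [addr, addr + len) writable.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int populate_write(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;

    /* Anything but EINVAL is a real failure (e.g. no backend storage). */
    if (errno != EINVAL)
        return -1;

    /*
     * Fallback, assuming EINVAL means "advice not supported": write back
     * the byte we just read, once per base page. Existing content is
     * preserved, but unlike the madvise() path this performs real
     * reads and writes on the backend. For hugetlbfs this touches each
     * huge page more often than needed, which is harmless.
     */
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    volatile unsigned char *p = addr;
    for (size_t off = 0; off < len; off += (size_t)page)
        p[off] = p[off];

    return 0;
}
```

A caller would mmap() the guest RAM region first and then call
populate_write(addr, len) once before starting the VM.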

With 5), we already had complaints that preallocation in QEMU takes a
long time - because we end up actually reading/writing slow PMEM
(libvirt now disables preallocation for that reason, which makes sense).
However, MADV_POPULATE_WRITE would help to prefault without actually 
reading/writing pmem - if we want to avoid any minor faults.

With 4), I think we primarily prealloc/prefault to make sure we have 
sufficient backend storage. fallocate() might do a better job just for 
the allocation. But if there is sufficient RAM it might make sense to 
prefault all guest RAM at least readable - then we only have a minor 
fault when the VM writes to it and might avoid having to go to disk. 
Prefaulting everything writable means that we *have to* write back all 
guest RAM even if the guest never accessed it. So I think there are 
cases where MADV_POPULATE_READ (current MADV_POPULATE) semantics could 
make sense.
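
For case 4), the two steps (reserve backend storage, then prefault
readable) could be sketched as below. This is only an illustration under
assumptions: the helper name map_file_prealloc_read() is made up, the
numeric fallback for MADV_POPULATE_READ is a placeholder, and
posix_fallocate() stands in for whatever allocation call one prefers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: placeholder value if the uapi headers do not define it. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif

/*
 * Hypothetical helper: make sure the file has blocks for [0, len), map it
 * shared, and prefault the mapping readable.
 * Returns the mapping, or MAP_FAILED with errno set.
 */
static void *map_file_prealloc_read(int fd, size_t len)
{
    /* posix_fallocate() returns the error number instead of setting errno. */
    int err = posix_fallocate(fd, 0, (off_t)len);
    if (err != 0) {
        errno = err;
        return MAP_FAILED;
    }

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED)
        return MAP_FAILED;

    if (madvise(addr, len, MADV_POPULATE_READ) != 0) {
        if (errno != EINVAL) {
            int saved = errno;
            munmap(addr, len);
            errno = saved;
            return MAP_FAILED;
        }

        /* Fallback, assuming EINVAL means "not supported": read one
         * byte per page so the page cache is filled and mapped. */
        long page = sysconf(_SC_PAGESIZE);
        assert(page > 0);

        const volatile unsigned char *p = addr;
        for (size_t off = 0; off < len; off += (size_t)page)
            (void)p[off];
    }

    return addr;
}
```

Nothing is dirtied by the read prefault, so there is no forced writeback
of guest RAM that the guest never wrote to; a later write by the guest
then takes a fault on a page that is already in the page cache.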

--
Thanks,

David / dhildenb