On 23/08/2021 12:33, David Hildenbrand wrote:
> On 23.08.21 12:16, Ralf Ramsauer wrote:
>>
>>
>> On 23/08/2021 10:02, David Hildenbrand wrote:
>>> On 20.08.21 15:13, Ralf Ramsauer wrote:
>>>> Dear mm folks,
>>>>
>>>> I have an issue, where it would be great to have a COW-backed virtual
>>>> memory area within an userspace process. I know there's the possibility
>>>> to have a file-backed MAP_SHARED vma, which is later duplicated with
>>>> MAP_PRIVATE, but that's not exactly what I'm looking for.
>>>>
>>>> Say I have an anonymous page-aligned VMA a, with MAP_PRIVATE and
>>>> PROT_RW. Userspace happily writes to/reads from it. At some point in
>>>> time, I want to 'snapshot' that single VMA within the context of the
>>>> process and without the need to fork(). Say there's something like
>>>>
>>>> a = mmap(0, len, PROT_RW, MAP_ANON | MAP_POPULATE, -1, 0);
>>>> [... fill a ...]
>>>>
>>>> b = mmdup(a, len, PROT_READ);
>>>>
>>>> b shall be the new base pointer of a new VMA that is backed by COW
>>>> mechanisms. After mmdup, those regular COW mechanisms do the rest: both
>>>> VMAs (a and b) will fault on subsequent writes and duplicate the
>>>> previously shared physical mapping, pretty much what cow_fault or
>>>> shared_fault does.
>>>>
>>>> Afaict, this, or at least something like this is currently not supported
>>>> by the kernel. Is that correct? If so, why? Generally spoken, is it a
>>>> bad idea?
>>>
>>> Not sure if it helps (most probably not), QEMU uses uffd-wp for
>>> background snapshots of VM memory. It's different, though, as you'll
>>> only have a single mapping and will be catching modifications to your
>>> single mapping, such that you can "safe away" relevant snapshot pages
>>> before any modifications.
>>
>> Thanks for the pointer, David. I'll have a look.
>>
>>>
>>> You mention "both VMAs (a and b) will fault on subsequent writes", so
>>> would you actually be allowing PROT_WRITE access to b ("snapshot")?
>>>
>>
>> In general, yes, both should be allowed to be PROT_WRITE. So no matter
>> "which side" causes the fault, simply both will lead to duplication.
>>
>> If it would make things easier, then it would also be absolutely fine to
>> have the snapshot PROT_READ, which would suffice my requirements as well.
>
> I recall that Redis has very similar requirements for live snapshotting.

100 points, you just managed to figure out exactly what we're working on! ;-)

> They used to handle it via fork() just as you described as I was told. I

Right, and fork() is damn slow, especially when forking large mappings.
A simple mmap() of the same area (w/o population) is at least 4x faster.
And you don't have to do all the work that fork() implies but that you
don't actually need.

> don't know if they already switched to uffd-wp, but I would guess they
> already did, because they were another excellent use case for uffd-wp
>
> https://lists.gnu.org/archive/html/qemu-devel/2016-10/msg02955.html
>
> You can handle COW manually in user space that way
>
> 1. Creating a second anonymous mapping
> 2. Registering a UFFD-WP handler on the original mapping
> 3. WP-protecting the original mapping via UFFD
> 4. Tracking in a bitmap which pages were already copied

Ok, great, thanks, I'll have a look into that one!

>
> So when you get notified about a WP event, you copy the page manually to
> the second mapping, un-protect the page, and remember in the bitmap that
> the page has been copied.
>
> When reading the snapshot, you have to take a look at the bitmap to
> figure out if you have to read a specific page from the original, or
> from the second mapping. But you won't be able to just read the second
> mapping. (question would be, if that is really required or can be
> worked-around)

Thanks a bunch!
  Ralf
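
For reference, the bookkeeping behind David's steps 1 and 4, plus the read path he describes, could look roughly like the sketch below. It deliberately leaves out the userfaultfd part (steps 2 and 3): the WP notification is replaced by an explicit call to a stand-in function that has to run before the first write to a page. All names (struct snapshot, snapshot_begin, snapshot_on_wp_event, snapshot_page) are made up for illustration, and the fixed 4 KiB page size is an assumption.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: 4 KiB pages. Real code would query sysconf(_SC_PAGESIZE). */
#define SNAP_PAGE_SIZE ((size_t)4096)

/* Hypothetical bookkeeping for one snapshot of one mapping. */
struct snapshot {
    unsigned char *orig;   /* live mapping that keeps being written */
    unsigned char *copy;   /* second anonymous mapping (step 1) */
    uint8_t *bitmap;       /* one bit per page: already copied? (step 4) */
    size_t npages;
};

/* Step 1 and step 4 setup. Steps 2 and 3 (registering the range with
 * userfaultfd and write-protecting it) would follow here in real code. */
static int snapshot_begin(struct snapshot *s, void *orig, size_t len)
{
    assert(len > 0 && len % SNAP_PAGE_SIZE == 0);
    s->orig = orig;
    s->npages = len / SNAP_PAGE_SIZE;
    s->copy = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (s->copy == MAP_FAILED)
        return -1;
    s->bitmap = calloc((s->npages + 7) / 8, 1);
    if (s->bitmap == NULL) {
        munmap(s->copy, len);
        return -1;
    }
    return 0;
}

static int snapshot_page_copied(const struct snapshot *s, size_t pg)
{
    assert(pg < s->npages);
    return (s->bitmap[pg / 8] >> (pg % 8)) & 1;
}

/* Stand-in for the WP event handler: save the old page contents to the
 * second mapping and mark the page in the bitmap. A real handler would
 * then un-protect the page so the faulting write can proceed. */
static void snapshot_on_wp_event(struct snapshot *s, size_t pg)
{
    if (snapshot_page_copied(s, pg))
        return;
    memcpy(s->copy + pg * SNAP_PAGE_SIZE,
           s->orig + pg * SNAP_PAGE_SIZE, SNAP_PAGE_SIZE);
    s->bitmap[pg / 8] |= (uint8_t)(1u << (pg % 8));
}

/* Read path: consult the bitmap to decide whether the snapshot contents
 * of a page live in the second mapping or still in the original. */
static const unsigned char *snapshot_page(const struct snapshot *s, size_t pg)
{
    const unsigned char *base = snapshot_page_copied(s, pg) ? s->copy : s->orig;
    return base + pg * SNAP_PAGE_SIZE;
}
```

A caller would invoke snapshot_begin() once on the populated mapping, call snapshot_on_wp_event() for a page before modifying it, and read the frozen view one page at a time through snapshot_page(). This also shows the limitation David mentions: the second mapping alone never holds the complete snapshot, so readers always have to go through the bitmap.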