On Wed, Sep 22, 2021 at 2:52 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Wed, Sep 22, 2021 at 01:54:53PM -0700, Axel Rasmussen wrote:
> > On Wed, Sep 22, 2021 at 10:33 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > Hello, Axel,
> > >
> > > On Wed, Sep 22, 2021 at 10:04:03AM -0700, Axel Rasmussen wrote:
> > > > Thanks for discussing the design Peter. I have some ideas which might
> > > > make for a nicer v2; I'll massage the code a bit and see what I can
> > > > come up with.
> > >
> > > Sure thing. Note again that as I don't have a strong opinion on that, feel
> > > free to keep it. However if you provide v2, I'll read.
> > >
> > > [off-topic below]
> > >
> > > Another thing I have probably forgotten but need your confirmation on is, when
> > > you worked on uffd minor mode, did you explicitly disable thp, or is it allowed?
> >
> > I gave a more detailed answer in the other thread, but: currently it
> > is allowed, but this was a bug / oversight on my part. :) THP collapse
> > can break the guarantees minor fault registration is trying to
> > provide.
>
> I've replied there:
>
> https://lore.kernel.org/linux-mm/YUueOUfoamxOvEyO@t490s/
>
> We can try to keep the discussion unified there regarding this.
>
> > But there's another scenario: what if the collapse happened well
> > before registration happened?
>
> Maybe yes, but my understanding of the current uffd-minor scenario tells me
> that this is fine too. Meanwhile I actually have another idea regarding minor
> mode, please continue reading.
>
> Firstly, let me try to recap how minor mode is used in your production
> systems: I believe there should be two processes A and B; if A is the main
> process, B could be the migration process. B migrates pages in the background,
> while A so far should have been stopped and never ran. When we want to start
> A, we should register A with uffd-minor upon the whole range (note: I think so
> far A does not have any pgtable mapped within uffd-minor range).
> Then any page access of A should kick B, asking "whether it is the latest
> page"; if yes then UFFDIO_CONTINUE, if no then B modifies the page, plus
> UFFDIO_CONTINUE afterwards. Am I right above?
>
> So if that's the case, then A should have no page table at all.
>
> Then, is that a problem if the shmem file that A maps contains huge thps? I
> think no - because UFFDIO_CONTINUE will only install small pages.
>
> Let me know if I'm understanding it right above; I'll be happy to be corrected.

Right, except that our use case is even more similar to QEMU: the code
doing UFFDIO_CONTINUE / demand paging, and the code running the vCPUs,
are in the same process (same mm) - just different threads.

>
> Actually besides this scenario, I'm also thinking of another scenario of using
> minor fault in a single process - that's mostly what QEMU is doing right now,
> as QEMU has the vcpu threads and migration thread sharing a single mm/pgtable.
> So I think it'll be great to have a new madvise(MADV_ZAP) which will tear down
> all the file-backed memory pgtables of a specific range. I think it'll suit
> the minor fault use case perfectly, and it can be used for other things
> too. Let me know what you think about this idea, and whether that'll help in
> your case too (e.g., if you worry a current process A mapped huge shmem thp
> somewhere, we can use madvise(MADV_ZAP) to drop it).

Yes, this would be convenient for our implementation too. :) There are
workarounds if the feature doesn't exist, but it would be nice to
have.

It's also useful for memory poisoning, I think, if the host decides
some page(s) are "bad" and wants to intercept any future guest
accesses to those page(s).
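
To make the flow recapped above concrete, here is a minimal toy model of
it in Python. It makes no real userfaultfd calls: ToyRegion, access() and
handle_minor_fault() are hypothetical names invented for illustration, the
"page cache" and "page table" are plain Python containers, and the
("continue", index) log entry only stands in for the point where a real
handler would issue UFFDIO_CONTINUE.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegion:
    """Toy stand-in for a shmem range registered in uffd-minor mode."""
    page_cache: dict = field(default_factory=dict)  # index -> bytes already in the file
    latest: dict = field(default_factory=dict)      # index -> contents the handler treats as current
    mapped: set = field(default_factory=set)        # indexes with a page table entry installed
    log: list = field(default_factory=list)         # actions taken by the handler, in order

    def access(self, index):
        """Faulting side (A, or a vCPU thread): read a page, faulting if unmapped."""
        if index not in self.mapped:
            # The page is in the page cache but has no mapping: a "minor" fault.
            self.handle_minor_fault(index)
        return self.page_cache[index]

    def handle_minor_fault(self, index):
        """Handler side (B, or the demand paging thread)."""
        if self.page_cache[index] != self.latest[index]:
            # Not the latest page: fix up the contents first.
            self.page_cache[index] = self.latest[index]
            self.log.append(("update", index))
        # Either way, finish by installing the mapping (the UFFDIO_CONTINUE step).
        self.mapped.add(index)
        self.log.append(("continue", index))


region = ToyRegion(
    page_cache={0: b"old", 1: b"new"},
    latest={0: b"new", 1: b"new"},
)
region.access(0)  # stale page: updated, then continued
region.access(1)  # already current: continued only
region.access(0)  # now mapped: no fault at all
print(region.log)
```

The model starts with nothing mapped, matching the assumption above that
the faulting side has no page tables in the registered range. Whether the
two sides are separate processes or threads sharing one mm does not change
the decision logic, only who owns the mapping being filled in.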
>
> > I *think* the existing code deals with THPs correctly in that case, but then
> > again I don't think our selftest really covers this case, and it's not
> > something I've tested in production either (to work around the other bug, we
> > currently MADV_NOHUGEPAGE the area until after VM demand paging completes,
> > and the UFFD registration is removed), so I am not super confident this is
> > the case.
>
> In all cases, enhancing the test program will always be welcomed.
>
> Thanks,
>
> --
> Peter Xu
>