On Tue, Nov 7, 2023 at 9:24 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Tue, Nov 07, 2023 at 08:11:09AM -0800, James Houghton wrote:
> > This extra ~8 bytes per page overhead is real, and it is the
> > theoretical maximum additional overhead that userfaultfd would require
> > over a KVM-based demand paging alternative when we are using
> > hugepages. Consider the case where we are using THPs and have just
> > finished post-copy, and we haven't done any collapsing yet:
> >
> > For userfaultfd: because we have UFFDIO_COPY'd or UFFDIO_CONTINUE'd at
> > 4K (because we demand-fetched at 4K), the userspace page tables are
> > entirely shattered. KVM has no choice but to have an entirely
> > shattered second-stage page table as well.
> >
> > For KVM demand paging: the userspace page tables can remain entirely
> > populated, so we get PMD mappings here. KVM, though, uses 4K SPTEs
> > because we have only just finished post-copy and haven't started
> > collapsing yet.
> >
> > So both systems end up with a shattered second stage page table, but
> > userfaultfd has a shattered userspace page table as well (+8 bytes/4K
> > if using THP, +another 8 bytes/2M if using HugeTLB-1G, etc.) and that
> > is where the extra overhead comes from.
> >
> > The second mapping of guest memory that we use today (through which we
> > install memory), given that we are using hugepages, will use PMDs and
> > PUDs, so the overhead is minimal.
> >
> > Hope that clears things up!
>
> Ah I see, thanks James. Though, is this a real concern in production use,
> considering worst case 0.2% overhead (all THP backed) and only exist during
> postcopy, only on destination host?

Good question. In an ideal world, 0.2% of lost memory isn't a huge deal,
but it would be nice to save as much memory as possible. So I see this
overhead point as a nice win for a KVM-based solution, but it is not a
key deciding factor in what the right move is.
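(For concreteness, the 0.2% worst case above is just the ratio of a
page-table entry to the memory it maps. A quick sketch of the arithmetic,
assuming x86-64's 8-byte PTEs, 4K base pages, and 2M PMD regions:)

```python
# Sketch of the extra userspace page-table overhead described above.
# Assumptions: x86-64, i.e. 8-byte page-table entries, 4K base pages,
# and 2M regions mapped per PMD entry.
PTE_SIZE = 8                    # bytes per page-table entry
PAGE_4K = 4 * 1024              # base page size
PMD_REGION = 2 * 1024 * 1024    # memory mapped by one PMD entry

# THP backing: a fully shattered userspace page table costs one PTE per
# 4K page -- the worst-case ~0.2% figure.
thp_overhead = PTE_SIZE / PAGE_4K

# HugeTLB-1G backing: additionally one PMD entry per 2M of guest memory.
hugetlb_1g_overhead = PTE_SIZE / PAGE_4K + PTE_SIZE / PMD_REGION

print(f"THP:        {thp_overhead:.4%}")    # ~0.1953%
print(f"HugeTLB-1G: {hugetlb_1g_overhead:.4%}")
```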
(I think the key deciding factor is: what is the best way to make
post-copy work for 1G pages?)

To elaborate a little more: for Google, I don't think the 0.2% loss is a
huge deal by itself (though I am not exactly an authority here). There
are other memory overheads like this that we have to deal with anyway.
The real challenge for us comes from the fact that we already have a
post-copy system that works and has less overhead. If we were to replace
KVM demand paging with userfaultfd, that would mean *regressing* in
efficiency/performance.

That's the main practical challenge: dealing with the regression. We
have to make sure that VMs can still be packed to the appropriate
efficiency, things like that. At this moment, *I think* this is a
solvable problem, but it would be nice to avoid the problem entirely.

But this is Google's problem; I don't think this point should be the
deciding factor here.

- James