On Friday 22 Apr 2022 at 20:41:47 (+0000), Oliver Upton wrote:
> On Fri, Apr 22, 2022 at 04:00:45PM +0000, Quentin Perret wrote:
> > On Thursday 21 Apr 2022 at 16:40:56 (+0000), Oliver Upton wrote:
> > > The other option would be to not touch the subtree at all until the rcu
> > > callback, as at that point software will not tweak the tables any more.
> > > No need for atomics/spinning and can just do a boring traversal.
> >
> > Right that is sort of what I had in mind. Note that I'm still trying to
> > make my mind about the overall approach -- I can see how RCU protection
> > provides a rather elegant solution to this problem, but this makes the
> > whole thing inaccessible to e.g. pKVM where RCU is a non-starter.
>
> Heh, figuring out how to do this for pKVM seemed hard hence my lazy
> attempt :)
>
> > A
> > possible alternative that comes to mind would be to have all walkers
> > take references on the pages as they walk down, and release them on
> > their way back, but I'm still not sure how to make this race-safe. I'll
> > have a think ...
>
> Does pKVM ever collapse tables into blocks? That is the only reason any
> of this mess ever gets roped in. If not I think it is possible to get
> away with a rwlock with unmap on the write side and everything else on
> the read side, right?
>
> As far as regular KVM goes we get in this business when disabling dirty
> logging on a memslot. Guest faults will lazily collapse the tables back
> into blocks. An equally valid implementation would be just to unmap the
> whole memslot and have the guest build out the tables again, which could
> work with the aforementioned rwlock.

Apologies for the delay on this one, I was away for a while.

Yup, that all makes sense. FWIW the pKVM use-case I have in mind is
slightly different. Specifically, in the pKVM world the hypervisor
maintains a stage-2 for the host, that is all identity mapped. So we use
nice big block mappings as much as we can.
But when a protected guest starts, the hypervisor needs to break down the
host stage-2 blocks to unmap the 4K guest pages from the host (which is
where the protection comes from in pKVM). And when the guest is torn
down, the host can reclaim its pages, hence putting us in a position to
coalesce its stage-2 into nice big blocks again. Note that none of this
coalescing is currently implemented even in our pKVM prototype, so it's
a bit unfair to ask you to deal with this stuff now, but clearly it'd be
cool if there was a way we could make these things coexist and even
ideally share some code...
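
To make the break-down/coalesce cycle above a bit more concrete, here is
a rough userspace toy model of a single block-sized, identity-mapped host
range. To be clear, this is not kernel or pKVM code: every name in it
(toy_range, toy_unmap_page, toy_try_coalesce, ...) is made up for
illustration, it assumes 4K pages with 512 entries per table, and it
ignores locking, TLB invalidation and permissions entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE    4096ULL
#define TOY_PTES_PER_TBL 512
#define TOY_BLOCK_SIZE   (TOY_PAGE_SIZE * TOY_PTES_PER_TBL) /* 2M */

/* Hypothetical model of one block-sized range of the host stage-2. */
struct toy_range {
	uint64_t base;                      /* block-aligned, IPA == PA */
	bool is_block;                      /* true: a single block mapping */
	bool page_mapped[TOY_PTES_PER_TBL]; /* all true while is_block is set */
};

/* Initial state: the whole range is identity mapped with one block. */
static void toy_init(struct toy_range *r, uint64_t base)
{
	r->base = base;
	r->is_block = true;
	for (size_t i = 0; i < TOY_PTES_PER_TBL; i++)
		r->page_mapped[i] = true;
}

static size_t toy_index(const struct toy_range *r, uint64_t addr)
{
	return (size_t)((addr - r->base) / TOY_PAGE_SIZE);
}

static bool toy_is_mapped(const struct toy_range *r, uint64_t addr)
{
	return r->page_mapped[toy_index(r, addr)];
}

/* Guest start: break the block down and unmap one 4K page from the host. */
static void toy_unmap_page(struct toy_range *r, uint64_t addr)
{
	r->is_block = false; /* the range is now described by 4K entries */
	r->page_mapped[toy_index(r, addr)] = false;
}

/* Guest teardown: the host reclaims one 4K page. */
static void toy_reclaim_page(struct toy_range *r, uint64_t addr)
{
	r->page_mapped[toy_index(r, addr)] = true;
}

/*
 * Coalesce back into a block, which is only legal once every 4K page in
 * the range is mapped again. Returns true if the range is a block on exit.
 */
static bool toy_try_coalesce(struct toy_range *r)
{
	for (size_t i = 0; i < TOY_PTES_PER_TBL; i++) {
		if (!r->page_mapped[i])
			return false;
	}
	r->is_block = true;
	return true;
}
```

The only point of the model is that coalescing is conditional on the
whole range being back in the host's hands, so a single page still
donated to a guest keeps the range split into 4K entries.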