On Thu, Dec 22, 2022 at 08:58:40PM +0000, Oliver Upton wrote: > On Thu, Dec 22, 2022 at 09:01:15AM +0000, Marc Zyngier wrote: > > On Wed, 21 Dec 2022 17:46:24 +0000, Oliver Upton <oliver.upton@xxxxxxxxx> wrote: > > > - When UFFD is in use, translation faults are reported to userspace as > > > writes when from a RW memslot and reads when from an RO memslot. > > > > Not quite: translation faults are reported as reads if TCR_EL1.HA > > isn't set, and as writes if it is. Ignoring TCR_EL1.HD for a moment, > > this matches exactly the behaviour of the page-table walker, which > > will update the S1 PTs only if this bit is set. > > My bad, yes you're right. I conflated the use case here with the > architectural state. > > I'm probably being way too pedantic, but I just wanted to make sure we > agree about the ensuing subtlety. More below: > > > Or is it what userfaultfd does on its own? That'd be confusing... > > > > > > > > - S1 page table memory is spuriously marked as dirty, as we presume a > > > write immediately follows the translation fault. That isn't entirely > > > senseless, as it would mean both the target page and the S1 PT that > > > maps it are both old. This is nothing new I suppose, just weird. > > > > s/old/young/ ? > > > > I think you're confusing the PT access with the access that caused the > > PT access (I'll have that printed on a t-shirt, thank you very much). > > I'd buy it! > > > Here, we're not considering the cause of the PT access anymore. If > > TCR_EL1.HA is set, the S1 PT page will be marked as accessed even on a > > read, and only that page. > > I think this is where the disconnect might be. TCR_EL1.HA == 1 suggests > a write could possibly follow, but I don't think it requires it. The > page table walker must first load the S1 PTE before writing to it. > > From AArch64.S1Translate() (DDI0487H.a): > > (fault, descaddress, walkstate, descriptor) = AArch64.S1Walk(fault, walkparams, va, regime, > ss, acctype, iswrite, ispriv); > > [...] > > new_desc = descriptor; > if walkparams.ha == '1' && AArch64.FaultAllowsSetAccessFlag(fault) then > // Set descriptor AF bit > new_desc<10> = '1'; > > [...] > > // Either the access flag was clear or AP<2> is set > if new_desc != descriptor then > if regime == Regime_EL10 && EL2Enabled() then > s1aarch64 = TRUE; > s2fs1walk = TRUE; > aligned = TRUE; > iswrite = TRUE; > (s2fault, descupdateaddress) = AArch64.S2Translate(fault, descaddress, s1aarch64, > ss, s2fs1walk, AccType_ATOMICRW, > aligned, iswrite, ispriv); > > if s2fault.statuscode != Fault_None then > return (s2fault, AddressDescriptor UNKNOWN); > else > descupdateaddress = descaddress; > > (fault, mem_desc) = AArch64.MemSwapTableDesc(fault, descriptor, new_desc, > walkparams.ee, descupdateaddress) > > Buried in AArch64.S1Walk() is a stage-2 walk for a read to fetch the > descriptor. The second stage-2 walk for write is conditioned on having > already fetched the stage-1 descriptor and determining the AF needs > to be set. > > Relating back to UFFD: if we expect KVM to do exactly what hardware > does, UFFD should see an attempted read when the first walk fails > because of an S2 translation fault. Based on this patch, though, we'd > promote it to a write if TCR_EL1.HA == 1. > > This has the additional nuance of marking the S1 PT's IPA as dirty, even > though it might not actually have been written to. Having said that, > the false positive rate should be negligible given that S1 PTs ought to > account for a small amount of guest memory. Another false positive is TCR_EL1.HA == 1 and having the AF bit set in the PTE. This results on a write, when I don't think it should. > > Like I said before, I'm probably being unnecessarily pedantic :) It just > seems to me that the view we're giving userspace of S1PTW aborts isn't > exactly architectural and I want to make sure that is explicitly > intentional. > > -- > Thanks, > Oliver