On Thu, 12 Dec 2024 10:55:45 +0000, Ryan Roberts <ryan.roberts@xxxxxxx> wrote: > > On 12/12/2024 08:25, Marc Zyngier wrote: > >> + > >> + local_flush_tlb_all(); > > > > The elephant in the room: if TLBs are in such a sorry state, what > > guarantees we can make it this far? > > I'll leave Miko to respond to your other comments, but I wanted to address this > one, since it's pretty fundamental. We went around this loop internally and > concluded that what we are doing is architecturally sound. > > The expectation is that a conflict abort can only be generated as a result of > the change in patch 4 (and patch 5). That change makes it possible for the TLB > to end up with a multihit. But crucially that can only happen for user space > memory because that change only operates on user memory. And while the TLB may > detect the conflict at any time, the conflict abort is only permitted to be > reported when an architectural access is prevented by the conflict. So we never > do anything that would allow a conflict for a kernel memory access and a user > memory conflict abort can never be triggered as a result of accessing kernel memory. > > Copy/pasting comment from AlexC on the topic, which explains it better than I can: > > """ > The intent is certainly that in cases where the hardware detects a TLB conflict > abort, it is only permitted to report it (by generating an exception) if it > applies to an access that is being attempted architecturally. ... that property > can be built from the following two properties: > > 1. The TLB conflict can only be reported as an Instruction Abort or a Data Abort > > 2. Those two exception types must be reported synchronously and precisely. > """ I totally agree with this. The issue is that nothing says that the abort is in any way related to userspace. > > > > I honestly don't think you can reliably handle a TLB Conflict abort in > > the same translation regime as the original fault, given that we don't > > know the scope of that fault. You are probably making an educated > > guess that it is good enough on the CPUs you know of, but I don't see > > anything in the architecture that indicates the "blast radius" of a > > TLB conflict. > > OK, so I'm claiming that the blast radius is limited to the region of memory > that we are operating on in contpte_collapse() in patch 4. Perhaps we need to go > re-read the ARM and come back with the specific statements that led us to that > conclusion? But we don't know for sure what caused this conflict by the time we arrive in the handler. It could equally be because we have a glaring bug somewhere on the kernel side, even if you are *now* only concerned with userspace. If anything, this should absolutely check for FAR_EL1 and assert that this is indeed caused by such change. Thanks, M. -- Without deviation from the norm, progress is not possible.