On 12/12/2024 14:26, Marc Zyngier wrote:
> On Thu, 12 Dec 2024 10:55:45 +0000,
> Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> On 12/12/2024 08:25, Marc Zyngier wrote:
>>>> +
>>>> + local_flush_tlb_all();
>>>
>>> The elephant in the room: if TLBs are in such a sorry state, what
>>> guarantees we can make it this far?
>>
>> I'll leave Miko to respond to your other comments, but I wanted to address this
>> one, since it's pretty fundamental. We went around this loop internally and
>> concluded that what we are doing is architecturally sound.
>>
>> The expectation is that a conflict abort can only be generated as a result of
>> the change in patch 4 (and patch 5). That change makes it possible for the TLB
>> to end up with a multihit. But crucially that can only happen for user space
>> memory because that change only operates on user memory. And while the TLB may
>> detect the conflict at any time, the conflict abort is only permitted to be
>> reported when an architectural access is prevented by the conflict. So we never
>> do anything that would allow a conflict for a kernel memory access and a user
>> memory conflict abort can never be triggered as a result of accessing kernel memory.
>>
>> Copy/pasting comment from AlexC on the topic, which explains it better than I can:
>>
>> """
>> The intent is certainly that in cases where the hardware detects a TLB conflict
>> abort, it is only permitted to report it (by generating an exception) if it
>> applies to an access that is being attempted architecturally. ... that property
>> can be built from the following two properties:
>>
>> 1. The TLB conflict can only be reported as an Instruction Abort or a Data Abort
>>
>> 2. Those two exception types must be reported synchronously and precisely.
>> """
>
> I totally agree with this. The issue is that nothing says that the
> abort is in any way related to userspace.
>
>>>
>>> I honestly don't think you can reliably handle a TLB Conflict abort in
>>> the same translation regime as the original fault, given that we don't
>>> know the scope of that fault. You are probably making an educated
>>> guess that it is good enough on the CPUs you know of, but I don't see
>>> anything in the architecture that indicates the "blast radius" of a
>>> TLB conflict.
>>
>> OK, so I'm claiming that the blast radius is limited to the region of memory
>> that we are operating on in contpte_collapse() in patch 4. Perhaps we need to go
>> re-read the ARM and come back with the specific statements that led us to that
>> conclusion?
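For reference, here is a minimal standalone sketch of what AlexC's first property means in terms of the syndrome a handler sees. It is not taken from the patch series, and the macro and helper names are hypothetical. A TLB conflict can only arrive as an Instruction Abort or a Data Abort, identified by the exception class in ESR_ELx[31:26], with a fault status code of 0b110000 in ISS[5:0]. The EC also records whether the abort came from a lower EL. That tells you which EL made the access, not whose memory was accessed, so on its own it does not settle the question of whether the abort is related to userspace.

```c
#include <stdbool.h>
#include <stdint.h>

/* ESR_ELx.EC occupies bits [31:26]. Names below are hypothetical. */
#define ESR_EC_SHIFT 26
#define ESR_EC_MASK  0x3fu

/* Abort exception classes (Arm ARM encodings). */
#define EC_IABT_LOWER 0x20u /* Instruction Abort from a lower EL (e.g. EL0) */
#define EC_IABT_CUR   0x21u /* Instruction Abort without a change in EL */
#define EC_DABT_LOWER 0x24u /* Data Abort from a lower EL (e.g. EL0) */
#define EC_DABT_CUR   0x25u /* Data Abort without a change in EL */

/* IFSC/DFSC live in ISS bits [5:0]; 0b110000 is a TLB conflict abort. */
#define FSC_MASK         0x3fu
#define FSC_TLB_CONFLICT 0x30u

static inline unsigned int esr_ec(uint64_t esr)
{
	return (unsigned int)((esr >> ESR_EC_SHIFT) & ESR_EC_MASK);
}

/*
 * Property 1: a TLB conflict can only be reported as an Instruction
 * Abort or a Data Abort, so only those ECs are checked for FSC 0x30.
 */
static inline bool is_tlb_conflict_abort(uint64_t esr)
{
	switch (esr_ec(esr)) {
	case EC_IABT_LOWER:
	case EC_IABT_CUR:
	case EC_DABT_LOWER:
	case EC_DABT_CUR:
		return (esr & FSC_MASK) == FSC_TLB_CONFLICT;
	default:
		return false;
	}
}

/*
 * True if the abort was taken from a lower EL, i.e. the access was made
 * at EL0. This says nothing about which memory was accessed: a kernel
 * uaccess to user memory still reports EC_DABT_CUR.
 */
static inline bool from_lower_el(uint64_t esr)
{
	unsigned int ec = esr_ec(esr);

	return ec == EC_IABT_LOWER || ec == EC_DABT_LOWER;
}
```

For example, an ESR of 0x92000030 (Data Abort from EL0, TLB conflict) satisfies both helpers. 0x96000030 (the same fault taken without a change in EL) is still a TLB conflict, but not one from a lower EL. An SVC syndrome such as 0x56000030 is rejected even though its low six bits happen to be 0x30.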