Re: [RESEND RFC PATCH v1 2/5] arm64: Add BBM Level 2 cpu feature

Marc Zyngier <maz@xxxxxxxxxx> · Thu, 12 Dec 2024 14:26:59 +0000

On Thu, 12 Dec 2024 10:55:45 +0000,
Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> 
> On 12/12/2024 08:25, Marc Zyngier wrote:
> >> +
> >> +	local_flush_tlb_all();
> > 
> > The elephant in the room: if TLBs are in such a sorry state, what
> > guarantees we can make it this far?
> 
> I'll leave Miko to respond to your other comments, but I wanted to address this
> one, since it's pretty fundamental. We went around this loop internally and
> concluded that what we are doing is architecturally sound.
> 
> The expectation is that a conflict abort can only be generated as a result of
> the change in patch 4 (and patch 5). That change makes it possible for the TLB
> to end up with a multihit. But crucially that can only happen for user space
> memory because that change only operates on user memory. And while the TLB may
> detect the conflict at any time, the conflict abort is only permitted to be
> reported when an architectural access is prevented by the conflict. So we never
> do anything that would allow a conflict for a kernel memory access and a user
> memory conflict abort can never be triggered as a result of accessing kernel memory.
> 
> Copy/pasting comment from AlexC on the topic, which explains it better than I can:
> 
> """
> The intent is certainly that in cases where the hardware detects a TLB conflict
> abort, it is only permitted to report it (by generating an exception) if it
> applies to an access that is being attempted architecturally. ... that property
> can be built from the following two properties:
> 
> 1. The TLB conflict can only be reported as an Instruction Abort or a Data Abort
> 
> 2. Those two exception types must be reported synchronously and precisely.
> """

I totally agree with this. The issue is that nothing says that the
abort is in any way related to userspace.

> > 
> > I honestly don't think you can reliably handle a TLB Conflict abort in
> > the same translation regime as the original fault, given that we don't
> > know the scope of that fault. You are probably making an educated
> > guess that it is good enough on the CPUs you know of, but I don't see
> > anything in the architecture that indicates the "blast radius" of a
> > TLB conflict.
> 
> OK, so I'm claiming that the blast radius is limited to the region of memory
> that we are operating on in contpte_collapse() in patch 4. Perhaps we need to go
> re-read the ARM and come back with the specific statements that led us to that
> conclusion?

But we don't know for sure what caused this conflict by the time we
arrive in the handler. It could equally be because we have a glaring
bug somewhere on the kernel side, even if you are *now* only concerned
with userspace.

If anything, this should absolutely check for FAR_EL1 and assert that
this is indeed caused by such change.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.