Re: [RESEND RFC PATCH v1 2/5] arm64: Add BBM Level 2 cpu feature

Ryan Roberts <ryan.roberts@xxxxxxx> · Thu, 12 Dec 2024 15:05:24 +0000



On 12/12/2024 14:26, Marc Zyngier wrote:
> On Thu, 12 Dec 2024 10:55:45 +0000,
> Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> On 12/12/2024 08:25, Marc Zyngier wrote:
>>>> +
>>>> +	local_flush_tlb_all();
>>>
>>> The elephant in the room: if TLBs are in such a sorry state, what
>>> guarantees we can make it this far?
>>
>> I'll leave Miko to respond to your other comments, but I wanted to address this
>> one, since it's pretty fundamental. We went around this loop internally and
>> concluded that what we are doing is architecturally sound.
>>
>> The expectation is that a conflict abort can only be generated as a result of
>> the change in patch 4 (and patch 5). That change makes it possible for the TLB
>> to end up with a multihit. But crucially that can only happen for user space
>> memory because that change only operates on user memory. And while the TLB may
>> detect the conflict at any time, the conflict abort is only permitted to be
>> reported when an architectural access is prevented by the conflict. So we never
>> do anything that would allow a conflict for a kernel memory access and a user
>> memory conflict abort can never be triggered as a result of accessing kernel memory.
>>
>> Copy/pasting comment from AlexC on the topic, which explains it better than I can:
>>
>> """
>> The intent is certainly that in cases where the hardware detects a TLB conflict
>> abort, it is only permitted to report it (by generating an exception) if it
>> applies to an access that is being attempted architecturally. ... that property
>> can be built from the following two properties:
>>
>> 1. The TLB conflict can only be reported as an Instruction Abort or a Data Abort
>>
>> 2. Those two exception types must be reported synchronously and precisely.
>> """
> 
> I totally agree with this. The issue is that nothing says that the
> abort is in any way related to userspace.
> 
>>>
>>> I honestly don't think you can reliably handle a TLB Conflict abort in
>>> the same translation regime as the original fault, given that we don't
>>> know the scope of that fault. You are probably making an educated
>>> guess that it is good enough on the CPUs you know of, but I don't see
>>> anything in the architecture that indicates the "blast radius" of a
>>> TLB conflict.
>>
>> OK, so I'm claiming that the blast radius is limited to the region of memory
>> that we are operating on in contpte_collapse() in patch 4. Perhaps we need to go
>> re-read the ARM and come back with the specific statements that led us to that
>> conclusion?