Re: FYI: Possible HPFAR_EL2 corruption (LPAE guests on AArch64 hosts)

Jon Masters <jcm@xxxxxxxxxxxxxx> · Mon, 8 Jul 2019 09:14:19 -0400

On 7/8/19 9:04 AM, Marc Zyngier wrote:

> [Adding myself to the cc-list, for real this time! ;-)]

Hehe

> On 08/07/2019 13:16, Jon Masters wrote:
>> Hi Mark,
>>
>> Thanks for adding the CCs. See below for more.
>>
>> On 7/8/19 7:47 AM, Mark Rutland wrote:
>>> On Sun, Jul 07, 2019 at 11:39:46PM -0400, Jon Masters wrote:
>>>> Hi all,
>>>
>>> Hi Jon,
>>>
>>> [adding Marc and the kvm-arm list]
>>>
>>>> TLDR: We think $subject may be a hardware errata and we are
>>>> investigating. I was asked to drop a note to share my initial analysis
>>>> in case others have been experiencing similar problems with 32-bit VMs.
>>>>
>>>> The Fedora Arm 32-bit builders run as "armv7hl+lpae" (aarch32) LPAE
>>>> (VMSAv8-32 Long-descriptor table format in aarch32 execution state) VMs
>>>> on AArch64 hosts. Under certain conditions, those builders will "pause"
>>>> with the following obscure looking error message:
>>>>
>>>> kvm [10652]: load/store instruction decoding not implemented
>>>>
>>>> (which is caused by a fall-through in io_mem_abort, the code assumes
>>>> that if we couldn't find the guest memslot we're taking an IO abort)
>>>>
>>>> This has been happening on and off for more than a year, tickled further
>>>> by various 32-bit Fedora guest updates, leading to some speculation that
>>>> there was actually a problem with guest toolchains generating
>>>> hard-to-emulate complex load/store instruction sequences not handled in KVM.
>>>>
>>>> After extensive analysis, I believe instead that it appears on the
>>>> platform we are using in Fedora that a stage 2 fault (e.g. v8.0 software
>>>> access bit update in the host) taken during stage 1 guest page table
>>>> walk will result in an HPFAR_EL2 truncation to a 32-bit address instead
>>>> of the full 48-bit IPA in use due to aarch32 LPAE. I believe that this
>>>> is a hardware errata and have requested that the vendor investigate.
>>>>
>>>> Meanwhile, I have a /very/ nasty patch that checks the fault conditions
>>>> in kvm_handle_guest_abort and if they match (S1 PTW, etc.), does a
>>>> software walk through the guest page tables looking for a PTE that
>>>> matches with the lower part of the faulting address bits we did get
>>>> reported to the host, then re-injects the correct fault. With this
>>>> patch, the test builder stays up, albeit correcting various faults:
>>>>
>>>> [  143.670063] JCM: WARNING: Mismatched FIPA and PA translation detected!
>>>> [  143.748447] JCM: Hyper faulting far: 0x3deb0000
>>>> [  143.802808] JCM: Guest faulting far: 0xb6dce3c4 (gfn: 0x3deb)
>>>> [  143.871776] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b06cc40
>>>> [  143.938649] JCM: Guest PGD address: 0x5b06cc50
>>>> [  143.991962] JCM: Guest PGD: 0x5b150003
>>>> [  144.036925] JCM: Guest PMD address: 0x5b150db0
>>>> [  144.090238] JCM: Guest PMD: 0x43deb0003
>>>> [  144.136241] JCM: Guest PTE address: 0x43deb0e70
>>>> [  144.190604] JCM: Guest PTE: 0x42000043bb72fdf
>>>> [  144.242884] JCM: Manually translated as: 0xb6dce3c4->0x43bb72000
>>>> [  144.314972] JCM: Faulting IPA page: 0x3deb0000
>>>> [  144.368286] JCM: Faulting PTE page: 0x43deb0000
>>>> [  144.422641] JCM: Fault occurred while performing S1 PTW -fixing
>>>> [  144.493684] JCM: corrected fault_ipa: 0x43deb0000
>>>> [  144.550133] JCM: Corrected gfn: 0x43deb
>>>> [  144.596145] JCM: handle user_mem_abort
>>>> [  144.641155] JCM: ret: 0x1
>>>
>>> When the conditions are met, does the issue continue to trigger
>>> reliably?
>>
>> Yeah. But only for certain faults - seems to be specifically for stage 1
>> page table walks that cause a trap to stage 2.
> 
> Do we know for sure this is limited to the guest using LPAE? I
> appreciate that this is the configuration you're running in, but it
> would be an interesting data point to work out what is happening with
> small descriptors.

It appears to be so, yea. I believe it's a truncation errata on this
platform (X-Gene1), and the vendor is currently investigating for us.

>>
>>> e.g. if you return to the guest without fixing the fault, do you always
>>> see the truncation when taking the fault again?
>>
>> I believe so, but I need to specifically check that.
>>
>>> If you try the translation with an AT, does that work as expected? We've
>>> had to use that elsewhere; see __populate_fault_info() in
>>> arch/arm64/kvm/hyp/switch.c.
>>
>> Yea, I've seen that code for the other errata :) The problem is the
>> virtual address in the FAR is different from the one we ultimately have
>> a PA translation for. We take a fault when the hardware walker tries to
>> perform a load to (e.g.) the PTE leaf during the translation of the VA.
>> So the PTE itself is what we are trying to load, not the PA of the VA
>> that the guest userspace/kernel tried to load. Hence an AT won't work,
>> unless I'm missing something. My first thought had been to do that.
> 
> Ah, that's the bit I was missing: S1PTW not completing in S2 because of
> the access flag. Duh.

:)

> Random idea: an option (although not a desirable one) would be to change
> the way we handle page aging on the host by forcing an unmap at S2
> instead of twiddling the access flag. Does this change anything on your
> system?

Well now, that's an interesting idea. If I get chance today, I'll try.

(I'm kinda getting close to punting this to the vendor...it resulted in
a couple of all nighters last week trying to figure what the heck was
going on. BUT I do want to figure out what could be an actually
upstreamable quirk since Fedora is relying on this hardware)

Jon.
_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm