On 7/8/19 9:09 AM, Mark Rutland wrote:
> [Adding Marc for real this time]
>
> On Mon, Jul 08, 2019 at 08:16:25AM -0400, Jon Masters wrote:
>> On 7/8/19 7:47 AM, Mark Rutland wrote:
>>> On Sun, Jul 07, 2019 at 11:39:46PM -0400, Jon Masters wrote:
>>>> TLDR: We think $subject may be a hardware errata and we are
>>>> investigating. I was asked to drop a note to share my initial analysis
>>>> in case others have been experiencing similar problems with 32-bit VMs.
>>>>
>>>> The Fedora Arm 32-bit builders run as "armv7hl+lpae" (aarch32) LPAE
>>>> (VMSAv8-32 Long-descriptor table format in aarch32 execution state) VMs
>>>> on AArch64 hosts. Under certain conditions, those builders will "pause"
>>>> with the following obscure looking error message:
>>>>
>>>> kvm [10652]: load/store instruction decoding not implemented
>>>>
>>>> (which is caused by a fall-through in io_mem_abort, the code assumes
>>>> that if we couldn't find the guest memslot we're taking an IO abort)
>>>>
>>>> This has been happening on and off for more than a year, tickled further
>>>> by various 32-bit Fedora guest updates, leading to some speculation that
>>>> there was actually a problem with guest toolchains generating
>>>> hard-to-emulate complex load/store instruction sequences not handled in KVM.
>>>>
>>>> After extensive analysis, I believe instead that it appears on the
>>>> platform we are using in Fedora that a stage 2 fault (e.g. v8.0 software
>>>> access bit update in the host) taken during stage 1 guest page table
>>>> walk will result in an HPFAR_EL2 truncation to a 32-bit address instead
>>>> of the full 48-bit IPA in use due to aarch32 LPAE. I believe that this
>>>> is a hardware errata and have requested that the vendor investigate.
>>>>
>>>> Meanwhile, I have a /very/ nasty patch that checks the fault conditions
>>>> in kvm_handle_guest_abort and if they match (S1 PTW, etc.), does a
>>>> software walk through the guest page tables looking for a PTE that
>>>> matches with the lower part of the faulting address bits we did get
>>>> reported to the host, then re-injects the correct fault. With this
>>>> patch, the test builder stays up, albeit correcting various faults:
>>>>
>>>> [ 143.670063] JCM: WARNING: Mismatched FIPA and PA translation detected!
>>>> [ 143.748447] JCM: Hyper faulting far: 0x3deb0000
>>>> [ 143.802808] JCM: Guest faulting far: 0xb6dce3c4 (gfn: 0x3deb)
>>>> [ 143.871776] JCM: Guest TTBCR: 0xb5023500, TTBR0: 0x5b06cc40
>>>> [ 143.938649] JCM: Guest PGD address: 0x5b06cc50
>>>> [ 143.991962] JCM: Guest PGD: 0x5b150003
>>>> [ 144.036925] JCM: Guest PMD address: 0x5b150db0
>>>> [ 144.090238] JCM: Guest PMD: 0x43deb0003
>>>> [ 144.136241] JCM: Guest PTE address: 0x43deb0e70
>>>> [ 144.190604] JCM: Guest PTE: 0x42000043bb72fdf
>>>> [ 144.242884] JCM: Manually translated as: 0xb6dce3c4->0x43bb72000
>>>> [ 144.314972] JCM: Faulting IPA page: 0x3deb0000
>>>> [ 144.368286] JCM: Faulting PTE page: 0x43deb0000
>>>> [ 144.422641] JCM: Fault occurred while performing S1 PTW -fixing
>>>> [ 144.493684] JCM: corrected fault_ipa: 0x43deb0000
>>>> [ 144.550133] JCM: Corrected gfn: 0x43deb
>>>> [ 144.596145] JCM: handle user_mem_abort
>>>> [ 144.641155] JCM: ret: 0x1
>>>
>>> When the conditions are met, does the issue continue to trigger
>>> reliably?
>>
>> Yeah. But only for certain faults - seems to be specifically for stage 1
>> page table walks that cause a trap to stage 2.
>
> Ok. It sounds like we could write a small guest to trigger that
> deliberately with some pre-allocated page tables placed above a 4GiB
> IPA.

Yea, indeed. It's funny what you realize as you're writing emails about
it - was thinking that earlier :) Ok, that sounds like fun.

>>> e.g. if you return to the guest without fixing the fault, do you always
>>> see the truncation when taking the fault again?
>>
>> I believe so, but I need to specifically check that.
>>
>>> If you try the translation with an AT, does that work as expected? We've
>>> had to use that elsewhere; see __populate_fault_info() in
>>> arch/arm64/kvm/hyp/switch.c.
>>
>> Yea, I've seen that code for the other errata :) The problem is the
>> virtual address in the FAR is different from the one we ultimately have
>> a PA translation for. We take a fault when the hardware walker tries to
>> perform a load to (e.g.) the PTE leaf during the translation of the VA.
>> So the PTE itself is what we are trying to load, not the PA of the VA
>> that the guest userspace/kernel tried to load. Hence an AT won't work,
>> unless I'm missing something. My first thought had been to do that.
>
> My bad; I thought a failed AT reported the relevant IPA when it failed
> as a result of a stage-2 fault, but I see now that it does not.

Random aside - it would be great if there were an AT variant that did :)

> I don't think that we can reliably walk the guest's Stage-1 tables
> without trapping TLB invalidations (and/or stopping all vCPUs), so
> that's rather unfortunate.

Indeed. In the Fedora case, it's only a single vCPU in each guest so
they effectively already do that (and hence my test hack "works") but
that's another thing that would need to be handled for a real fix.

Jon.
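
The address arithmetic in the log above can be checked by hand. What
follows is a minimal Python sketch of that arithmetic, not the actual
patch. It assumes a 3-level long-descriptor walk starting from TTBR0,
a 4KiB granule, 8-byte descriptors with the next-level address in bits
[39:12], and no block mappings. The names walk() and low32_matches()
are made up for illustration, and read_desc stands in for however the
host would read a descriptor at a guest IPA.

```python
DESC_SIZE = 8                   # long-descriptor entries are 64 bits wide
ADDR_MASK = 0xFFFFFFF000        # descriptor bits [39:12] (4KiB granule)
TTBR_BASE_MASK = 0xFFFFFFFFE0   # assumption: 4-entry first-level table, base in bits [39:5]


def walk(ttbr0, va, read_desc):
    """Walk three levels and return ([descriptor addresses], output PA).

    read_desc(ipa) must return the 64-bit descriptor stored at that IPA.
    """
    addrs = []
    table = ttbr0 & TTBR_BASE_MASK
    # (shift, index mask) for level 1 (VA[31:30]), level 2 (VA[29:21]),
    # and level 3 (VA[20:12]).
    for shift, mask in ((30, 0x3), (21, 0x1FF), (12, 0x1FF)):
        addr = table + ((va >> shift) & mask) * DESC_SIZE
        desc = read_desc(addr)
        if desc & 0x3 != 0x3:
            # Invalid entry or block mapping: outside this sketch's scope.
            raise ValueError("unhandled descriptor %#x at %#x" % (desc, addr))
        addrs.append(addr)
        table = desc & ADDR_MASK
    # After level 3, 'table' holds the output page address.
    return addrs, table | (va & 0xFFF)


def low32_matches(reported_ipa, desc_addr):
    """True if the page holding desc_addr, truncated to 32 bits, equals
    the page of the (truncated) IPA the host was given."""
    page = desc_addr & ~0xFFF
    return (page & 0xFFFFFFFF) == (reported_ipa & ~0xFFF)


# Descriptor values copied from the log above.
guest_mem = {
    0x5B06CC50: 0x5B150003,          # PGD
    0x5B150DB0: 0x43DEB0003,         # PMD
    0x43DEB0E70: 0x42000043BB72FDF,  # PTE
}

addrs, pa = walk(0x5B06CC40, 0xB6DCE3C4, guest_mem.__getitem__)
print([hex(a) for a in addrs], hex(pa))
print(low32_matches(0x3DEB0000, addrs[2]))
```

With the TTBR0, faulting VA and descriptors from the log, this
reproduces the logged PGD, PMD and PTE addresses, and only the PTE
address (0x43deb0e70) has a page whose low 32 bits match the reported
0x3deb0000. A real fix would still have to deal with the guest
changing its tables underneath the walk, as discussed above.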