Hello Marc,

On 06.10.24 12:28, Marc Zyngier wrote:
> On Sun, 06 Oct 2024 08:59:56 +0100,
> Ahmad Fatoum <a.fatoum@xxxxxxxxxxxxxx> wrote:
>> On 05.10.24 23:35, Marc Zyngier wrote:
>>> On Sat, 05 Oct 2024 19:38:23 +0100,
>>> Ahmad Fatoum <a.fatoum@xxxxxxxxxxxxxx> wrote:

>> One more question: This upgrading of DC IVAC to DC CIVAC is because
>> the code is run under virtualization, right?
>
> Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
> also a perfectly valid implementation of both IVAC and CVAC. And it
> isn't uncommon that CPUs implement everything the same way.

Makes sense. After all, software should expect cache lines to be
evicted at any time due to capacity misses anyway.

>> I think following fix on the barebox side may work:
>>
>> - Walk all pages about to be remapped
>> - Execute the AT instruction on the page's base address
>
> Why do you need AT if you are walking the PTs? If you walk, you
> already have access to the memory attributes. In general, AT can be
> slower than an actual walk.
>
> Or did you actually mean iterating over the VA range? Even in that
> case, AT can be a bad idea, as you are likely to iterate in page-size
> increments even if you have a block mapping. Walking the PTs tells you
> immediately how much a leaf is mapping (assuming you don't have any
> other tracking).

There's no other tracking, and I had hoped that using AT (which is
already used for the mmuinfo shell command) would be easier. I see now
that this would be needlessly inefficient, so I have implemented a
revised arch_remap_range[1] for barebox, which I just Cc'd you on.

[1]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@xxxxxxxxxxxxxx/T/#u
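In case a sketch is easier to discuss than the patch itself, the new
remap path roughly has the following shape (simplified; the helper
names below are made up for illustration and are not the actual barebox
API, the real code is in [1]):

#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE	4096

/* Walk the page tables for 'va'; return true if mapped and report how
 * much the leaf entry (page or block) covers and whether it is mapped
 * cacheable. */
bool pt_walk_leaf(unsigned long va, size_t *leaf_size, bool *cacheable);

/* DC CIVAC over [va, va + size), cache line by cache line, plus DSB. */
void dc_clean_inval_range(unsigned long va, size_t size);

/* Rewrite the leaf entries covering [va, va + size) with new attributes. */
void set_pte_range_attrs(unsigned long va, size_t size, unsigned long attrs);

static void remap_range_sketch(unsigned long va, size_t size,
			       unsigned long new_attrs)
{
	unsigned long addr = va, end = va + size;

	while (addr < end) {
		size_t leaf_size = PAGE_SIZE;	/* step size for unmapped holes */
		bool cacheable = false;

		/* No AT needed: the walk already yields the attributes and
		 * tells us how much the leaf maps, so a block mapping is
		 * handled in a single iteration. */
		if (pt_walk_leaf(addr, &leaf_size, &cacheable) && cacheable)
			dc_clean_inval_range(addr, leaf_size);

		addr += leaf_size;
	}

	/* Install the new attributes; the blanket cache invalidation that
	 * previously followed the remap is dropped. */
	set_pte_range_attrs(va, size, new_attrs);
}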
>> - Only if the page was previously mapped cacheable, clean + invalidate
>>   the cache
>> - Remove the current cache invalidation after remap
>>
>> Does that sound sensible?
>
> This looks reasonable (apart from the AT thingy).

I have two (hopefully final!) questions about behavior that still
differs with and without KVM:

1) Unaligned stack accesses crash under KVM:

start:
	/* This will be mapped at 0x40080000 */
	ldr	x0, =0x4007fff0
	mov	sp, x0
	stp	x0, x1, [sp]	// This is ok

	ldr	x0, =0x4007fff8
	mov	sp, x0
	stp	x0, x1, [sp]	// This crashes

I know that the stack should be 16-byte aligned, but why does it crash
only under KVM?

Context: The barebox Image used for Qemu has a Linux ARM64 "Image"
header, so it's loaded at an offset and grows the stack down into that
memory region until the FDT's /memory node can be decoded and a proper
stack is set up. A regression introduced earlier this year caused the
stack to grow down from an address that is not 16-byte aligned; this is
fixed in [2].

[2]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@xxxxxxxxxxxxxx/T/#ma381512862d22530382aff1662caadad2c8bc182

2) Using uncached memory for virtio queues with KVM enabled is
considerably slower. My guess is that these accesses keep getting
trapped, but what I wonder about is the performance discrepancy between
the big.LITTLE cores (measurements of barebox copying 1 MiB using
`time cp -v /dev/virtioblk0 /tmp`):

  KVM && !CACHED && 1x Cortex-A53:  0.137s
  KVM && !CACHED && 1x Cortex-A72: 54.030s
  KVM &&  CACHED && 1x Cortex-A53:  0.120s
  KVM &&  CACHED && 1x Cortex-A72:  0.035s

The A53s are CPUs 0-1 and the A72s are CPUs 2-5. Any idea why accessing
uncached memory from the big core is so much worse?

Thank you!
Ahmad

>
> Thanks,
>
> 	M.
>

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |