Hello Marc,

On 06.10.24 12:28, Marc Zyngier wrote:
> On Sun, 06 Oct 2024 08:59:56 +0100,
> Ahmad Fatoum <a.fatoum@xxxxxxxxxxxxxx> wrote:
>> On 05.10.24 23:35, Marc Zyngier wrote:
>>> On Sat, 05 Oct 2024 19:38:23 +0100,
>>> Ahmad Fatoum <a.fatoum@xxxxxxxxxxxxxx> wrote:

>> One more question: This upgrading of DC IVAC to DC CIVAC is because
>> the code is run under virtualization, right?
>
> Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
> also a perfectly valid implementation of both IVAC and CVAC. And it
> isn't uncommon that CPUs implement everything the same way.

Makes sense. After all, software should expect cache lines to be
evicted at any time due to capacity misses anyway.

>> I think following fix on the barebox side may work:
>>
>> - Walk all pages about to be remapped
>> - Execute the AT instruction on the page's base address
>
> Why do you need AT if you are walking the PTs? If you walk, you
> already have access to the memory attributes. In general, AT can be
> slower than an actual walk.
>
> Or did you actually mean iterating over the VA range? Even in that
> case, AT can be a bad idea, as you are likely to iterate in page-size
> increments even if you have a block mapping. Walking the PTs tells you
> immediately how much a leaf is mapping (assuming you don't have any
> other tracking).

There's no other tracking, and I had hoped that using AT (which is
already used for the mmuinfo shell command) would be easier. I see now
that this would be needlessly inefficient, so I have implemented a
revised arch_remap_range[1] for barebox, which I just Cc'd you on.

[1]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@xxxxxxxxxxxxxx/T/#u
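In case a sketch is easier to discuss than the patch itself, the new
remap path roughly has the following shape (simplified; the helper
names below are made up for illustration and are not the actual barebox
API, the real code is in [1]):

#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE	4096

/* Walk the page tables for 'va'; return true if mapped and report how
 * much the leaf entry (page or block) covers and whether it is mapped
 * cacheable. */
bool pt_walk_leaf(unsigned long va, size_t *leaf_size, bool *cacheable);

/* DC CIVAC over [va, va + size), cache line by cache line, plus DSB. */
void dc_clean_inval_range(unsigned long va, size_t size);

/* Rewrite the leaf entries covering [va, va + size) with new attributes. */
void set_pte_range_attrs(unsigned long va, size_t size, unsigned long attrs);

static void remap_range_sketch(unsigned long va, size_t size,
			       unsigned long new_attrs)
{
	unsigned long addr = va, end = va + size;

	while (addr < end) {
		size_t leaf_size = PAGE_SIZE;	/* step size for unmapped holes */
		bool cacheable = false;

		/* No AT needed: the walk already yields the attributes and
		 * tells us how much the leaf maps, so a block mapping is
		 * handled in a single iteration. */
		if (pt_walk_leaf(addr, &leaf_size, &cacheable) && cacheable)
			dc_clean_inval_range(addr, leaf_size);

		addr += leaf_size;
	}

	/* Install the new attributes; the blanket cache invalidation that
	 * previously followed the remap is dropped. */
	set_pte_range_attrs(va, size, new_attrs);
}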
>> - Only if the page was previously mapped cacheable, clean + invalidate
>>   the cache
>> - Remove the current cache invalidation after remap
>>
>> Does that sound sensible?
>
> This looks reasonable (apart from the AT thingy).

I have two (hopefully final!) questions about behavior that still
differs with and without KVM:

1) Unaligned stack accesses crash under KVM:

start:
	/* This will be mapped at 0x40080000 */
	ldr	x0, =0x4007fff0
	mov	sp, x0
	stp	x0, x1, [sp]	// This is ok

	ldr	x0, =0x4007fff8
	mov	sp, x0
	stp	x0, x1, [sp]	// This crashes

I know that the stack should be 16-byte aligned, but why does it crash
only under KVM?

Context: The barebox Image used for Qemu has a Linux ARM64 "Image"
header, so it's loaded at an offset and grows the stack down into that
memory region until the FDT's /memory node can be decoded and a proper
stack is set up. A regression introduced earlier this year caused the
stack to grow down from an address that is not 16-byte aligned; this is
fixed in [2].

[2]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@xxxxxxxxxxxxxx/T/#ma381512862d22530382aff1662caadad2c8bc182

2) Using uncached memory for virtio queues with KVM enabled is
considerably slower. My guess is that these accesses keep getting
trapped, but what I wonder about is the performance discrepancy between
the big.LITTLE cores (measurements of barebox copying 1 MiB using
`time cp -v /dev/virtioblk0 /tmp`):

  KVM && !CACHED && 1x Cortex-A53:  0.137s
  KVM && !CACHED && 1x Cortex-A72: 54.030s
  KVM &&  CACHED && 1x Cortex-A53:  0.120s
  KVM &&  CACHED && 1x Cortex-A72:  0.035s

The A53s are CPUs 0-1 and the A72s are CPUs 2-5. Any idea why accessing
uncached memory from the big core is so much worse?

Thank you!
Ahmad

>
> Thanks,
>
> 	M.
>

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |