On Wed, 09 Oct 2024 07:11:52 +0100,
Ahmad Fatoum <a.fatoum@xxxxxxxxxxxxxx> wrote:
>
> Hello Marc,
>
> On 06.10.24 12:28, Marc Zyngier wrote:
> > On Sun, 06 Oct 2024 08:59:56 +0100,
> > Ahmad Fatoum <a.fatoum@xxxxxxxxxxxxxx> wrote:
> >> On 05.10.24 23:35, Marc Zyngier wrote:
> >>> On Sat, 05 Oct 2024 19:38:23 +0100,
> >>> Ahmad Fatoum <a.fatoum@xxxxxxxxxxxxxx> wrote:
> >>
> >> One more question: This upgrading of DC IVAC to DC CIVAC is because
> >> the code is run under virtualization, right?
> >
> > Not necessarily. Virtualisation mandates the upgrade, but CIVAC is
> > also a perfectly valid implementation of both IVAC and CVAC. And it
> > isn't uncommon that CPUs implement everything the same way.
>
> Makes sense. After all, software should expect cache lines to
> be evicted at any time due to capacity misses anyway.
>
> >> I think the following fix on the barebox side may work:
> >>
> >> - Walk all pages about to be remapped
> >> - Execute the AT instruction on the page's base address
> >
> > Why do you need AT if you are walking the PTs? If you walk, you
> > already have access to the memory attributes. In general, AT can be
> > slower than an actual walk.
> >
> > Or did you actually mean iterating over the VA range? Even in that
> > case, AT can be a bad idea, as you are likely to iterate in page-size
> > increments even if you have a block mapping. Walking the PTs tells you
> > immediately how much a leaf is mapping (assuming you don't have any
> > other tracking).
>
> There's no other tracking and I hoped that using AT (which is already
> being used for the mmuinfo shell command) would be easier.

Heavy use of AT is debatable. It requires heavy synchronisation (the
ISB between AT and the PAR_EL1 readout), traps in some circumstances,
and is not guaranteed to report the exact content of the page tables
when it comes to memory attributes (it can instead report what the
implementation actually does). In turn, it *may* hit in the TLBs. Or
thrash them.
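As an illustration of why the walk wins over page-size increments (a
toy sketch, not actual barebox or KVM code; leaf_size(), walk_steps()
and toy_leaf_level() are names invented for this example):

```c
#include <stdint.h>

#define PAGE_SHIFT	12

/*
 * Toy model, 4K granule: a leaf at level 3 maps a 4KiB page, at
 * level 2 a 2MiB block, at level 1 a 1GiB block (each level up
 * covers 9 more VA bits).
 */
static inline uint64_t leaf_size(int level)
{
	return 1ULL << (PAGE_SHIFT + 9 * (3 - level));
}

/* Stand-in for a real PT walk: pretend everything is 2MiB blocks. */
static int toy_leaf_level(uint64_t va)
{
	(void)va;
	return 2;
}

/*
 * Number of iterations needed to cover [va, va + size) when each
 * step can skip the whole leaf reported by the walker, instead of
 * blindly advancing in PAGE_SIZE increments as an AT-per-page loop
 * would.
 */
uint64_t walk_steps(uint64_t va, uint64_t size,
		    int (*leaf_level)(uint64_t))
{
	uint64_t end = va + size, steps = 0;

	while (va < end) {
		uint64_t sz = leaf_size(leaf_level(va));

		va = (va & ~(sz - 1)) + sz;	/* skip the whole leaf */
		steps++;
	}
	return steps;
}
```

Covering 1GiB of 2MiB blocks this way takes 512 steps instead of the
262144 a page-by-page loop would need.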
But the real use of AT is when you do not have a virtual mapping for
your page tables, making them difficult to walk in SW. For example,
KVM uses AT to walk the guest's stage-1 PTs on stage-2 fault, because
we don't have the guest's memory mapped at EL2 in general.

Maybe this doesn't matter in your case, as this doesn't happen often
enough to justify a dedicated SW walker.

> I see now that it would be too suboptimal to do it this way and have
> implemented a revised arch_remap_range[1] for barebox, which I just
> Cc'd you on.

I'll try to have a look.

> [1]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@xxxxxxxxxxxxxx/T/#u
>
> >> - Only if the page was previously mapped cacheable, clean + invalidate
> >>   the cache
> >> - Remove the current cache invalidation after remap
> >>
> >> Does that sound sensible?
> >
> > This looks reasonable (apart from the AT thingy).
>
> I have two (hopefully the last!) questions about remaining differing
> behavior with KVM and without:
>
> 1) Unaligned stack accesses crash in KVM:
>
> start:	/* This will be mapped at 0x40080000 */
> 	ldr	x0, =0x4007fff0
> 	mov	sp, x0
> 	stp	x0, x1, [sp]	// This is ok
>
> 	ldr	x0, =0x4007fff8
> 	mov	sp, x0
> 	stp	x0, x1, [sp]	// This crashes
>
> I know that the stack should be 16 byte aligned, but why does it crash
> only under KVM?

KVM shouldn't be involved in any of this. What is SCTLR_EL1.SA set to?
KVM resets it to 1, which is legal ("On a Warm reset, this field
resets to an architecturally UNKNOWN value."). You shouldn't rely on
it being 0 if you are doing unaligned accesses to the stack pointer.

> Context: The barebox Image used for Qemu has a Linux ARM64 "Image" header,
> so it's loaded at an offset and grows the stack down into this memory region
> until the FDT's /memory could be decoded and a proper stack is set up.
>
> A regression introduced earlier this year caused the stack to grow down
> from a non-16-byte-aligned address, which is fixed in [2].
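A minimal way to state the rule (a hypothetical helper, not a real
API; it just encodes "SP as the base register must be 16-byte aligned
when SCTLR_EL1.SA is set"):

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * With SCTLR_EL1.SA == 1, a load/store that uses SP as the base
 * register faults unless SP is 16-byte aligned -- even when the
 * access itself is naturally aligned, as with the stp above.
 * With SA == 0, no SP alignment check is performed.
 */
bool sp_access_faults(uint64_t sp, bool sa_bit)
{
	return sa_bit && (sp & 0xf) != 0;
}
```

With SA=1 (what KVM hands you), 0x4007fff0 is fine and 0x4007fff8
faults; with SA=0 both happen to work, which is why the crash only
showed up under KVM.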
>
> [2]: https://lore.kernel.org/barebox/20241009060511.4121157-5-a.fatoum@xxxxxxxxxxxxxx/T/#ma381512862d22530382aff1662caadad2c8bc182
>
> 2) Using uncached memory for virtio queues with KVM enabled is considerably
> slower. My guess is that these accesses keep getting trapped, but what I wonder
> about is the performance discrepancy between the big.LITTLE cores
> (measurement of barebox copying 1MiB using `time cp -v /dev/virtioblk0 /tmp`):
>
> KVM && !CACHED && 1x Cortex-A53:  0.137s
> KVM && !CACHED && 1x Cortex-A72: 54.030s
> KVM &&  CACHED && 1x Cortex-A53:  0.120s
> KVM &&  CACHED && 1x Cortex-A72:  0.035s
>
> The A53s are CPUs 0-1 and the A72s are 2-5.
>
> Any idea why accessing uncached memory from the big core is so much worse?

KVM shouldn't trap these accesses at all, except for the initial
faulting-in of the memory (once per page). But to see such a
difference, the problem is likely somewhere else.

As you seem to be using QEMU as the VMM, you are likely to fall into
the "Mismatched attributes" sink-hole (see B2.16 "Mismatched memory
attributes" in the ARM ARM). The gist of the problem is that QEMU is
using a cacheable mapping for all of the guest memory, while you are
using an NC mapping. Are these two guaranteed to be coherent? Not at
all.

The abysmal level of performance is likely to be caused by the rate of
misses in the cache on the QEMU side. It simply doesn't observe the
guest's writes until it misses (for fun, giggles and self-promotion,
see [1]).

As an experiment, you could run something that actively puts a lot of
cache pressure in parallel to your guest. You should see the
performance increase significantly.

Now, this is very much a case of "don't do that". Virtio is, by
definition, a cache-coherent device. So either you use it with
attributes that maintain coherency (and everything works), or you add
cache maintenance to your virtio implementation (good luck getting the
ordering right).

HTH,

	M.
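P.S.: the mismatch can be mimicked with a toy write-back cache model
(purely illustrative C, no relation to the real QEMU/KVM code): the
"guest" writes DRAM directly through its NC mapping, while the "host"
keeps hitting a stale cached copy until the line gets evicted.

```c
#include <stdint.h>
#include <stdbool.h>

/* One cache line's worth of state on the "host" side. */
struct toy_cache {
	bool valid;
	uint32_t data;	/* cached copy of the one DRAM word */
};

static uint32_t dram;	/* the shared memory location */

/* Guest side: an NC write goes straight to DRAM, bypassing caches. */
void guest_write_nc(uint32_t v)
{
	dram = v;
}

/* Host side: cacheable read; allocates on miss, then keeps hitting. */
uint32_t host_read_cached(struct toy_cache *c)
{
	if (!c->valid) {
		c->data = dram;	/* miss: fetch from DRAM */
		c->valid = true;
	}
	return c->data;		/* hit: possibly stale! */
}

/*
 * Cache pressure (or explicit invalidation) drops the line; only the
 * next read after that observes the guest's write -- which is why
 * adding cache pressure paradoxically *improves* the throughput.
 */
void evict(struct toy_cache *c)
{
	c->valid = false;
}
```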
[1] http://events17.linuxfoundation.org/sites/events/files/slides/slides_10.pdf

--
Without deviation from the norm, progress is not possible.