D Scott Phillips <scott@xxxxxxxxxxxxxxxxxxxxxx> writes: > Alex Bennée <alex.bennee@xxxxxxxxxx> writes: > >> From: D Scott Phillips <scott@xxxxxxxxxxxxxxxxxxxxxx> >> >> A later patch will hand out Device memory in some cases to code >> which expects a Normal memory type, as an errata workaround. >> Unaligned accesses to Device memory will fault though, so here we >> add a fixup handler to emulate faulting accesses, at a performance >> penalty. >> >> Many of the instructions in the Loads and Stores group are supported, >> but these groups are not handled here: >> >> * Advanced SIMD load/store multiple structures >> * Advanced SIMD load/store multiple structures (post-indexed) >> * Advanced SIMD load/store single structure >> * Advanced SIMD load/store single structure (post-indexed) > > Hi Alex, I'm keeping my version of these patches here: > > https://github.com/AmpereComputing/linux-ampere-altra-erratum-pcie-65 > > It looks like the difference to the version you've harvested is that > I've since added handling for these load/store types: > > Advanced SIMD load/store multiple structure > Advanced SIMD load/store multiple structure (post-indexed) > Advanced SIMD load/store single structure > Advanced SIMD load/store single structure (post-indexed) Are you going to roll in the fixes I added or should I re-spin with your additional handling? > I've never sent these patches because in my opinion there's too much > complexity to maintain upstream for this workaround, though now they're > here so we can have that conversation. It's not totally out of the scope of the kernel to do instruction decoding to workaround things that can't be executed directly. There is already a bunch of instruction decode logic to handle stepping over instruction probes. The 32 bit ARM code even has a complete user-space alignment fixup handler driver by procfs. It might make sense to share some of the logic although of course the probe handler and the misaligned handler are targeting different sets of instructions. The core kernel code also has a bunch of unaligned load/store helper functions that could probably be re-used as well to further reduce the code delta. > Finally, I think a better approach overall would have been to have > device memory mapping in the stage 2 page table, then booting with pkvm > would have this workaround for both the host and guests. I don't think > that approach changes the fact that there's too much complexity here. That would be a cleaner solution for pKVM although we would like to see it ported to Xen as well. There is a tension there between having a generic fixup library and something tightly integrated into a given kernel/hypervisor. I don't think instruction decoding is fundamentally too complicated for a kernel - although I may be biased as a QEMU developer ;-). However if it is to be taken forward I think it should probably come with an exhaustive test case to exercise the decoder and fixup handler. The fixes so far were found by repeatedly iterating on vkmark and seeing were things failed and fixing when they came up. I will leave it to the kernel maintainers to decide if this is an acceptable workaround or not. I do have two remaining questions: - Why do AMD GPUs trigger this and not nVidia? Do they just have their own fixup code hidden in the binary blob? Would Nouveau suffer similar problems? - Will Arm SoC PCI implementations continue to see these edge cases that don't affect the x86 world? This is not the first Arm machine with PCI to see issues. In fact of the 3 machines I have (SynQuacer, MachiatoBin and AVA) all have some measure of PCI "fun" to deal with. Thanks, -- Alex Bennée Virtualisation Tech Lead @ Linaro