On Sat, May 15, 2021 at 5:36 PM John Paul Adrian Glaubitz <glaubitz@xxxxxxxxxxxxxxxxxxx> wrote: > On 5/14/21 2:22 PM, Arnd Bergmann wrote: > >> My Renesas SH4-Boards actually run an sh4a-Kernel, not an sh4-Kernel: > >> > >> root@tirpitz:~> uname -a > >> Linux tirpitz 5.11.0-rc4-00012-g10c03c5bf422 #161 PREEMPT Mon Jan 18 21:10:17 CET 2021 sh4a GNU/Linux > >> root@tirpitz:~> > >> > >> So, if this change reduces performance on sh4a, I would rather not merge it. > > > > It only makes a difference in very specific scenarios in which unaligned > > accesses are done in a fast path, e.g. when forwarding network packet > > at a high rate on a big-endian kernel (little-endian kernels wouldn't run into > > this on IP headers). If you have a use case for this machine on which the > > you can show a performance regression, I can add a patch on top to put > > the optimized sh4a get_unaligned_le32() back. Dropping this patch > > altogether would make the series much more complex because most of > > the associated code gets removed in the end. > > Hmm, okay. But why does code which sits below arch/sh have to be removed anyway? > > I don't fully understand why it poses any maintenance burden/ What I'm removing is the part that lets architectures override the generic version. > > As I mentioned, supporting "movua" in the compiler likely has a much > > larger impact on performance, as it would also help in user space, and > > it should improve the networking case on little-endian kernels by replacing > > the four separate byte loads/shift pairs with a movua plus a byteswap. > > The problem is that - at least in Debian - we use the sh4 baseline while the kernel > supports both sh4 and sh4a, so we can't use any of these instructions in userland at > the moment. I tried building an sh7785lcr_defconfig with and without the patch, and found that the only affected files are: - in-kernel nfs client - crc32c/sha1/sha256 hash functions - device probing for libata, scsi-core, scsi-disk, hid, r8168 (should not matter after boot) - msdos partition parsing Any nfs client performance difference is probably not even measurable even at gigabit ethernet speed. I see that the hash functions are notably different, but I don't know if the output from the new generic code is actually better or worse than the original. If you do think this is important, please try the version from https://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git unaligned-sh4a against the version without the last change in that series. If you can find a relevant test case that exercises it, you may want to add a custom implementation of the hash functions as well. Arnd