From: Vladislav K. Valtchev > Sent: 04 May 2021 19:08 > > Hi everyone, > > I noticed that in the Linux kernel we have a define called > CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS that's used in a fair amount of places > with the following purpose: when it's set, unaligned access is supported by the > CPU so we can do it directly, otherwise fall-back to some logic where a byte at > the time is read/written. For example, check the implementation of > do_strncpy_from_user(): > https://elixir.bootlin.com/linux/latest/source/lib/strncpy_from_user.c#L15 > > > That's nice and seems to work today as expected, but there's one problem: > unaligned access is UB according to the ISO C standard, ... The C standard isn't relevant. ... > > Mitigations > ------------ > In my understanding, the simplest way to force GCC to emit a single MOV > instruction with unaligned access, without risking any kind of UB, is to use > __builtin_memcpy(). It works great, but it requires fixing the code in many > places. Also, maybe using get_unaligned()/put_unaligned() is the right thing to > do? The problem is that the put_unaligned_* inline functions don't use > __builtin_memcpy() and are defined like: > > static __always_inline void put_unaligned_le32(u32 val, void *p) > { > *((__le32 *)p) = cpu_to_le32(val); > } > > So, still UB. To make the compiler happy, maybe we should make them use > __builtin_memcpy()? That doesn't help you at all. If the compiler is given any hint that the address is aligned it will use potentially misaligned memory accesses. So if the above was: *(int __attribute__((packed)) *)p = val; and the caller had: int *p = maybe_unaligned(); put_unaligned_le32(0, p); Then gcc will generate a 32bit write even on sparc. __builtin_memcpy() will expand to exactly the same 32bit write. This is nothing new - at least 20 years. Basically, the C standard only allows you to cast a pointer to 'void *' and then back to its original type. GCC makes use of this for various optimisations and will track the alignment through (void *) casts. I believe that, even for the simd instructions, gcc will pick the 'misalign supporting' version if the data is marked with the relevant alignment. Oh - and the loop vectorisation code is a pile of crap. You almost never want it - look at the code it generates for a loop you go round a few times. It you are doing 10000 iteractions you'll need to unroll it yourself to get decent performance. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)