On Wed, May 18, 2016 at 04:10:37PM +0100, David Howells wrote:
> Here's a set of patches to provide kernel atomics and bitops implemented
> with ISO C++11 atomic intrinsics.  The second part of the set makes the
> x86 arch use the implementation.
> 
> Note that the x86 patches are very rough.  It would need to be made
> compile-time conditional in some way and the old code can't really be
> thrown away that easily - but it's a good way to make sure I'm not using
> that code any more.

Good to see this!!!

[ . . . ]

> There are some disadvantages also:
> 
>  (1) It's not available in gcc before gcc-4.7 and there will be some
>      seriously suboptimal code production before gcc-7.1.

Probably need to keep the old stuff around for quite some time.  But it
would be good to test the C11 stuff out in any case.

>  (2) The compiler might misoptimise - for example, it might generate a
>      CMPXCHG loop rather than a BTR instruction to clear a bit.

Which is one reason it would be very good to test it out.  ;-)

>  (3) The C++11 memory model permits atomic instructions to be merged if
>      appropriate - for example, two adjacent __atomic_read() calls might
>      get merged if the return value of the first isn't examined.  Making
>      the pointers volatile alleviates this.  Note that gcc doesn't do
>      this yet.

I could imagine a few situations where merging would be a good thing.
But I can also imagine a great number where it would be very bad.

>  (4) The C++11 memory barriers are, in general, weaker than the kernel's
>      memory barriers are defined to be.  Whether this actually matters is
>      arch dependent.  Further, the C++11 memory barriers are
>      acquire/release, but some arches actually have load/store instead -
>      which may be a problem.

C++11 does have memory_order_seq_cst barriers via atomic_thread_fence(),
and on many architectures they are strong enough for the Linux kernel.
However, the standard does not guarantee this.
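As a minimal user-space sketch of the two points above (not taken from the
patch set - the function names full_fence() and load_twice() are purely
illustrative), here is what the seq_cst fence and the volatile workaround
for load merging look like with the GCC __atomic builtins:

```c
/*
 * Illustrative sketch, not the kernel's API.
 */

/*
 * A full barrier via the seq_cst fence.  On x86, ARM, and PowerPC the
 * generated code is typically as strong as the kernel's smp_mb(), but
 * the C++11 standard itself does not guarantee that.
 */
static inline void full_fence(void)
{
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/*
 * Point (3): the C++11 memory model would permit a compiler to merge
 * the two relaxed loads below into a single load when the first result
 * is not otherwise examined.  Loading through a volatile-qualified
 * pointer forestalls the merge (gcc does not perform it today anyway).
 */
static inline int load_twice(int *p)
{
	int a = __atomic_load_n((volatile int *)p, __ATOMIC_RELAXED);
	int b = __atomic_load_n((volatile int *)p, __ATOMIC_RELAXED);

	return a + b;
}
```

The volatile cast matters only for the merging question; it does not add
any ordering beyond the __ATOMIC_RELAXED accesses themselves.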
(But it is difficult on x86, ARM, and PowerPC to produce C++11 semantics
without also producing Linux-kernel semantics.)

C++11 has acquire/release barriers as well as acquire loads and release
stores.  Some of this will need case-by-case analysis.  Here is a (dated)
attempt:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0124r1.html

>  (5) On x86, the compiler doesn't tell you where the LOCK prefixes are so
>      they cannot be suppressed if only one CPU is online.
> 
> 
> Things to be considered:
> 
>  (1) We could weaken the kernel memory model for the benefit of arches
>      that have instructions that employ explicit acquire/release
>      barriers - but that may cause data races to occur based on
>      assumptions we've already made.  Note, however, that powerpc
>      already seems to have a weaker memory model.

PowerPC has relatively weak ordering for lock and unlock, and also for
load-acquire and store-release.  However, last I checked, it had full
ordering for value-returning read-modify-write atomic operations.  As you
say, C11 could potentially produce weaker results.  However, it should be
possible to fix this where necessary.

>  (2) There are three sets of atomics (atomic_t, atomic64_t and
>      atomic_long_t).  I can either put each in a separate file all nicely
>      laid out (see patch 3) or I can make a template header (see patch 4)
>      and use #defines with it to make all three atomics from one piece of
>      code.  Which is best?  The first is definitely easier to read, but
>      the second might be easier to maintain.

The template-header approach is more compatible with the ftrace
macros.  ;-)

>  (3) I've added cmpxchg_return() and try_cmpxchg() to replace cmpxchg().
>      I've also provided atomicX_t variants of these.  These return the
>      boolean return value from the __atomic_compare_exchange_n() function
>      (which is carried in the Z flag on x86).  Should this be rolled out
>      even without the ISO atomic implementation?
> 
>  (4) The current x86_64 bitops (set_bit() and co.)
>      are technically broken.  The set_bit() function, for example, takes
>      a 64-bit signed bit number but implements this with BTSL, presumably
>      because it's a shorter instruction.
> 
>      The BTS instruction's bit number operand, however, is sized
>      according to the memory operand, so the upper 32 bits of the bit
>      number are discarded by BTSL.  We should really be using BTSQ to
>      implement this correctly (and gcc does just that).  In practice,
>      though, it would seem unlikely that this will ever be a problem as I
>      think we're unlikely to have a bitset with more than ~2 billion bits
>      in it within the kernel (it would be >256MiB in size).
> 
>      Patch 9 casts the pointers internally in the bitops functions to
>      persuade the kernel to use 32-bit bitops - but this only really
>      matters on x86.  Should set_bit() and co. take int rather than long
>      bit number arguments to make the point?

No real opinion on #3 and #4.

> I've opened a number of gcc bugzillas to improve the code generated by
> the atomics:
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49244
> 
>      __atomic_fetch_{and,or,xor}() don't generate locked BTR/BTS/BTC on
>      x86 and __atomic_load() doesn't generate TEST or BT.  There is a
>      patch for this, but it leaves some cases not fully optimised.
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70821
> 
>      __atomic_fetch_{add,sub}() generates XADD rather than DECL when
>      subtracting 1 on x86.  There is a patch for this.
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70825
> 
>      __atomic_compare_exchange_n() accesses the stack unnecessarily,
>      leading to extraneous stores being added when everything could be
>      done entirely within registers (on x86, powerpc64, aarch64).
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70973
> 
>      Can the __atomic*() ops on x86 generate a list of LOCK prefixes?
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71153
> 
>      aarch64 __atomic_fetch_and() generates a double inversion for the
>      LDSET instructions.
>      There is a patch for this.
> 
>  (*) https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71162
> 
>      powerpc64 should probably emit BNE- not BNE to retry the STDCX.
> 
> 
> The patches can be found here also:
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=iso-atomic
> 
> I have fixed up an x86_64 cross-compiler with the patches that I've been
> given for the above and have booted the resulting kernel.

Nice!!!

							Thanx, Paul

> David
> ---
> David Howells (15):
>       cmpxchg_local() is not signed-value safe, so fix generic atomics
>       tty: ldsem_cmpxchg() should use cmpxchg() not atomic_long_cmpxchg()
>       Provide atomic_t functions implemented with ISO-C++11 atomics
>       Convert 32-bit ISO atomics into a template
>       Provide atomic64_t and atomic_long_t using ISO atomics
>       Provide 16-bit ISO atomics
>       Provide cmpxchg(), xchg(), xadd() and __add() based on ISO C++11 intrinsics
>       Provide an implementation of bitops using C++11 atomics
>       Make the ISO bitops use 32-bit values internally
>       x86: Use ISO atomics
>       x86: Use ISO bitops
>       x86: Use ISO xchg(), cmpxchg() and friends
>       x86: Improve spinlocks using ISO C++11 intrinsic atomics
>       x86: Make the mutex implementation use ISO atomic ops
>       x86: Fix misc cmpxchg() and atomic_cmpxchg() calls to use try/return variants
> 
> 
>  arch/x86/events/amd/core.c                |    6 
>  arch/x86/events/amd/uncore.c              |    4 
>  arch/x86/include/asm/atomic.h             |  224 -----------
>  arch/x86/include/asm/bitops.h             |  143 -------
>  arch/x86/include/asm/cmpxchg.h            |   99 -----
>  arch/x86/include/asm/cmpxchg_32.h         |    3 
>  arch/x86/include/asm/cmpxchg_64.h         |    6 
>  arch/x86/include/asm/mutex.h              |    6 
>  arch/x86/include/asm/mutex_iso.h          |   73 ++++
>  arch/x86/include/asm/qspinlock.h          |    2 
>  arch/x86/include/asm/spinlock.h           |   18 +
>  arch/x86/kernel/acpi/boot.c               |   12 -
>  arch/x86/kernel/apic/apic.c               |    3 
>  arch/x86/kernel/cpu/mcheck/mce.c          |    7 
>  arch/x86/kernel/kvm.c                     |    5 
>  arch/x86/kernel/smp.c                     |    2 
>  arch/x86/kvm/mmu.c                        |    2 
>  arch/x86/kvm/paging_tmpl.h                |   11 -
>  arch/x86/kvm/vmx.c                        |   21 +
>  arch/x86/kvm/x86.c                        |   19 -
>  arch/x86/mm/pat.c                         |    2 
>  arch/x86/xen/p2m.c                        |    3 
>  arch/x86/xen/spinlock.c                   |    6 
>  drivers/tty/tty_ldsem.c                   |    2 
>  include/asm-generic/atomic.h              |   17 +
>  include/asm-generic/iso-atomic-long.h     |   32 ++
>  include/asm-generic/iso-atomic-template.h |  572 +++++++++++++++++++++++++++++
>  include/asm-generic/iso-atomic.h          |   28 +
>  include/asm-generic/iso-atomic16.h        |   27 +
>  include/asm-generic/iso-atomic64.h        |   28 +
>  include/asm-generic/iso-bitops.h          |  188 ++++++++++
>  include/asm-generic/iso-cmpxchg.h         |  180 +++++++++
>  include/linux/atomic.h                    |   26 +
>  33 files changed, 1225 insertions(+), 552 deletions(-)
>  create mode 100644 arch/x86/include/asm/mutex_iso.h
>  create mode 100644 include/asm-generic/iso-atomic-long.h
>  create mode 100644 include/asm-generic/iso-atomic-template.h
>  create mode 100644 include/asm-generic/iso-atomic.h
>  create mode 100644 include/asm-generic/iso-atomic16.h
>  create mode 100644 include/asm-generic/iso-atomic64.h
>  create mode 100644 include/asm-generic/iso-bitops.h
>  create mode 100644 include/asm-generic/iso-cmpxchg.h
--
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html