Summary of discussions on BPF load-acquire instruction ordering

Hello!

This is an attempted summary of the discussions on the memory-ordering
properties of the upcoming BPF load-acquire instruction.  Please reply
to the group calling out any errors, omissions, or other commentary.

TL;DR:  I am sticking with my position that the BPF load-acquire
instruction should have the weaker RCpc semantics (like ldapr, not ldar).
That said, it is entirely reasonable for ARM64 JITs to use the stronger
RCsc ldar instruction to implement the BPF load-acquire instruction.

Background and details:

o	The BPF load-acquire instruction might have RCsc ordering (like
	the ARM64 ldar instruction) or RCpc ordering (like the ARM64
	ldapr instruction).  One key difference between ldar and ldapr
	is that ldar is ordered against prior stlr instructions,
	but ldapr is not.

	Note well that there is only the one ARM64 store-release
	instruction, stlr.  This instruction pairs equally well with
	ldar and ldapr.
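
As a rough illustration of that difference, here is a toy model of the
store-buffering litmus test.  It is a sketch under loud assumptions:
it models only the one rule above (whether an acquire load may be
satisfied before a prior release store from the same CPU), it is not
the ARM64 memory model, and the names sb_both_zero_reachable() and
both_loads_see_zero() are made up for this example.

```c
#include <stdbool.h>

/*
 * Toy store-buffering test (illustration only):
 *   CPU 0: store-release x = 1;  r0 = load-acquire y;
 *   CPU 1: store-release y = 1;  r1 = load-acquire x;
 */
enum { W0, R0, W1, R1, NOPS };

/* Execute the four operations in one global order. */
static bool both_loads_see_zero(const int order[NOPS])
{
	int x = 0, y = 0, r0 = -1, r1 = -1;

	for (int i = 0; i < NOPS; i++) {
		switch (order[i]) {
		case W0: x = 1; break;
		case R0: r0 = y; break;
		case W1: y = 1; break;
		case R1: r1 = x; break;
		}
	}
	return r0 == 0 && r1 == 0;
}

/*
 * Return true if some permitted order gives r0 == 0 && r1 == 0.
 * rcsc == true:  each load stays after its CPU's prior store (ldar-like).
 * rcsc == false: a load may pass its CPU's prior store (ldapr-like).
 */
bool sb_both_zero_reachable(bool rcsc)
{
	for (int code = 0; code < 256; code++) {
		int order[NOPS], pos[NOPS] = { 0 }, seen = 0;

		/* Decode four base-4 digits into a candidate order. */
		for (int i = 0; i < NOPS; i++) {
			order[i] = (code >> (2 * i)) & 3;
			pos[order[i]] = i;
			seen |= 1 << order[i];
		}
		if (seen != 0xF)
			continue;	/* Not a permutation of the four ops. */
		if (rcsc && (pos[R0] < pos[W0] || pos[R1] < pos[W1]))
			continue;	/* Load passed its prior store. */
		if (both_loads_see_zero(order))
			return true;
	}
	return false;
}
```

In this toy model, the both-loads-see-zero outcome is reachable only
when rcsc is false, which is the ldapr-like case.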

o	The stronger semantics for the ldar instruction were added on
	the advice of Herb Sutter of Microsoft.  The weaker semantics
	for ldapr were added on the advice of other Microsoft employees
	who actually write performance-critical concurrent code,
	but for mid-range CPUs that do some reordering but not so much
	speculation.  (High-end CPUs that do serious reordering and also
	serious speculation don't usually care much about the difference
	in ordering semantics, outside of benchmarks specially crafted
	to demonstrate the difference.)

	(Perhaps of historical interest, this mirrors the advice for
	C's and C++'s non-SC atomics.  Herb passionately advocated
	for atomics to be only SC, but an even more passionate group
	elsewhere within Microsoft reached out privately to register
	support for non-SC atomics in the strongest terms possible.
	So C and C++ had non-SC atomics from the get-go.)

o	The compilers do not guarantee RCsc, only RCpc.  Attempts to
	provide stronger (and thus perhaps more expensive) RCsc
	semantics for the BPF load-acquire instruction can therefore
	be defeated by perfectly legal and reasonable compiler
	memory-reference-reordering optimizations.
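
For concreteness, here is a minimal C11 sketch of the pattern in
question.  The variable and function names are hypothetical.  In the
first function, the language permits the acquire load of y to be
ordered before the release store to x, so no choice of load
instruction can restore that ordering after the fact.  Source code
that needs the store-then-load ordering must ask for it, for example
with memory_order_seq_cst as in the second function.

```c
#include <stdatomic.h>

atomic_int x, y;	/* Hypothetical shared variables, both initially 0. */

/* Release store, then acquire load of a different variable. */
int publish_then_check(void)
{
	atomic_store_explicit(&x, 1, memory_order_release);
	/* C11 allows this load to be ordered before the store above. */
	return atomic_load_explicit(&y, memory_order_acquire);
}

/* Same pattern, but with the store-to-load ordering requested. */
int publish_then_check_sc(void)
{
	atomic_store_explicit(&x, 1, memory_order_seq_cst);
	return atomic_load_explicit(&y, memory_order_seq_cst);
}
```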

o	The ARM64 ldar instruction was available first.  This means that
	any ARM64 JIT that emits ldapr for BPF load-acquire instructions
	must be prepared to emit ldar on older ARM64 hardware that does
	not support ldapr.

o	If BPF is to JIT efficiently to PowerPC, BPF's load-acquire
	instruction must be implementable as ld;lwsync.  This sequence has
	RCpc memory-ordering semantics similar to those of ARM64's ldapr
	instruction.
	In contrast, an RCsc load-acquire instruction (like ARM64 ldar)
	would require sync;ld;lwsync.  The difference is that the sync
	instruction has global scope (its action covers the full system),
	while lwsync can be handled within the confines of the CPU's
	local store buffer.  The sync instruction is thus considerably
	more expensive than is the lwsync instruction.
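
The instruction choices described so far can be sketched as a small
lowering helper.  This is an illustration only: the enum, the function
name lower_load_acquire(), and its parameters are hypothetical and do
not correspond to any existing JIT code.  It returns the instruction
sequence as text rather than emitting machine code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum arch { ARCH_ARM64, ARCH_PPC64 };	/* Hypothetical arch selector. */

/*
 * Hypothetical helper: pick the sequence for a load-acquire.
 * want_rcsc: the instruction set is defined to require RCsc ordering.
 * has_ldapr: the ARM64 CPU supports ldapr (ignored on PowerPC).
 */
const char *lower_load_acquire(enum arch arch, bool want_rcsc, bool has_ldapr)
{
	switch (arch) {
	case ARCH_ARM64:
		/* ldar is the stronger instruction, so it covers RCpc too. */
		if (want_rcsc || !has_ldapr)
			return "ldar";
		return "ldapr";
	case ARCH_PPC64:
		/* RCsc costs a leading full sync on PowerPC. */
		return want_rcsc ? "sync;ld;lwsync" : "ld;lwsync";
	}
	return NULL;
}
```

With RCpc semantics, PowerPC gets the cheaper ld;lwsync, and ARM64 is
free to use either ldapr or ldar.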

o	The fact that the Linux kernel runs reliably on PowerPC when using
	"ld;lwsync" for smp_load_acquire() provides evidence that ARM64
	could safely use ldapr for smp_load_acquire() in common code.
	However:

	o	The fact that older hardware does not support ldapr
		and the fact that distros strongly prefer a single
		Linux-kernel image per architecture means that use
		of ldapr for smp_load_acquire() would likely require
		yet more boot-time binary rewriting, and might restrict
		migration of guest OSes from one ARM64 hardware system
		to another.

	o	There might well be uses of smp_load_acquire() in ARM64
		architecture-specific code that need to emit the stronger
		ldar instruction.

	o	There are way more ARM64 systems than PowerPC systems.
		It is entirely possible that PowerPC is just getting lucky
		with its use of "ld;lwsync", and that luck would apply to
		BPF programs just as surely as to core Linux-kernel code.
		With fewer systems in the field, the risk of observing
		failures due to the weaker RCpc instruction sequence is
		lower on PowerPC than on ARM64.

o	All this leads me to stick with my position called out in TL;DR
	above, namely that the BPF load-acquire instruction should have
	the weaker RCpc semantics (like ldapr, not ldar).  That said,
	it is entirely reasonable for ARM64 JITs to use the stronger RCsc
	ldar instruction to implement the BPF load-acquire instruction.

Thoughts?

						Thanx, Paul



