On 2/19/25 04:46, Maciej W. Rozycki wrote:
Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
code generation options slated to land in GCC 15,
Pointer? I can't find it on the gcc-patches list.
implement emulation
for unaligned LDx_L and STx_C operations for the unlikely case where an
alignment violation has resulted from improperly written code and caused
these operations to trap in the atomic RMW memory access sequences emitted
to provide data consistency for non-BWX byte and word write operations and
for writes to unaligned data objects causing partial memory updates.
The principle of operation is as follows:
1. A trapping unaligned LDx_L operation results in the pair of adjacent
aligned whole data quantities spanned being read and stored for reference
by a subsequent STx_C operation, along with the width of the data
accessed, its virtual address, and the referring task, or NULL for the
kernel. The validity marker is set.
2. Regular memory load operations are used to retrieve the data, because
no atomicity is needed at this stage; matching the width accessed, either
LDQ_U or LDL is used, the latter even though it requires extra
operations, to avoid the complication of an unaligned longword located
entirely within an aligned quadword.
3. Data is masked, shifted and merged appropriately and returned in the
intended register as the result of the trapping LDx_L instruction.
4. A trapping unaligned STx_C operation results in the validity marker
being checked for being true, and in the width of the data accessed, the
virtual address, and the referring task or the kernel being checked for a
match. The pair of whole data quantities previously read by LDx_L
emulation is retrieved and the validity marker is cleared.
5. If the checks succeeded, then in an atomic loop the location of the
first whole data quantity is reread and the data retrieved is compared
with the value previously obtained. If there is no match, then the loop
is aborted, 0 is returned in the intended register as the result of the
trapping STx_C instruction, and emulation completes. Otherwise new data
obtained from the source operand of STx_C is combined with the data
retrieved, replacing the intended part by byte insertion, and an atomic
write of this new data is attempted. If it fails, the loop continues
from the beginning. Otherwise processing proceeds to the next step.
6. The same operations are performed on the second whole data quantity.
7. At this point both whole data quantities have been written, which
ensures that no intervening third-party write has changed them, at the
point of each write, from the values held at the previous LDx_L.
Therefore 1 is returned in the intended register as the result of the
trapping STx_C instruction.
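For illustration only, the bookkeeping and the STx_C side of this scheme
might be modelled in C roughly as below. All of the names here
(llsc_state, emulate_stx_c and so on) are made up for the sketch, a strong
64-bit compare-and-swap stands in for the per-quadword LDx_L/STx_C retry
loop of steps 5 and 6, and plain pointer accesses stand in for the
kernel's user-access machinery:

#include <stdbool.h>
#include <stdint.h>

/* State recorded by LDx_L emulation (step 1). */
struct llsc_state {
	bool	 valid;		/* validity marker */
	int	 width;		/* access width in bytes */
	uint64_t vaddr;		/* unaligned virtual address of the access */
	void	*task;		/* referring task, or NULL for the kernel */
	uint64_t lo, hi;	/* the two aligned quadwords read by LDx_L */
};

/* Steps 4-7: returns the value to place in the intended STx_C register. */
static int emulate_stx_c(struct llsc_state *st, uint64_t vaddr, int width,
			 void *task, uint64_t data)
{
	uint64_t *p = (uint64_t *)(vaddr & ~7ULL);
	unsigned int off = vaddr & 7;
	union { uint64_t q[2]; uint8_t b[16]; } old, new;
	int i;

	/* Step 4: the STx_C must pair with the recorded LDx_L. */
	if (!st->valid || st->width != width || st->vaddr != vaddr ||
	    st->task != task)
		return 0;
	st->valid = false;

	old.q[0] = st->lo;
	old.q[1] = st->hi;
	new = old;

	/* Merge the store data into place, byte by byte (little endian). */
	for (i = 0; i < width; i++)
		new.b[off + i] = data >> (8 * i);

	/*
	 * Steps 5-6: update each quadword only if it still holds the value
	 * seen at LDx_L time; any intervening third-party write fails the
	 * STx_C.
	 */
	if (!__atomic_compare_exchange_n(&p[0], &old.q[0], new.q[0], false,
					 __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
		return 0;
	if (off + width > 8 &&
	    !__atomic_compare_exchange_n(&p[1], &old.q[1], new.q[1], false,
					 __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
		return 0;

	/* Step 7: both quadwords written, report success. */
	return 1;
}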
I think general-purpose non-atomic emulation of STx_C is a really bad idea.
Without looking at your gcc patches, I can guess what you're after: you've
generated an ll/sc sequence for an (aligned) short, and want to emulate it if
it happens to be unaligned.
Crucially, when emulating non-aligned, you should not strive to make it atomic. No other
architecture promises atomic non-aligned stores, so why should you do that here?
I suggest some sort of magic code sequence,
	bic	addr_in, 6, addr_al	# clear bits 1-2 only; an odd (unaligned)
					# address keeps bit 0 set, so the ldq_l traps
loop:
	ldq_l	t0, 0(addr_al)		# load-lock the containing quadword
	magic-nop done - loop		# recognizable no-op encoding the skip distance
	inswl	data, addr_in, t1	# shift the new word into position
	mskwl	t0, addr_in, t0		# clear the old word out of the quadword
	bis	t0, t1, t0		# merge
	stq_c	t0, 0(addr_al)		# store-conditional
	beq	t0, loop		# retry if the reservation was lost
done:
In the trap handler, match the magic-nop, pick out the input registers from
the following inswl, perform the two (atomic!) byte stores to accomplish the
emulation, and adjust the pc forward to the done label.
Choose anything you like for the magic-nop. The (done - loop) displacement is small, so
any 8-bit immediate would suffice. E.g. "eqv $31, disp, $31". You might require the
displacement to be constant and not actually extract "disp"; just match the entire
uint32_t instruction pattern.
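For concreteness, the trap-side matching might look something like the
sketch below. The helpers (get_insn, get_reg, set_pc), the function name,
and the choice to interpret the literal as an instruction count from the
trapping ldq_l are all assumptions made for the sketch, not part of the
suggestion; a real handler would use the arch's trap-frame accessors and
the uaccess routines for the byte stores:

#include <stdbool.h>
#include <stdint.h>

/* "eqv $31, disp, $31": opcode 0x11, function 0x48, literal form (bit 12). */
#define MAGIC_NOP_MASK	0xffe01fffu	/* all bits but the 8-bit literal */
#define MAGIC_NOP_BITS	(0x11u << 26 | 31u << 21 | 1u << 12 | 0x48u << 5 | 31u)

/* "inswl ra, rb, rc": opcode 0x12, function 0x1b, register form. */
#define INSWL_MASK	0xfc001fe0u
#define INSWL_BITS	(0x12u << 26 | 0x1bu << 5)

/* Hypothetical stand-ins for instruction fetch and saved-register access. */
extern uint32_t get_insn(uint64_t pc);
extern uint64_t get_reg(int regno);
extern void set_pc(uint64_t pc);

/* Called for an unaligned ldq_l fault at trap_pc; returns true if handled. */
static bool emulate_unaligned_word_store(uint64_t trap_pc)
{
	uint32_t magic = get_insn(trap_pc + 4);
	uint32_t inswl = get_insn(trap_pc + 8);
	volatile uint8_t *addr;
	uint64_t data;

	if ((magic & MAGIC_NOP_MASK) != MAGIC_NOP_BITS)
		return false;			/* not the magic sequence */
	if ((inswl & INSWL_MASK) != INSWL_BITS)
		return false;

	data = get_reg((inswl >> 21) & 31);	/* Ra of inswl: the new word */
	addr = (volatile uint8_t *)get_reg((inswl >> 16) & 31);	/* Rb: address */

	/*
	 * Two individually-atomic byte stores; the word as a whole is not
	 * written atomically.
	 */
	addr[0] = data & 0xff;
	addr[1] = (data >> 8) & 0xff;

	/* Skip to "done", taking the literal as (done - loop) in instructions. */
	set_pc(trap_pc + 4 * ((magic >> 13) & 0xff));
	return true;
}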
r~