On 2/19/25 04:46, Maciej W. Rozycki wrote:
Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
code generation options slated to land in GCC 15,
Pointer? I can't find it on the gcc-patches list.
implement emulation
for unaligned LDx_L and STx_C operations for the unlikely case where an
alignment violation has resulted from improperly written code and caused
these operations to trap in the atomic RMW memory access sequences emitted
to provide data consistency for non-BWX byte and word write operations and
for writes to unaligned data objects causing partial memory updates.
The principle of operation is as follows:
1. A trapping unaligned LDx_L operation results in the pair of adjacent
aligned whole data quantities spanned being read and stored for reference
by a subsequent STx_C operation, along with the width of the data
accessed, its virtual address, and the referring task, or NULL for the
kernel. The validity marker is set.
2. Regular memory load operations are used to retrieve the data, because
no atomicity is needed at this stage; matching the width accessed, either
LDQ_U or LDL is used, the latter even though it requires extra
operations, to avoid the complication of an unaligned longword located
entirely within an aligned quadword.
3. Data is masked, shifted and merged appropriately and returned in the
intended register as the result of the trapping LDx_L instruction.
4. A trapping unaligned STx_C operation results in the validity marker
being checked for being true, and in the width of the data accessed, the
virtual address, and the referring task or the kernel being checked for a
match. The pair of whole data quantities previously read by LDx_L
emulation is retrieved and the validity marker is cleared.
5. If the checks succeeded, then in an atomic loop the location of the
first whole data quantity is reread and the data retrieved is compared
with the value previously obtained. If there is no match, then the loop
is aborted, 0 is returned in the intended register as the result of the
trapping STx_C instruction, and emulation completes. Otherwise new data
obtained from the source operand of STx_C is combined with the data
retrieved, replacing the intended part by byte insertion, and an atomic
write of this new data is attempted. If it fails, the loop continues
from the beginning. Otherwise processing proceeds to the next step.
6. The same operations are performed on the second whole data quantity.
7. At this point both whole data quantities have been written, which
ensures that no intervening third-party write has changed them, at the
point of each write, from the values held at the previous LDx_L.
Therefore 1 is returned in the intended register as the result of the
trapping STx_C instruction.
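For illustration only, the bookkeeping and the STx_C side of this scheme
might be modelled in C roughly as below. All of the names here
(llsc_state, emulate_stx_c and so on) are made up for the sketch, a strong
64-bit compare-and-swap stands in for the per-quadword LDx_L/STx_C retry
loop of steps 5 and 6, and plain pointer accesses stand in for the
kernel's user-access machinery:

#include <stdbool.h>
#include <stdint.h>

/* State recorded by LDx_L emulation (step 1). */
struct llsc_state {
	bool	 valid;		/* validity marker */
	int	 width;		/* access width in bytes */
	uint64_t vaddr;		/* unaligned virtual address of the access */
	void	*task;		/* referring task, or NULL for the kernel */
	uint64_t lo, hi;	/* the two aligned quadwords read by LDx_L */
};

/* Steps 4-7: returns the value to place in the intended STx_C register. */
static int emulate_stx_c(struct llsc_state *st, uint64_t vaddr, int width,
			 void *task, uint64_t data)
{
	uint64_t *p = (uint64_t *)(vaddr & ~7ULL);
	unsigned int off = vaddr & 7;
	union { uint64_t q[2]; uint8_t b[16]; } old, new;
	int i;

	/* Step 4: the STx_C must pair with the recorded LDx_L. */
	if (!st->valid || st->width != width || st->vaddr != vaddr ||
	    st->task != task)
		return 0;
	st->valid = false;

	old.q[0] = st->lo;
	old.q[1] = st->hi;
	new = old;

	/* Merge the store data into place, byte by byte (little endian). */
	for (i = 0; i < width; i++)
		new.b[off + i] = data >> (8 * i);

	/*
	 * Steps 5-6: update each quadword only if it still holds the value
	 * seen at LDx_L time; any intervening third-party write fails the
	 * STx_C.
	 */
	if (!__atomic_compare_exchange_n(&p[0], &old.q[0], new.q[0], false,
					 __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
		return 0;
	if (off + width > 8 &&
	    !__atomic_compare_exchange_n(&p[1], &old.q[1], new.q[1], false,
					 __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
		return 0;

	/* Step 7: both quadwords written, report success. */
	return 1;
}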
I think general-purpose non-atomic emulation of STx_C is a really bad idea.
Without looking at your gcc patches, I can guess what you're after: you've
generated an ll/sc sequence for an (aligned) short, and want to emulate it if
it happens to be unaligned.
Crucially, when emulating non-aligned, you should not strive to make it atomic. No other
architecture promises atomic non-aligned stores, so why should you do that here?
I suggest some sort of magic code sequence,
	bic	addr_in, 6, addr_al	# clear bits 1-2 only; an odd (unaligned)
					# address keeps bit 0 set, so the ldq_l traps
loop:
	ldq_l	t0, 0(addr_al)		# load-lock the containing quadword
	magic-nop done - loop		# recognizable no-op encoding the skip distance
	inswl	data, addr_in, t1	# shift the new word into position
	mskwl	t0, addr_in, t0		# clear the old word out of the quadword
	bis	t0, t1, t0		# merge
	stq_c	t0, 0(addr_al)		# store-conditional
	beq	t0, loop		# retry if the reservation was lost
done:
In the trap handler, match the magic-nop, pick out the input registers from
the following inswl, perform the two (atomic!) byte stores to accomplish the
emulation, and adjust the pc forward to the done label.
Choose anything you like for the magic-nop. The (done - loop) displacement is small, so
any 8-bit immediate would suffice. E.g. "eqv $31, disp, $31". You might require the
displacement to be constant and not actually extract "disp"; just match the entire
uint32_t instruction pattern.
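For concreteness, the trap-side matching might look something like the
sketch below. The helpers (get_insn, get_reg, set_pc), the function name,
and the choice to interpret the literal as an instruction count from the
trapping ldq_l are all assumptions made for the sketch, not part of the
suggestion; a real handler would use the arch's trap-frame accessors and
the uaccess routines for the byte stores:

#include <stdbool.h>
#include <stdint.h>

/* "eqv $31, disp, $31": opcode 0x11, function 0x48, literal form (bit 12). */
#define MAGIC_NOP_MASK	0xffe01fffu	/* all bits but the 8-bit literal */
#define MAGIC_NOP_BITS	(0x11u << 26 | 31u << 21 | 1u << 12 | 0x48u << 5 | 31u)

/* "inswl ra, rb, rc": opcode 0x12, function 0x1b, register form. */
#define INSWL_MASK	0xfc001fe0u
#define INSWL_BITS	(0x12u << 26 | 0x1bu << 5)

/* Hypothetical stand-ins for instruction fetch and saved-register access. */
extern uint32_t get_insn(uint64_t pc);
extern uint64_t get_reg(int regno);
extern void set_pc(uint64_t pc);

/* Called for an unaligned ldq_l fault at trap_pc; returns true if handled. */
static bool emulate_unaligned_word_store(uint64_t trap_pc)
{
	uint32_t magic = get_insn(trap_pc + 4);
	uint32_t inswl = get_insn(trap_pc + 8);
	volatile uint8_t *addr;
	uint64_t data;

	if ((magic & MAGIC_NOP_MASK) != MAGIC_NOP_BITS)
		return false;			/* not the magic sequence */
	if ((inswl & INSWL_MASK) != INSWL_BITS)
		return false;

	data = get_reg((inswl >> 21) & 31);	/* Ra of inswl: the new word */
	addr = (volatile uint8_t *)get_reg((inswl >> 16) & 31);	/* Rb: address */

	/*
	 * Two individually-atomic byte stores; the word as a whole is not
	 * written atomically.
	 */
	addr[0] = data & 0xff;
	addr[1] = (data >> 8) & 0xff;

	/* Skip to "done", taking the literal as (done - loop) in instructions. */
	set_pc(trap_pc + 4 * ((magic >> 13) & 0xff));
	return true;
}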
r~