Re: [PATCH] Alpha: Emulate unaligned LDx_L/STx_C for data consistency

On Thu, 20 Feb 2025, Richard Henderson wrote:

> > Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> > code generation options slated to land in GCC 15,
> 
> Pointer?  I can't find it on the gcc-patches list.

 Here: 
<https://inbox.sourceware.org/gcc-patches/alpine.DEB.2.21.2501050246590.49841@xxxxxxxxxxxxxxxxx/>
and hopefully in your inbox/archive somewhere as well.

> > 7. At this point both whole data quantities have been written, ensuring
> >     that no third-party intervening write has changed them at the point
> >     of the write from the values held at previous LDx_L.  Therefore 1 is
> >     returned in the intended register as the result of the trapping STx_C
> >     instruction.
> 
> I think general-purpose non-atomic emulation of STx_C is a really bad idea.
> 
> Without looking at your gcc patches, I can guess what you're after: you've
> generated a ll/sc sequence for (aligned) short, and want to emulate if it
> happens to be unaligned.

 It's a corner case, yes: the compiler was told the access would be 
aligned, but at run time it turns out not to be.  It's where you cast a 
(char *) pointer that wasn't suitably aligned for such a cast to (short *) 
and dereference it (and the quadword case similarly arises at the ends of 
a misaligned inline `memcpy'/`memset').
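
 To make the pattern concrete, here is a hypothetical fragment of the kind 
meant (`store_short_at' is a made-up name, not something from the patch or 
the testsuite):

/*
 * With `-msafe-bwa' on a non-BWX target the halfword store below is
 * expected to be compiled as an LDx_L/STx_C masked-insert sequence on
 * the containing aligned quantity, on the assumption that `p' is
 * 2-byte aligned.  If `buf + 1' turns out to be odd at run time, the
 * STx_C traps and the emulation under discussion takes over.
 */
void store_short_at(char *buf, short v)
{
	short *p = (short *)(buf + 1);	/* potentially misaligned cast */

	*p = v;
}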

 Only two cases (plus a bug in the GM2 front end) hit this throughout the 
GCC testsuite, which shows how rare the situation is.

> Crucially, when emulating non-aligned, you should not strive to make it
> atomic.  No other architecture promises atomic non-aligned stores, so why
> should you do that here?

 This code doesn't strive to be atomic; it strives to keep data *outside* 
the quantity accessed from being clobbered, and for that purpose an atomic 
sequence is both unavoidable and sufficient for each of the two partial 
quantities surrounding the unaligned quantity written.  The trapping code 
does not expect atomicity for the unaligned quantity itself -- it is 
handled in pieces, just as with, say, the MIPS SWL/SWR masked store 
instruction pairs -- and this code, effectively an Alpha/Linux psABI 
extension, does not guarantee it either.
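
 To illustrate the intent, here is a rough userland C sketch of mine -- not 
the kernel code from the patch -- using GCC's `__atomic' builtins (which 
expand to LDQ_L/STQ_C loops on Alpha) in place of the raw instruction 
sequences; the `store_w_unaligned' name is made up:

#include <stdint.h>
#include <string.h>

/*
 * Store an unaligned 16-bit quantity without clobbering the bytes
 * around it: each containing aligned quadword gets its own atomic
 * read-modify-write, so a concurrent write to the neighbouring bytes
 * between the load and the store forces a retry instead of being
 * lost.  No atomicity is claimed for the 16-bit quantity itself; it
 * is committed in two pieces when it straddles a quadword boundary.
 */
static void store_w_unaligned(void *addr, uint16_t val)
{
	uintptr_t a = (uintptr_t)addr;
	unsigned char bytes[2];
	size_t done = 0;

	memcpy(bytes, &val, sizeof bytes);	/* Alpha is little-endian */

	while (done < sizeof bytes) {
		uintptr_t p = a + done;
		uint64_t *q = (uint64_t *)(p & ~(uintptr_t)7);
		unsigned int off = p & 7;
		size_t n = sizeof bytes - done;
		uint64_t mask, ins, old, new;

		if (n > 8 - off)
			n = 8 - off;	/* piece within this quadword */

		mask = ((UINT64_C(1) << (n * 8)) - 1) << (off * 8);
		ins = 0;
		memcpy(&ins, bytes + done, n);
		ins <<= off * 8;

		old = __atomic_load_n(q, __ATOMIC_RELAXED);
		do {
			/* merge our piece, preserving all other bytes */
			new = (old & ~mask) | ins;
		} while (!__atomic_compare_exchange_n(q, &old, new, 0,
						      __ATOMIC_RELAXED,
						      __ATOMIC_RELAXED));
		done += n;
	}
}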

> I suggest some sort of magic code sequence,
> 
> 	bic	addr_in, 6, addr_al
> loop:
> 	ldq_l	t0, 0(addr_al)
> 	magic-nop done - loop
> 	inswl	data, addr_in, t1
> 	mskwl	t0, addr_in, t0
> 	bis	t0, t1, t0
> 	stq_c	t0, 0(addr_al)
> 	beq	t0, loop
> done:
> 
> With the trap, match the magic-nop, pick out the input registers from the
> following inswl, perform the two (atomic!) byte stores to accomplish the
> emulation, adjust the pc forward to the done label.

 It makes no sense to me to penalise all user code for the corner case 
mentioned above while still keeping the emulation in the kernel, given 
that 99.999...% of accesses will have been correctly aligned by GCC.  And 
it gets even more complex when there is an awkward number of bytes to 
mask, such as 3, 5, 6 or 7, which will happen for example when GCC 
expands inline `memcpy' for a nominally quadword-aligned block of 31 
bytes: other instructions are then used for the masking/insertion of the 
trailing 7 bytes, and the block turns out misaligned at run time.
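
 For concreteness, a made-up example of that latter case (`copy31' is 
hypothetical, and how GCC actually expands it of course depends on the 
version and options used):

#include <string.h>

/*
 * With quadword-aligned pointer types GCC may expand this `memcpy'
 * inline: whole-quadword moves for the first 24 bytes and, under
 * `-msafe-partial', a masked LDQ_L/STQ_C insert sequence for the
 * trailing 7 bytes.  If the caller obtained `dst' from a cast of a
 * pointer that is not in fact quadword-aligned, that trailing STQ_C
 * traps with an awkward 7-byte piece to merge.
 */
void copy31(long *dst, const long *src)
{
	memcpy(dst, src, 31);
}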

 I'm unconvinced; it seems a lot of hassle for little gain to me.

  Maciej



