On Wed, Feb 19, 2025 at 7:46 AM Maciej W. Rozycki <macro@xxxxxxxxxxx> wrote:
>
> Complementing compiler support for the `-msafe-bwa' and `-msafe-partial'
> code generation options slated to land in GCC 15, implement emulation
> for unaligned LDx_L and STx_C operations for the unlikely case where an
> alignment violation has resulted from improperly written code and caused
> these operations to trap in atomic RMW memory access sequences made to
> provide data consistency for non-BWX byte and word write operations, and
> writes to unaligned data objects causing partial memory updates.
>
> The principle of operation is as follows:
>
> 1. A trapping unaligned LDx_L operation results in the pair of adjacent
>    aligned whole data quantities spanned being read and stored for the
>    reference with a subsequent STx_C operation, along with the width of
>    the data accessed and its virtual address, and the task referring or
>    NULL if the kernel. The valitidy marker is set.

Typo: validity (occurs twice more in the commit message below)

>
> 2. Regular memory load operations are used to retrieve data because no
>    atomicity is needed at this stage, and matching the width accessed,
>    either LDQ_U or LDL even though the latter instruction requires extra
>    operations, to avoid the case where an unaligned longword located
>    entirely within an aligned quadword would complicate handling.
>
> 3. Data is masked, shifted and merged appropriately and returned in the
>    intended register as the result of the trapping LDx_L instruction.
>
> 4. A trapping unaligned STx_C operation results in the valitidy marker
>    being checked for being true, and the width of the data accessed
>    along with the virtual address and the task referring or the kernel
>    for a match. The pair of whole data quantities previously read by
>    LDx_L emulation is retrieved and the valitidy marker is cleared.
>
> 5. If the checks succeeded, then in an atomic loop the location of the
>    first whole data quantity is reread, and data retrieved compared with
>    the value previously obtained. If there's no match, then the loop is
>    aborted and 0 is returned in the intended register as the result of
>    the trapping STx_C instruction and emulation completes. Otherwise
>    new data obtained from the source operand of STx_C is combined with
>    the data retrieved, replacing by byte insertion the part intended,
>    and an atomic write of this new data is attempted. If it fails, the
>    loop continues from the beginning. Otherwise processing proceeds to
>    the next step.
>
> 6. The same operations are performed on the second whole data quantity.
>
> 7. At this point both whole data quantities have been written, ensuring
>    that no third-party intervening write has changed them at the point
>    of the write from the values held at previous LDx_L. Therefore 1 is
>    returned in the intended register as the result of the trapping STx_C
>    instruction.
>
> 8. No user accesses are permitted in traps from the kernel mode as the
>    only LDx_L/STx_C accesses made to user memory locations by the kernel
>    are supposed to be those from handcrafted code, which has to written
>    such as not to trap.
>
> Since atomic loops are used for data updates the approach works equally
> well in both UP and SMP environments. No data atomicity is guaranteed,
> but data consistency is, that is concurrent RMW accesses won't clobber
> each other, however if the same data is concurrently written as already
> there with a regular write between emulated LDx_L and STx_C, then STx_C
> will still succeed. Likewise if data is modified, but then restored
> before STx_C has had a chance to run.
>
> This fulfils consistency requirements and guarantees that data outside
> the quantity written has not changed between emulated LDx_L and STx_C.
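
For my own understanding, per aligned quantity steps 4-7 come down to the
loop below.  This is illustrative C only: the helper name is mine and the
GCC __atomic built-ins merely stand in for the patch's actual LDx_L/STx_C
sequences (shown as unsigned long; the STL_C case works on longwords).

	/* "old" is the value recorded at the emulated LDx_L, "bits"/"mask"
	   the part of the STx_C data falling into this aligned quantity.  */
	static int stx_c_merge(unsigned long *p, unsigned long old,
			       unsigned long bits, unsigned long mask)
	{
		unsigned long cur;

		do {
			cur = __atomic_load_n(p, __ATOMIC_RELAXED);
			if (cur != old)
				return 0;	/* changed since LDx_L: fail */
		} while (!__atomic_compare_exchange_n(p, &cur,
						      (cur & ~mask) | bits, 0,
						      __ATOMIC_RELAXED,
						      __ATOMIC_RELAXED));
		return 1;			/* this quantity updated */
	}

Run on each of the two quantities in turn, with 0 from either becoming the
result of the emulated STx_C, that does give consistency without atomicity,
as described.
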
>
> Signed-off-by: Maciej W. Rozycki <macro@xxxxxxxxxxx>
> ---
> Hi,
>
> This has cleared the pair of `-msafe-bwa -msafe-partial' regressions
> observed in GCC verification (the third one was a Modula 2 frontend bug,
> now fixed in the compiler). I have verified individual misalignments with
> a small program by hand as well, for both the data retrieved by emulated
> LDx_L and the data stored by emulated STx_C.
>
> The kernel itself built with `-mcpu=ev4 -msafe-bwa -msafe-partial' boots
> and has passed GCC verification, and triggered no extra unaligned traps.
>
> Full verification was run with 6.3.0-rc5 and Ivan's stack alignment fixes
> applied just because I was confident already that version works correctly.
> Interestingly enough no kernel mode traps have triggered with a kernel
> built with GCC 12 (and with most user traps coming from GCC verification):
>
> kernel unaligned acc    : 0 (pc=0,va=0)
> user unaligned acc      : 1766720 (pc=20000053064,va=120020189)
>
> but with GCC 15 a small quantity happened (even before I ran GCC testing):
>
> kernel unaligned acc    : 78 (pc=fffffc0000ad5194,va=fffffc0002db5784)
> user unaligned acc      : 883452 (pc=20000053064,va=120020189)
>
> It seems a compiler regression worth checking -- the trap recorded was in
> `icmp6_dst_alloc' with a pair of quadword writes to `rt->rt6i_dst.addr',
> which however by its type (`struct in6_addr') is only longword-aligned and
> indeed starts at offset 148 from the outermost struct. I have a sneaking
> suspicion one of my earlier GCC changes might be at fault. At least I now
> have a test case to experiment with.
>
> I've also built and booted 6.9.0-rc3 as at commit 82c525bfafb4 ("alpha:
> trim the unused stuff from asm-offsets.c"), the last one before support
> for my system was axed. It has passed the verification with my small
> program (available by request; I'm not sure if it's worth turning into a
> kernel selftest).
>
> NB I'm going to ignore the 72 errors checkpatch.pl issues for EXC usage.
> The coding style of the new additions is consistent with the rest of the
> file and any change to that would best be made separately (but I fail to
> see the point).
>
> Questions, comments, concerns? Otherwise please apply, and I'll proceed
> with the rest of the GCC effort, followed by cleaning handwritten assembly
> up that uses STQ_U in our port and in glibc.
>
> Maciej
> ---
>  arch/alpha/kernel/traps.c |  409 ++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 400 insertions(+), 9 deletions(-)
>
> linux-alpha-llsc-unaligned.diff
> Index: linux-macro/arch/alpha/kernel/traps.c
> ===================================================================
> --- linux-macro.orig/arch/alpha/kernel/traps.c
> +++ linux-macro/arch/alpha/kernel/traps.c
> @@ -368,6 +368,13 @@ struct unaligned_stat {
>  	unsigned long count, va, pc;
>  } unaligned[2];
>
> +/* Unaligned LDx_L/STx_C emulation state. */
> +static DEFINE_RAW_SPINLOCK(ll_lock);
> +static struct task_struct *ll_task;
> +static unsigned long ll_data[2];
> +static unsigned long ll_va;
> +static bool ll_quad;
> +static bool ll_bit;
>
>  /* Macro for exception fixup code to access integer registers. */
>  #define una_reg(r)	(_regs[(r) >= 16 && (r) <= 18 ? (r)+19 : (r)])
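
As far as I can tell the new statics amount to one system-wide record of
the last emulated LDx_L, serialized by ll_lock; regrouped here purely for
illustration (the struct and the field comments are mine, the patch keeps
plain statics):

	struct ll_record {
		struct task_struct *task;	/* trapping task, NULL for the kernel */
		unsigned long va;		/* unaligned virtual address used */
		unsigned long data[2];		/* the two aligned quantities spanned */
		bool quad;			/* true for LDQ_L, false for LDL_L */
		bool valid;			/* the validity marker, ll_bit */
	};

A second emulated LDx_L from anywhere just replaces the record and the
earlier STx_C then fails its match, which seems fine as STx_C is allowed
to fail and the enclosing RMW loop retries.
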
> @@ -381,6 +388,9 @@ do_entUna(void * va, unsigned long opcod
>  	unsigned long pc = regs->pc - 4;
>  	unsigned long *_regs = regs->regs;
>  	const struct exception_table_entry *fixup;
> +	unsigned long flags;
> +	unsigned long la;
> +	bool ll_match;
>
>  	unaligned[0].count++;
>  	unaligned[0].va = (unsigned long) va;
> @@ -439,6 +449,65 @@ do_entUna(void * va, unsigned long opcod
>  		una_reg(reg) = tmp1|tmp2;
>  		return;
>
> +	case 0x2a: /* ldl_l */
> +		la = (unsigned long)va;
> +		if (la < TASK_SIZE)
> +			break;
> +		__asm__ __volatile__(
> +		"1:	ldl %3,0(%5)\n"
> +		"2:	ldl %4,4(%5)\n"
> +		"	srl %3,%6,%1\n"
> +		"	sll %4,%7,%2\n"
> +		"	zapnot %1,15,%1\n"
> +		"	zapnot %2,15,%2\n"
> +		"3:\n"
> +		EXC(1b,3b,%1,%0)
> +		EXC(2b,3b,%2,%0)
> +			: "=r"(error),
> +			  "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +			: "r"(la & ~3ul),
> +			  "r"((la & 3) * 8), "r"((4 - (la & 3)) * 8), "0"(0));
> +		if (error)
> +			goto got_exception;
> +		raw_spin_lock_irqsave(&ll_lock, flags);
> +		ll_va = la;
> +		ll_task = NULL;
> +		ll_data[0] = tmp3;
> +		ll_data[1] = tmp4;
> +		ll_quad = false;
> +		ll_bit = true;
> +		raw_spin_unlock_irqrestore(&ll_lock, flags);
> +		una_reg(reg) = (int)(tmp1|tmp2);
> +		return;
> +
> +	case 0x2b: /* ldq_l */
> +		la = (unsigned long)va;
> +		if (la < TASK_SIZE)
> +			break;
> +		__asm__ __volatile__(
> +		"1:	ldq_u %3,0(%5)\n"
> +		"2:	ldq_u %4,7(%5)\n"
> +		"	extql %3,%5,%1\n"
> +		"	extqh %4,%5,%2\n"
> +		"3:\n"
> +		EXC(1b,3b,%1,%0)
> +		EXC(2b,3b,%2,%0)
> +			: "=r"(error),
> +			  "=&r"(tmp1), "=r"(tmp2), "=&r"(tmp3), "=&r"(tmp4)
> +			: "r"(va), "0"(0));
> +		if (error)
> +			goto got_exception;
> +		raw_spin_lock_irqsave(&ll_lock, flags);
> +		ll_va = la;
> +		ll_task = NULL;
> +		ll_data[0] = tmp3;
> +		ll_data[1] = tmp4;
> +		ll_quad = true;
> +		ll_bit = true;
> +		raw_spin_unlock_irqrestore(&ll_lock, flags);
> +		una_reg(reg) = tmp1|tmp2;
> +		return;
> +
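
The LDQ_L case is then the usual unaligned-load idiom; in illustrative C
(my naming, ignoring the exception fixups and the locking above):

	static unsigned long emulated_ldq_l(unsigned long la,
					    unsigned long ll_data[2])
	{
		unsigned long lo = *(unsigned long *)(la & ~7UL);	/* ldq_u 0(va) */
		unsigned long hi = *(unsigned long *)((la + 7) & ~7UL);	/* ldq_u 7(va) */
		unsigned long sh = (la & 7) * 8;

		ll_data[0] = lo;
		ll_data[1] = hi;
		/* For an aligned address both reads hit the same quadword and
		   the merge degenerates to it; the guard also avoids a shift
		   by 64, which C does not allow.  */
		return sh ? (lo >> sh) | (hi << (64 - sh)) : lo;
	}
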
>  	/* Note that the store sequences do not indicate that they change
>  	   memory because it _should_ be affecting nothing in this context.
>  	   (Otherwise we have other, much larger, problems.) */
> @@ -513,6 +582,134 @@ do_entUna(void * va, unsigned long opcod
>  		if (error)
>  			goto got_exception;
>  		return;
> +
> +	case 0x2e: /* stl_c */
> +		la = (unsigned long)va;
> +		if (la < TASK_SIZE)
> +			break;
> +		raw_spin_lock_irqsave(&ll_lock, flags);
> +		ll_match = ll_bit;
> +		ll_match &= !ll_quad;
> +		ll_match &= ll_task == NULL;
> +		ll_match &= ll_va == la;
> +		tmp3 = ll_data[0];
> +		tmp4 = ll_data[1];
> +		ll_bit = false;
> +		raw_spin_unlock_irqrestore(&ll_lock, flags);
> +		if (ll_match) {
> +			__asm__ __volatile__(
> +			"	srl %6,%5,%3\n"
> +			"	zapnot %3,%8,%3\n"
> +			"1:	ldl_l %2,4(%4)\n"
> +			"	cmpeq %7,%2,%1\n"
> +			"	beq %1,4f\n"
> +			"	zap %2,%8,%2\n"
> +			"	or %2,%3,%1\n"
> +			"2:	stl_c %1,4(%4)\n"
> +			"	beq %1,3f\n"
> +			"	.subsection 2\n"
> +			"3:	br 1b\n"
> +			"	.previous\n"
> +			"4:\n"
> +			EXC(1b,4b,%2,%0)
> +			EXC(2b,4b,%1,%0)
> +				: "=r"(error), "=&r"(ll_match),
> +				  "=&r"(tmp1), "=&r"(tmp2)
> +				: "r"(la & ~3ul), "r"((4 - (la & 3)) * 8),
> +				  "r"(una_reg(reg)), "r"(tmp4),
> +				  "r"((15 >> (4 - (la & 3))) & 0xf), "0"(0));
> +			if (error)
> +				goto got_exception;
> +		}
> +		if (ll_match) {
> +			__asm__ __volatile__(
> +			"	sll %6,%5,%3\n"
> +			"	zapnot %3,%8,%3\n"
> +			"1:	ldl_l %2,0(%4)\n"
> +			"	cmpeq %7,%2,%1\n"
> +			"	beq %1,4f\n"
> +			"	zap %2,%8,%2\n"
> +			"	or %2,%3,%1\n"
> +			"2:	stl_c %1,0(%4)\n"
> +			"	beq %1,3f\n"
> +			"	.subsection 2\n"
> +			"3:	br 1b\n"
> +			"	.previous\n"
> +			"4:\n"
> +			EXC(1b,4b,%2,%0)
> +			EXC(2b,4b,%1,%0)
> +				: "=r"(error), "=&r"(ll_match),
> +				  "=&r"(tmp1), "=&r"(tmp2)
> +				: "r"(la & ~3ul), "r"((la & 3) * 8),
> +				  "r"(una_reg(reg)), "r"(tmp3),
> +				  "r"((15 << (la & 3)) & 0xf), "0"(0));
> +			if (error)
> +				goto got_exception;
> +		}
> +		una_reg(reg) = ll_match;
> +		return;
> +
> +	case 0x2f: /* stq_c */
> +		la = (unsigned long)va;
> +		if (la < TASK_SIZE)
> +			break;
> +		raw_spin_lock_irqsave(&ll_lock, flags);
> +		ll_match = ll_bit;
> +		ll_match &= ll_quad;
> +		ll_match &= ll_task == NULL;
> +		ll_match &= ll_va == la;
> +		tmp3 = ll_data[0];
> +		tmp4 = ll_data[1];
> +		ll_bit = false;
> +		raw_spin_unlock_irqrestore(&ll_lock, flags);
> +		if (ll_match) {
> +			__asm__ __volatile__(
> +			"	insqh %6,%4,%3\n"
> +			"1:	ldq_l %2,8(%5)\n"
> +			"	cmpeq %7,%2,%1\n"
> +			"	beq %1,4f\n"
> +			"	mskqh %2,%4,%2\n"
> +			"	or %2,%3,%1\n"
> +			"2:	stq_c %1,8(%5)\n"
> +			"	beq %1,3f\n"
> +			"	.subsection 2\n"
> +			"3:	br 1b\n"
> +			"	.previous\n"
> +			"4:\n"
> +			EXC(1b,4b,%2,%0)
> +			EXC(2b,4b,%1,%0)
> +				: "=r"(error), "=&r"(ll_match),
> +				  "=&r"(tmp1), "=&r"(tmp2)
> +				: "r"(va), "r"(la & ~7ul),
> +				  "r"(una_reg(reg)), "r"(tmp4), "0"(0));
> +			if (error)
> +				goto got_exception;
> +		}
> +		if (ll_match) {
> +			__asm__ __volatile__(
> +			"	insql %6,%4,%3\n"
> +			"1:	ldq_l %2,0(%5)\n"
> +			"	cmpeq %7,%2,%1\n"
> +			"	beq %1,4f\n"
> +			"	mskql %2,%4,%2\n"
> +			"	or %2,%3,%1\n"
> +			"2:	stq_c %1,0(%5)\n"
> +			"	beq %1,3f\n"
> +			"	.subsection 2\n"
> +			"3:	br 1b\n"
> +			"	.previous\n"
> +			"4:\n"
> +			EXC(1b,4b,%2,%0)
> +			EXC(2b,4b,%1,%0)
> +				: "=r"(error), "=&r"(ll_match),
> +				  "=&r"(tmp1), "=&r"(tmp2)
> +				: "r"(va), "r"(la & ~7ul),
> +				  "r"(una_reg(reg)), "r"(tmp3), "0"(0));
> +			if (error)
> +				goto got_exception;
> +		}
> +		una_reg(reg) = ll_match;
> +		return;
>  	}
>
>  	printk("Bad unaligned kernel access at %016lx: %p %lx %lu\n",
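
I worked the STL_C byte masks through by hand as a sanity check (my own
arithmetic, not a request for changes): for (la & 3) = 1, 2, 3 the mask
applied to the longword at offset 0, (15 << (la & 3)) & 0xf, comes to 0xe,
0xc, 0x8 and the one applied to the longword at offset 4,
(15 >> (4 - (la & 3))) & 0xf, to 0x1, 0x3, 0x7, so each misalignment
covers exactly the four bytes of the store and nothing outside them.
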
> @@ -624,24 +821,33 @@ s_reg_to_mem (unsigned long s_reg)
>   * so finding the appropriate registers is a little more difficult
>   * than in the kernel case.
>   *
> - * Finally, we handle regular integer load/stores only. In
> - * particular, load-linked/store-conditionally and floating point
> - * load/stores are not supported. The former make no sense with
> - * unaligned faults (they are guaranteed to fail) and I don't think
> - * the latter will occur in any decent program.
> + * We have three classes of operations to handle:
>   *
> - * Sigh. We *do* have to handle some FP operations, because GCC will
> - * uses them as temporary storage for integer memory to memory copies.
> - * However, we need to deal with stt/ldt and sts/lds only.
> + * - We handle regular integer load/stores transparently to faulting
> + *   code, preserving the semantics of the triggering instruction.
> + *
> + * - We handle some FP operations as well, because GCC will use them as
> + *   temporary storage for integer memory to memory copies. However,
> + *   we need to deal with stt/ldt and sts/lds only.
> + *
> + * - We handle load-locked/store-conditional operations by maintaining
> + *   data consistency only, within the two adjacent longwords or
> + *   quadwords partially spanned. This is sufficient to guarantee an
> + *   unaligned RMW sequence using these operations won't clobber data
> + *   *outside* the location intended but does *not* guarantee atomicity
> + *   for the data quantity itself.
> + */
>
>  #define OP_INT_MASK	( 1L << 0x28 | 1L << 0x2c   /* ldl stl */	\
> +			| 1L << 0x2a | 1L << 0x2e   /* ldl_l stl_c */	\
>  			| 1L << 0x29 | 1L << 0x2d   /* ldq stq */	\
> +			| 1L << 0x2b | 1L << 0x2f   /* ldq_l stq_c */	\
>  			| 1L << 0x0c | 1L << 0x0d   /* ldwu stw */	\
>  			| 1L << 0x0a | 1L << 0x0e ) /* ldbu stb */
>
>  #define OP_WRITE_MASK	( 1L << 0x26 | 1L << 0x27   /* sts stt */	\
>  			| 1L << 0x2c | 1L << 0x2d   /* stl stq */	\
> +			| 1L << 0x2e | 1L << 0x2d   /* stl_c stq_c */	\

stq_c should be 0x2f, not 0x2d; looks like a copy-and-paste slip from the
stl/stq line just above.
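
Presumably the intent was:

	| 1L << 0x2e | 1L << 0x2f   /* stl_c stq_c */	\

so that stq_c is actually classified as a write here.
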