On Fri, Dec 30, 2022 at 4:10 PM Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
>
> I made it a per-cpu variable (outside struct tlb_state to be visible in
> modules). __get/put_user_X() now have a single instruction to untag the
> address and it is gated by X86_FEATURE_LAM.

Yeah, that looks more reasonable to me.

> BTW, am I blind or we have no infrastructure to hookup static branches
> from assembly?

I think you're right.

> It would be a better fit than ALTERNATIVE here. It would allow to defer
> overhead until the first user of the feature.

Well, it would make the overhead worse once people actually start
using it. So it's not obvious that a static branch is really the
right thing to do.

That said, while I think that UNTAG_ADDR is quite reasonable now, the
more I look at getuser.S and putuser.S, the more I'm thinking that
getting rid of the TASK_SIZE comparison entirely is the right thing
to do on x86-64. It's really rather nasty, with not just that whole
LA57 alternative, but it's doing a large 64-bit constant too.

Now, on 32-bit, we do indeed have to compare against TASK_SIZE
explicitly, but on 32-bit we could just use an immediate for the cmp
instruction, so even there that whole "load constant" isn't really
optimal.

And on 64-bit, we really only need to check the high bit. In fact, we
don't even want to *check* it, because then we need to do that
disgusting array_index_mask_nospec thing to mask the bits for it, so
it would be even better to use purely arithmetic with no conditionals
anywhere.

And that's exactly what we could do on x86-64:

        movq %rdx,%rax
        sarq $63,%rax
        orq %rax,%rdx

would actually be noticeably better than what we do now for TASK_SIZE
checking _and_ for the array index masking (for putuser.S, we'd use
%rbx instead of %rax in that sequence).

The above three simple instructions would replace all of the games we
now play with

        LOAD_TASK_SIZE_MINUS_N(0)
        cmp %_ASM_DX,%_ASM_AX
        jae bad_get_user
        sbb %_ASM_DX, %_ASM_DX          /* array_index_mask_nospec() */
        and %_ASM_DX, %_ASM_AX

entirely.

It would just turn all kernel addresses into all ones, which is then
guaranteed to fault. So no need for any conditional that never
triggers in real life anyway.

On 32-bit, we'd still have to do that old sequence, but we'd replace
the

        LOAD_TASK_SIZE_MINUS_N(0)
        cmp %_ASM_DX,%_ASM_AX

with just the simpler

        cmp $TASK_SIZE_MAX-(n),%_ASM_AX

since the only reason we do that immediate load is because there is
no 64-bit immediate compare instruction.

And once we don't test against TASK_SIZE, the need for UNTAG_ADDR
just goes away, so now LAM is better too.

In other words, we could actually improve on our current code _and_
simplify the LAM situation. Win-win.

Anyway, I do not hate the version of the patch you posted, but I do
think that the win-win of just making LAM not _have_ this issue in
the first place might be the preferable one.

The one thing that that "shift by 63 and bitwise or" trick does
require is that the _ASM_EXTABLE_UA() thing for getuser/putuser would
have to have an extra annotation to shut up the

        WARN_ONCE(trapnr == X86_TRAP_GP,
                  "General protection fault in user access. Non-canonical address?");

in ex_handler_uaccess() for the GP trap that users can now cause by
giving a non-canonical address with the high bit clear. So we'd
probably just want a new EX_TYPE_* for these cases, but that still
looks fairly straightforward.

Hmm?

                Linus
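
For illustration, that "new EX_TYPE_*" idea could be shaped roughly
like the sketch below. This is only a sketch, not a patch from this
thread: the type name EX_TYPE_UACCESS_NO_GP_WARN, its numeric value,
and the extra "warn" argument are invented for the example; the
surrounding pieces (ex_handler_uaccess(), ex_handler_default(), the
type switch in fixup_exception(), and _ASM_EXTABLE_TYPE()) are the
existing x86 extable machinery.

        /* arch/x86/include/asm/extable_fixup_types.h (hypothetical
         * entry; both the name and the number are placeholders): */
        #define EX_TYPE_UACCESS_NO_GP_WARN      20

        /* arch/x86/mm/extable.c: let getuser/putuser opt out of the
         * warning, since a masked non-canonical address now
         * legitimately takes #GP instead of being filtered out by
         * the TASK_SIZE compare: */
        static bool ex_handler_uaccess(const struct exception_table_entry *fixup,
                                       struct pt_regs *regs, int trapnr, bool warn)
        {
                WARN_ONCE(warn && trapnr == X86_TRAP_GP,
                          "General protection fault in user access. Non-canonical address?");
                return ex_handler_default(fixup, regs);
        }

        /* ...and in fixup_exception()'s switch on the extable entry type: */
                case EX_TYPE_UACCESS:
                        return ex_handler_uaccess(e, regs, trapnr, true);
                case EX_TYPE_UACCESS_NO_GP_WARN:
                        return ex_handler_uaccess(e, regs, trapnr, false);

The _ASM_EXTABLE_UA(from, to) annotations in getuser.S/putuser.S
would then become _ASM_EXTABLE_TYPE(from, to, EX_TYPE_UACCESS_NO_GP_WARN)
at the access sites that can now see the masked address.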