On Fri, Dec 30, 2022 at 4:10 PM Kirill A. Shutemov <kirill@xxxxxxxxxxxxx> wrote:
>
> I made it a per-cpu variable (outside struct tlb_state to be visible in
> modules). __get/put_user_X() now have a single instruction to untag the
> address and it is gated by X86_FEATURE_LAM.

Yeah, that looks more reasonable to me.

> BTW, am I blind or we have no infrastructure to hookup static branches
> from assembly?

I think you're right.

> It would be a better fit than ALTERNATIVE here. It would allow to defer
> overhead until the first user of the feature.

Well, it would make the overhead worse once people actually start
using it. So it's not obvious that a static branch is really the
right thing to do.

That said, while I think that UNTAG_ADDR is quite reasonable now, the
more I look at getuser.S and putuser.S, the more I'm thinking that
getting rid of the TASK_SIZE comparison entirely is the right thing
to do on x86-64. It's really rather nasty, with not just that whole
LA57 alternative, but it's doing a large 64-bit constant too.

Now, on 32-bit, we do indeed have to compare against TASK_SIZE
explicitly, but on 32-bit we could just use an immediate for the cmp
instruction, so even there that whole "load constant" isn't really
optimal.

And on 64-bit, we really only need to check the high bit. In fact, we
don't even want to *check* it, because then we need to do that
disgusting array_index_mask_nospec thing to mask the bits for it, so
it would be even better to use purely arithmetic with no conditionals
anywhere.

And that's exactly what we could do on x86-64:

        movq %rdx,%rax
        sarq $63,%rax
        orq %rax,%rdx

would actually be noticeably better than what we do now for TASK_SIZE
checking _and_ for the array index masking (for putuser.S, we'd use
%rbx instead of %rax in that sequence).

The above three simple instructions would replace all of the games we
now play with

        LOAD_TASK_SIZE_MINUS_N(0)
        cmp %_ASM_DX,%_ASM_AX
        jae bad_get_user
        sbb %_ASM_DX, %_ASM_DX          /* array_index_mask_nospec() */
        and %_ASM_DX, %_ASM_AX

entirely.

It would just turn all kernel addresses into all ones, which is then
guaranteed to fault. So no need for any conditional that never
triggers in real life anyway.

On 32-bit, we'd still have to do that old sequence, but we'd replace
the

        LOAD_TASK_SIZE_MINUS_N(0)
        cmp %_ASM_DX,%_ASM_AX

with just the simpler

        cmp $TASK_SIZE_MAX-(n),%_ASM_AX

since the only reason we do that immediate load is because there is
no 64-bit immediate compare instruction.

And once we don't test against TASK_SIZE, the need for UNTAG_ADDR
just goes away, so now LAM is better too.

In other words, we could actually improve on our current code _and_
simplify the LAM situation. Win-win.

Anyway, I do not hate the version of the patch you posted, but I do
think that the win-win of just making LAM not _have_ this issue in
the first place might be the preferable one.

The one thing that that "shift by 63 and bitwise or" trick does
require is that the _ASM_EXTABLE_UA() thing for getuser/putuser would
have to have an extra annotation to shut up the

        WARN_ONCE(trapnr == X86_TRAP_GP,
                  "General protection fault in user access. Non-canonical address?");

in ex_handler_uaccess() for the GP trap that users can now cause by
giving a non-canonical address with the high bit clear. So we'd
probably just want a new EX_TYPE_* for these cases, but that still
looks fairly straightforward.

Hmm?

                Linus
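
For illustration, that "new EX_TYPE_*" idea could be shaped roughly
like the sketch below. This is only a sketch, not a patch from this
thread: the type name EX_TYPE_UACCESS_NO_GP_WARN, its numeric value,
and the extra "warn" argument are invented for the example; the
surrounding pieces (ex_handler_uaccess(), ex_handler_default(), the
type switch in fixup_exception(), and _ASM_EXTABLE_TYPE()) are the
existing x86 extable machinery.

        /* arch/x86/include/asm/extable_fixup_types.h (hypothetical
         * entry; both the name and the number are placeholders): */
        #define EX_TYPE_UACCESS_NO_GP_WARN      20

        /* arch/x86/mm/extable.c: let getuser/putuser opt out of the
         * warning, since a masked non-canonical address now
         * legitimately takes #GP instead of being filtered out by
         * the TASK_SIZE compare: */
        static bool ex_handler_uaccess(const struct exception_table_entry *fixup,
                                       struct pt_regs *regs, int trapnr, bool warn)
        {
                WARN_ONCE(warn && trapnr == X86_TRAP_GP,
                          "General protection fault in user access. Non-canonical address?");
                return ex_handler_default(fixup, regs);
        }

        /* ...and in fixup_exception()'s switch on the extable entry type: */
                case EX_TYPE_UACCESS:
                        return ex_handler_uaccess(e, regs, trapnr, true);
                case EX_TYPE_UACCESS_NO_GP_WARN:
                        return ex_handler_uaccess(e, regs, trapnr, false);

The _ASM_EXTABLE_UA(from, to) annotations in getuser.S/putuser.S
would then become _ASM_EXTABLE_TYPE(from, to, EX_TYPE_UACCESS_NO_GP_WARN)
at the access sites that can now see the masked address.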