On Tue, Apr 26, 2022 at 5:14 PM Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> So when we enter the function, we shift %rcx to get the number of
> qword-sized quantities to zero:
>
>   SYM_FUNC_START(clear_user_original)
>           mov %rcx,%rax
>           shr $3,%rcx             # qwords

Yes. But that's what we do for "rep stosq" too, for all the same reasons.

> but when we encounter the fault here, we return *%rcx* - not %rcx << 3
> - latter being the *bytes* leftover which we *actually* need to return
> when we encounter the #PF.

Yes, but:

> So, we need to shift back when we fault during the qword-sized zeroing,
> i.e., full function below, see label 3 there.

No.

The problem is that you're using the wrong exception type. Thanks for
posting the whole thing, because that makes it much more obvious.

You have the exception table entries switched.

You should have

        _ASM_EXTABLE_TYPE_REG(0b, 3b, EX_TYPE_UCOPY_LEN8, %rax)
        _ASM_EXTABLE_UA(2b, 3b)

and not need that label '4' at all.

Note how that "_ASM_EXTABLE_TYPE_REG" thing is literally designed to do

        %rcx = 8*%rcx+%rax

in the exception handler.

Of course, to get that regular _ASM_EXTABLE_UA(2b, 3b) to work, you
need to have the final byte count in %rcx, not in %rax, so that means
the "now do the rest of the bytes" case should have done something like

        movl %eax,%ecx
2:      movb $0,(%rdi)
        inc %rdi
        decl %ecx
        jnz 2b

instead.

Yeah, yeah, you could also use that _ASM_EXTABLE_TYPE_REG thing for the
second exception point, keep %rcx at zero and the count in %eax, and
depend on that whole "%rcx = 8*%rcx+%rax" fixing it up, knowing that if
an exception does *not* happen, %rcx will be zero from the qword loop.

But that really seems much too subtle for me - why not just keep things
in %rcx, and try to make this look as much as possible like the
"rep stosq + rep stosb" case?

And finally: I still think that those fancy exception table things are
*much* too fancy, much too subtle, and much too complicated.

So I'd actually prefer to get rid of them entirely, make the code use
the "no register changes" exception, and make the exception handler do
a proper site-specific fixup.

At that point, you can get rid of all the "mask bits early" logic, get
rid of all the extraneous 'test' instructions, and make it all look
something like below.

NOTE! I've intentionally kept the %eax thing using 32-bit instructions -
smaller encoding, and only the low three bits matter, so why move/mask
the full 64 bits?

NOTE2! Entirely untested. But I tried to make the normal code do minimal
work, and then fix things up more in the exception case. So it just
keeps the original count in the 32 bits in %eax until it wants to test
it, and then uses the 'andl' to both mask and test. And the exception
case knows that, so it masks there too.

I dunno. But I really think that whole new _ASM_EXTABLE_TYPE_REG and
EX_TYPE_UCOPY_LEN8 was unnecessary.

                 Linus

SYM_FUNC_START(clear_user_original)
        movl %ecx,%eax          # save count; only the low 3 bits are used later
        shrq $3,%rcx            # qwords
        jz .Lrest_bytes

        # do the qwords first
        .p2align 4
.Lqwords:
        movq $0,(%rdi)
        lea 8(%rdi),%rdi
        dec %rcx
        jnz .Lqwords

.Lrest_bytes:
        andl $7,%eax            # rest bytes
        jz .Lexit

        # now do the rest bytes
.Lbytes:
        movb $0,(%rdi)
        inc %rdi
        decl %eax
        jnz .Lbytes

.Lexit:
        xorl %eax,%eax
        RET                     # leftover byte count is in %rcx (0 on success)

.Lqwords_exception:
        # convert qwords back into bytes to return to caller
        shlq $3, %rcx
        andl $7, %eax           # %eax still holds the unmasked original count here
        addq %rax,%rcx
        jmp .Lexit

.Lbytes_exception:
        movl %eax,%ecx          # remaining tail bytes already counted down in %eax
        jmp .Lexit

        _ASM_EXTABLE_UA(.Lqwords, .Lqwords_exception)
        _ASM_EXTABLE_UA(.Lbytes, .Lbytes_exception)
SYM_FUNC_END(clear_user_original)
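
To make the "%rcx = 8*%rcx+%rax" arithmetic concrete, here is a minimal
user-space C sketch (not kernel code) of what the EX_TYPE_UCOPY_LEN8
fixup computes, assuming - as in the patch under discussion, which masks
%rax early - that %rax holds only the 0..7 tail bytes when the qword
loop faults:

        #include <assert.h>

        /*
         * Model of the EX_TYPE_UCOPY_LEN8 fixup described above: on a
         * fault the handler computes %rcx = 8*%rcx + %rax.  Here rcx
         * is the number of qwords still to clear and rax the 0..7
         * tail bytes, so the result is the number of bytes left
         * uncleared.
         */
        static unsigned long ucopy_len8_fixup(unsigned long rcx, unsigned long rax)
        {
                return 8 * rcx + rax;
        }

        int main(void)
        {
                unsigned long count = 29;       /* 3 qwords + 5 tail bytes */

                /* Fault with 2 qwords still to go: 2*8 + 5 = 21 bytes left. */
                assert(ucopy_len8_fixup(2, count & 7) == 21);
                return 0;
        }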
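
And to sanity-check the two site-specific fixups in the rewritten
function, a small user-space C model of its control flow (again a
sketch, not kernel code; fault_at is a hypothetical parameter standing
in for the first offset whose store takes a #PF).  It returns what %rcx
would hold on exit, counting a qword whose store faults as entirely
uncleared, just like the assembly does:

        #include <assert.h>
        #include <stddef.h>

        /* Returns the number of bytes NOT cleared, as %rcx would. */
        static size_t clear_user_model(size_t count, size_t fault_at)
        {
                size_t eax = count;             /* movl %ecx,%eax */
                size_t rcx = count >> 3;        /* shrq $3,%rcx  # qwords */
                size_t rdi = 0;                 /* destination offset */

                while (rcx) {                           /* .Lqwords */
                        if (fault_at < rdi + 8)         /* movq $0,(%rdi) faults */
                                /* .Lqwords_exception: shl, mask, add */
                                return (rcx << 3) + (eax & 7);
                        rdi += 8;
                        rcx--;
                }

                eax &= 7;                               /* andl $7,%eax */
                while (eax) {                           /* .Lbytes */
                        if (fault_at == rdi)            /* movb $0,(%rdi) faults */
                                return eax;             /* .Lbytes_exception */
                        rdi++;
                        eax--;
                }
                return 0;                               /* .Lexit */
        }

        int main(void)
        {
                assert(clear_user_model(29, (size_t)-1) == 0);  /* no fault */
                assert(clear_user_model(29, 9) == 21);  /* 2nd qword faults */
                assert(clear_user_model(29, 25) == 4);  /* 2nd tail byte faults */
                return 0;
        }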