Re: [kernel-hardening] Re: [PATCH v4.1 07/10] x86: narrow out of bounds syscalls to sys_read under speculation

On Sun, Jan 21, 2018 at 6:04 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Sun, Jan 21, 2018 at 5:38 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>
>> 3. What's with the "sbb; and" pair?  I can see two sane ways to do
>> this.  One is cmovaq [something safe], %rax,
>
> Heh. I think it's partly about being old-fashioned. sbb has always
> been around, and is the traditional trick for 0/-1.
>
> Also, my original suggested thing did the *access* too, and masked the
> result with the same mask.
>
> But I guess we could use cmov instead. It has very similar performance
> (ie it was relatively slow on P4, but so was sbb).
>
> However, I suspect it actually has a slightly higher register
> pressure, since you'd need to have that zero register (zero being the
> "safe" value), and the only good way to get a zero value is the xor
> thing, which affects flags and thus needs to be before the cmp.
>
> In contrast, the sbb trick has no early inputs needed.
>
> So on the whole, 'cmov' may be more natural on a conceptual level, but
> the sbb trick really is a very "traditional x86 thing" to do.

Fair enough.
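
(For anyone following along at home: the sbb trick we're talking about
boils down to something like the helper below.  This is only a sketch of
the idea, with a made-up name, not the code from the patch.)

        /*
         * Branchless bounds mask: cmp sets CF when index < size, and
         * sbb then turns CF into all-ones (in bounds) or all-zeroes
         * (out of bounds), with no conditional branch to mispredict.
         */
        static inline unsigned long mask_nospec(unsigned long index,
                                                unsigned long size)
        {
                unsigned long mask;

                asm ("cmp %1, %2; sbb %0, %0"
                     : "=r" (mask)
                     : "g" (size), "r" (index)
                     : "cc");
                return mask;
        }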

That being said, what I *actually* want to do is to nuke this thing
entirely.  I just wrote a patch to turn off the SYSCALL64 fast path
entirely when retpolines are on.  Then this issue can be dealt with in
C.  I assume someone has a brilliant way to make gcc automatically do
something intelligent about guarded array access in C. </snicker>
Seriously, though, the retpolined fast path is actually slower than
the slow path on a "minimal" retpoline kernel (which is what I'm using
because Fedora hasn't pushed out a retpoline compiler yet), and I
doubt it's more than the tiniest speedup on a full retpoline kernel.
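
(Concretely, the kind of C I mean is roughly the toy dispatcher below,
reusing the mask_nospec() sketch above.  The names are invented for the
example; the real thing would live in the syscall path, and the helper
has to stay as asm or behind a compiler barrier so gcc can't turn the
mask back into a predictable branch.)

        #include <stddef.h>

        typedef long (*sys_fn)(void);

        long dispatch(size_t nr, const sys_fn *table, size_t nr_entries)
        {
                if (nr < nr_entries) {
                        /*
                         * Architecturally a no-op; under a mispredicted
                         * bounds check it forces nr to 0, so speculation
                         * can only ever read table entry 0.
                         */
                        nr &= mask_nospec(nr, nr_entries);
                        return table[nr]();
                }
                return -1;
        }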

I've read a bunch of emails flying around saying that retpolines
aren't that bad.  In very informal benchmarking, a single mispredicted
ret (which is what a retpoline is) seems to take 20-30 cycles on
Skylake.  That's pretty far from "not bad".  Is IBRS somehow doing
something that adversely affects code that doesn't use indirect
branches?  Because I'm having a bit of a hard time imagining IBRS
hurting indirect branches worse than retpolines do.
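
(For reference, the thunk shape I'm measuring is roughly the one below;
the label and symbol names are made up for the sketch.  The indirect
branch becomes a ret whose predicted target is the capture loop, so the
real target in %rax is only reached once the overwritten return address
is resolved, which is where the ~20-30 cycles go.)

        /*
         * Rough shape of an x86-64 retpoline thunk (sketch only).
         * The call pushes the address of the capture loop; that is what
         * the return predictor speculates into.  The mov overwrites the
         * saved return address with the real target, so the final ret
         * goes to the right place architecturally but behaves like a
         * mispredicted return.
         */
        asm(
        "__indirect_thunk_rax_sketch:\n"
        "       call 1f\n"
        "2:     pause\n"
        "       lfence\n"
        "       jmp 2b\n"
        "1:     mov %rax, (%rsp)\n"
        "       ret\n"
        );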


