Hi Alexei, On Wed, Feb 21, 2018 at 06:20:37PM -0800, Alexei Starovoitov wrote: > On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote: > > > > Obvious candidates are: meta, numgen, limit, objref, quota, reject. > > > > We should probably also consider removing > > CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always > > build both too (at least rbtree since that offers interval). > > > > For the indirect call issue we can use direct calls from eval loop for > > some of the more frequently used ones, similar to what we do already > > for nft_cmp_fast_expr. > > nft_cmp_fast_expr and other expressions mentioned above made me thinking... > > do we have the same issue with nft interpreter as we had with bpf one? > bpf interpreter was used as part of spectre2 attack to leak > information via cache side channel and let VM read hypervisor memory. > Due to that issue we removed bpf interpreter from the kernel code. > That's what CONFIG_BPF_JIT_ALWAYS_ON for... > but we still have nft interpreter in the kernel that can also > execute arbitrary nft expressions. > > Jann's exploit used the following bpf instructions: > struct bpf_insn evil_bytecode_instrs[] = { > // rax = target_byte_addr > { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 0, .imm = target_byte_addr }, { .imm = target_byte_addr>>32 }, We don't place pointers in the nft VM registers, it's basically illegal to do so, otherwise we would need more sophisticated verifier. I'm telling this because we don't have a way to point to any arbitrary address as in 'target_byte_addr' above. > // rdi = timing_leak_array > { .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 1, .imm = host_timing_leak_addr }, { .imm = host_timing_leak_addr>>32 }, > // rax = *(u8*)rax > { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0 }, > // rax = rax << ... > { .code = BPF_ALU64 | BPF_LSH | BPF_K, .dst_reg = 0, .imm = 10 - bit_idx }, > // rax = rax & 0x400 > { .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = 0, .imm = 0x400 }, > // rax = rdi + rax > { .code = BPF_ALU64 | BPF_ADD | BPF_X, .dst_reg = 0, .src_reg = 1 }, > // *(u8*) (rax + 0x800) > { .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0x800 }, > > and a gadget to jump into __bpf_prog_run with insn pointing > to memory controlled by the guest while accessible > (at different virt address) by the hypervisor. > > It seems possible to construct similar sequence of instructions > out of nft expressions and use gadget that jumps into nft_do_chain(). > The attacker would need to discover more kernel addresses: > nft_do_chain, nft_cmp_fast_ops, nft_payload_fast_ops, nft_bitwise_eval, > nft_lookup_eval, and nft_bitmap_lookup > to populate nft chains, rules and expressions in guest memory > comparing to bpf interpreter attack. > > Then in nft_do_chain(struct nft_pktinfo *pkt, void *priv) > pkt needs to point to fake struct sk_buff in guest memory with > skb->head == target_byte_addr We don't have a way to make this point to fake struct sk_buff. > The first nft expression can be nft_payload_fast_eval(). > If it's properly constructed with > (nft_payload->based == NFT_PAYLOAD_NETWORK_HEADER, offset == 0, len == 0, dreg == 1) We can reject len == 0. To be honest, this is not done right now, but we can place a patch to validate this. Given this is a specialized networking virtual machine, it retain semantics, so fetching zero length data from a skbuff makes no sense, hence, we can return EINVAL via netlink when adding a rule that tries to do this. > it will do arbitrary load of > *(u8 *)dest = *(u8 *)ptr; > from target_byte_addr into register 1 of nft state machine > (dest is u32 array of registers in the stack of nft_do_chain) > Second nft expression can be nft_bitwise_eval() to mask particular > bit in register 1. > Then nft_cmp_eval() to check whether bit is one or zero and > conditional NFT_BREAK out of first nft expression into second nft rule. > The last conditional nft_immediate_eval() in the first rule will set > register 1 to 0x400 * 8 while the first nft_bitwise_eval() in > the second rule with do r1 &= 0x400 * 8. > So at this point r1 will have either 0x400 * 8 or 0 depending > on value of speculatively loaded bit. > The last expression can be nft_lookup_eval() with > nft_lookup->set->ops->lookup == nft_bitmap_lookup > which will do nft_bitmap->bitmap[idx] where idx = r1 / 8 > The memory used for this last nft_lookup/bitmap expression is > both an instruction and timing_leak_array itself. > If I'm not mistaken, this sequence of nft expression will > speculatively execute very similar logic as in evil_bytecode_instrs[] My impression is that several assumptions above are not correct. > The amount of actual speculative native cpu load/stores/branches is > probably more than executed by bpf interpreter for these evil bytecodes, > but likely well within cpu speculation window of 100+ insns. > > Obviously such exploit is harder to do than bpf based one. > Do we need to do anything about it ? > May be it's easier to find gadgets in .text of vmlinux > instead of messing with interpreters? > > Jann, > can you comment on removing interpreters in general? > Do we need to worry about having bpf and/or nft interpreter > in the kernel? Jann, any review on that front is very much welcome, so please let me know if you find any real problem. Thanks for reviewing! -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html