On Mon, 12 Feb 2024 at 23:14, Yan Zhai <yan@xxxxxxxxxxxxxx> wrote:
>
> Hello!
>
> We are getting page faults inside a BPF tracepoint program that
> accessed not-present pages, which caused a kernel panic:
>
> [717542.963064][T897981] BUG: unable to handle page fault for address: ffffffffff600c7d
> [717542.975692][T897981] #PF: supervisor read access in kernel mode
> [717542.986496][T897981] #PF: error_code(0x0000) - not-present page
> [717542.997237][T897981] PGD 1965012067 P4D 1965012067 PUD 1965014067 PMD 1965016067 PTE 0
> [717543.009965][T897981] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [717543.019835][T897981] CPU: 34 PID: 897981 Comm: warp-service Kdump: loaded Tainted: G O 6.1.74-cloudflare-2024.1.14 #1
> [717543.041140][T897981] Hardware name: HYVE EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022
> [717543.059260][T897981] RIP: 0010:bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
> [717543.071449][T897981] Code: ff eb 07 48 8b bf f8 04 00 00 49 bb 80 0e 00 00 00 80 00 00 4c 39 df 72 0c 49 89 fb 49 81 c3 80 0e 00 00 73 05 45 31 ed eb 07 <4c> 8b af 80 0e 00 00 48 89 ee 48 83 c6 f0 48 bf 00 04 7a 0a 3d 9e
> [717543.104780][T897981] RSP: 0018:ffffaece810efab8 EFLAGS: 00010286
> [717543.115372][T897981] RAX: 0000000000000000 RBX: ffffcea96b4ae350 RCX: 0000000000000010
> [717543.127887][T897981] RDX: 0000000000000030 RSI: ffffffffac168443 RDI: ffffffffff5ffdfd
> [717543.140325][T897981] RBP: ffffaece810efb28 R08: ffff9e61e3b27c80 R09: 000000000000e000
> [717543.152712][T897981] R10: 0000000000000041 R11: ffffffffff600c7d R12: 00028c9a1e371991
> [717543.165011][T897981] R13: 0000000000000000 R14: ffff9e6339dce8c0 R15: ffff9e61e3b27c00
> [717543.177253][T897981] FS: 00007f769a1fd6c0(0000) GS:ffff9e6bdfa80000(0000) knlGS:0000000000000000
> [717543.194511][T897981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [717543.205261][T897981] CR2: ffffffffff600c7d CR3: 0000003d21706005 CR4: 0000000000770ee0
> [717543.217411][T897981] PKRU: 55555554
> [717543.224999][T897981] Call Trace:
> [717543.232224][T897981] <TASK>
> [717543.239016][T897981] ? __die+0x20/0x70
> [717543.246661][T897981] ? page_fault_oops+0x150/0x490
> [717543.255270][T897981] ? __sk_dst_check+0x39/0xa0
> [717543.263548][T897981] ? inet6_csk_route_socket+0x123/0x200
> [717543.272622][T897981] ? exc_page_fault+0x67/0x140
> [717543.280831][T897981] ? asm_exc_page_fault+0x22/0x30
> [717543.289230][T897981] ? tcp_data_queue+0xc03/0xe20
> [717543.297374][T897981] ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
> [717543.307555][T897981] ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x281/0xada
> [717543.317638][T897981] ? tcp_data_queue+0xc03/0xe20
> [717543.325540][T897981] bpf_trace_run3+0x92/0xc0
> [717543.333026][T897981] ? tcp_data_queue+0xc03/0xe20
> [717543.340823][T897981] kfree_skb_reason+0x7b/0xd0
> [717543.348427][T897981] tcp_data_queue+0xc03/0xe20
> [717543.355985][T897981] tcp_rcv_established+0x218/0x740
> [717543.363944][T897981] tcp_v4_do_rcv+0x157/0x290
> [717543.371315][T897981] tcp_v4_rcv+0xddd/0xf00
> [717543.378330][T897981] ? raw_local_deliver+0xc0/0x230
> [717543.385973][T897981] ip_protocol_deliver_rcu+0x32/0x200
> [717543.393880][T897981] ip_local_deliver_finish+0x73/0xa0
> [717543.401616][T897981] __netif_receive_skb_one_core+0x8b/0xa0
> [717543.409751][T897981] netif_receive_skb+0x38/0x160
> [717543.416920][T897981] tun_get_user+0xbe6/0x1080 [tun]
> [717543.424292][T897981] ? mlx5e_handle_rx_dim+0x6b/0x80 [mlx5_core]
> [717543.432754][T897981] ? mlx5e_napi_poll+0x710/0x720 [mlx5_core]
> [717543.441007][T897981] ? tun_chr_write_iter+0x69/0xb0 [tun]
> [717543.448753][T897981] tun_chr_write_iter+0x69/0xb0 [tun]
> [717543.456312][T897981] vfs_write+0x2a3/0x3b0
> [717543.462722][T897981] ksys_write+0x5f/0xe0
> [717543.469018][T897981] do_syscall_64+0x3b/0x90
> [717543.475522][T897981] entry_SYSCALL_64_after_hwframe+0x4c/0xb6
> [717543.483443][T897981] RIP: 0033:0x7f76b3b3027f
> [717543.489848][T897981] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 8c d5 f8 ff 48
> [717543.515551][T897981] RSP: 002b:00007f769a1f9870 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> [717543.526219][T897981] RAX: ffffffffffffffda RBX: 0000000000000500 RCX: 00007f76b3b3027f
> [717543.536507][T897981] RDX: 0000000000000500 RSI: 00007f761a694a00 RDI: 00000000000015a4
> [717543.546815][T897981] RBP: 00007f75f53cf600 R08: 0000000000000000 R09: 00000000000272c8
> [717543.557136][T897981] R10: 00000000000075dc R11: 0000000000000293 R12: 00007f76b37b0198
> [717543.567447][T897981] R13: 0000000000000000 R14: 00007f76b37a4000 R15: 0000000000000004
> [717543.577777][T897981] </TASK>
> [717543.583106][T897981] Modules linked in: mptcp_diag raw_diag unix_diag xt_LOG nf_log_syslog overlay nft_compat xt_hashlimit ip_set_hash_netport xt_length esp4 nf_conntrack_netlink nft_fwd_netdev nf_dup_netdev xfrm_interface xfrm6_tunnel nft_numgen nft_log nft_limit dummy xfrm_user xfrm_algo fou6 ip6_tunnel tunnel6 ipip mpls_gso mpls_iptunnel mpls_router sit tunnel4 fou nft_ct nf_tables cls_bpf ip_gre gre ip_tunnel geneve ip6_udp_tunnel udp_tunnel zstd zstd_compress zram zsmalloc sch_ingress tcp_diag veth tun udp_diag inet_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6table_mangle ip6table_raw ip6table_security ip6table_nat ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpmss iptable_filter xt_TCPMSS xt_bpf xt_limit xt_multiport xt_NFLOG nfnetlink_log xt_connbytes xt_connlabel xt_statistic xt_mark xt_connmark xt_conntrack iptable_mangle xt_nat iptable_nat nf_nat xt_owner xt_set xt_comment xt_tcpudp xt_CT iptable_raw
> [717543.583186][T897981] ip_set_hash_ip ip_set_hash_net ip_set nfnetlink tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 algif_skcipher af_alg raid0 md_mod essiv dm_crypt trusted asn1_encoder tee 8021q garp mrp stp llc nvme_fabrics ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 mlx5_core aesni_intel acpi_ipmi rapl ipmi_si mlxfw xhci_pci nvme tls ipmi_devintf tiny_power_button xhci_hcd nvme_core psample ccp i2c_piix4 ipmi_msghandler button fuse dm_mod dax efivarfs ip_tables x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
> [717543.774881][T897981] CR2: ffffffffff600c7d
>
> The panic happens as we inspect dropped out-of-order TCP packets in the
> kfree_skb tracepoint with a tp_btf program and try to read out the
> network namespace cookie via:
>
>   skb->dev->nd_net.net->net_cookie
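
A minimal tp_btf program along these lines would look roughly as
follows. This is an illustrative sketch to make the access pattern
concrete - the program name and the bpf_printk output are made up, it
is not necessarily the exact program running in production:

  /* Sketch: kfree_skb tp_btf program reading the netns cookie through
   * skb->dev. Names here are illustrative only.
   */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  SEC("tp_btf/kfree_skb")
  int BPF_PROG(drop_monitor, struct sk_buff *skb, void *location,
               enum skb_drop_reason reason)
  {
          /* The verifier sees BTF-typed pointers and emits PROBE_MEM
           * loads for each dereference in this chain. If the skb sat
           * in the TCP out-of-order queue, the dev slot actually holds
           * rb-tree node bits (see below).
           */
          uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;

          bpf_printk("drop reason=%d netns_cookie=%llu",
                     reason, netns_cookie);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";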
> Code generation looks fine on x86_64 with 4-level page tables, but the
> verifier-placed bounds check is not sufficient to catch the issue:
> skb->dev is aliased with skb->rbnode in the same union once a packet
> has entered the TCP state machine, and the out-of-order queue is one
> of those rbnode users:
>
> ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
>   2bd: movabs $0x800000000010,%r11
>   2c7: cmp    %r11,%r15
>   2ca: jb     0x000002d8
>   2cc: mov    %r15,%r11
>   2cf: add    $0x10,%r11
>   2d6: jae    0x000002dc
>   2d8: xor    %edi,%edi
>   2da: jmp    0x000002e0
>   2dc: mov    0x10(%r15),%rdi
> ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
>   2e0: movabs $0x8000000004f8,%r11
>   2ea: cmp    %r11,%rdi          <--- (1) rdi is a valid rbnode*, not a net_device*
>   2ed: jb     0x000002fb
>   2ef: mov    %rdi,%r11
>   2f2: add    $0x4f8,%r11
>   2f9: jae    0x000002ff
>   2fb: xor    %edi,%edi
>   2fd: jmp    0x00000306
>   2ff: mov    0x4f8(%rdi),%rdi
> ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
>   306: movabs $0x800000000e80,%r11
>   310: cmp    %r11,%rdi          <--- (2) rdi is a wild pointer now
>   313: jb     0x00000321
>   315: mov    %rdi,%r11
>   318: add    $0xe80,%r11
>   31f: jae    0x00000326
>   321: xor    %r13d,%r13d
>   324: jmp    0x0000032d
>   326: mov    0xe80(%rdi),%r13   <--- (3) fault
>   32d: mov    %rbp,%rsi
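
For context on the aliasing mentioned above, the relevant union sits at
the top of struct sk_buff; paraphrased from include/linux/skbuff.h
around v6.1, with members trimmed (exact layout varies by kernel
version):

  struct sk_buff {
          union {
                  struct {
                          /* These two members must be first to match
                           * sk_buff_head: */
                          struct sk_buff *next;
                          struct sk_buff *prev;
                          union {
                                  struct net_device *dev;
                                  unsigned long      dev_scratch;
                          };
                  };
                  struct rb_node   rbnode; /* netem, ip4 defrag, tcp OOO queue */
                  struct list_head list;
          };
          /* ... */
  };

skb->dev sits at offset 0x10 and overlaps rbnode.rb_left, which is why
the mov 0x10(%r15),%rdi above loads a perfectly valid rb_node pointer
where the program expects a struct net_device pointer.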
>
> Out-of-order packets happen a lot on our servers, but this is the
> first time we noticed such a panic even though the program has been
> deployed for a while. For the bpf list, the question is mainly what to
> do in this scenario: at step (1) above the value is apparently a valid
> kernel pointer, just not of the type we assumed, which leads to a wild
> pointer at (2) and the fault at (3). I am not aware of a general way
> to tell whether such an aliased pointer is good or not. Is it possible
> to make the page fault safer in this case, e.g. by returning from the
> PF handler to the end of the tracing program?
>
> thanks
> Yan

I think it is not supposed to panic: the exception handling for such
PROBE_MEM loads should catch this case and mark the destination
register as zero. Something must be broken with that. Which kernel do
you observe this problem with, and do you have a reference version
where you do not see it?

Do you have a reduced reproducer for this that I could play with, i.e.
just the part of the tp_btf program necessary to trigger this?

There were some changes made to the JIT code around the bounds checking
to reduce the instruction count, in 90156f4bfa21 ("bpf, x86: Improve
PROBE_MEM runtime load check") - especially for the case where src_reg
== dst_reg, which is what happens in the splat at 0x2ff. Nothing else
comes immediately to mind in terms of changes that could affect this
exception handling.
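
For reference, here is my reading of the guard in the disassembly above
as a simplified C rendering - this is an interpretation of the emitted
instruction pattern, not actual kernel source:

  /* Runtime guard around a PROBE_MEM load dst = *(u64 *)(src + off).
   * 0x800000000000 is TASK_SIZE_MAX + PAGE_SIZE on x86_64 with 4-level
   * paging; the movabs constant in the splat is that base plus the
   * load offset.
   */
  typedef unsigned long long u64;

  static u64 probe_mem_load(u64 src, u64 off)
  {
          if (src < 0x800000000000ULL + off)   /* NULL-ish or user ptr */
                  return 0;
          if (src + off < src)                 /* wraps past the top */
                  return 0;
          /* A canonical kernel address of the wrong type passes the
           * guard and gets dereferenced. If the load faults, the
           * extable fixup is supposed to zero the destination register
           * and continue instead of oopsing.
           */
          return *(u64 *)(src + off);
  }

In other words, the guard only filters out NULL-ish and user-space
values; a mistyped kernel pointer like the one at (1) passes it by
design, and the faulting load at (3) is exactly what the exception
table fixup is there to absorb.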