Re: Page faults in tracepoint caused by aliased pointer

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Feb 12, 2024 at 5:16 PM Ignat Korchagin <ignat@xxxxxxxxxxxxxx> wrote:
>
> On Mon, Feb 12, 2024 at 10:55 PM Kumar Kartikeya Dwivedi
> <memxor@xxxxxxxxx> wrote:
> >
> > On Mon, 12 Feb 2024 at 23:14, Yan Zhai <yan@xxxxxxxxxxxxxx> wrote:
> > >
> > > Hello!
> > >
> > > We are getting page fault errors inside BPF tracepoint that accessed
> > > not-present pages. This caused kernel panic:
> > >
> > > [717542.963064][T897981] BUG: unable to handle page fault for address: ffffffffff600c7d
> > > [717542.975692][T897981] #PF: supervisor read access in kernel mode
> > > [717542.986496][T897981] #PF: error_code(0x0000) - not-present page
> > > [717542.997237][T897981] PGD 1965012067 P4D 1965012067 PUD 1965014067 PMD 1965016067 PTE 0
> > > [717543.009965][T897981] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > [717543.019835][T897981] CPU: 34 PID: 897981 Comm: warp-service Kdump: loaded Tainted: G           O       6.1.74-cloudflare-2024.1.14 #1
> > > [717543.041140][T897981] Hardware name: HYVE EDGE-METAL-GEN11/HS1811D_Lite, BIOS V0.11-sig 12/23/2022
> > > [717543.059260][T897981] RIP: 0010:bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
> > > [717543.071449][T897981] Code: ff eb 07 48 8b bf f8 04 00 00 49 bb 80 0e 00 00 00 80 00 00 4c 39 df 72 0c 49 89 fb 49 81 c3 80 0e 00 00 73 05 45 31 ed eb 07 <4c> 8b af 80 0e 00 00 48 89 ee 48 83 c6 f0 48 bf 00 04 7a 0a 3d 9e
> > > [717543.104780][T897981] RSP: 0018:ffffaece810efab8 EFLAGS: 00010286
> > > [717543.115372][T897981] RAX: 0000000000000000 RBX: ffffcea96b4ae350 RCX: 0000000000000010
> > > [717543.127887][T897981] RDX: 0000000000000030 RSI: ffffffffac168443 RDI: ffffffffff5ffdfd
> > > [717543.140325][T897981] RBP: ffffaece810efb28 R08: ffff9e61e3b27c80 R09: 000000000000e000
> > > [717543.152712][T897981] R10: 0000000000000041 R11: ffffffffff600c7d R12: 00028c9a1e371991
> > > [717543.165011][T897981] R13: 0000000000000000 R14: ffff9e6339dce8c0 R15: ffff9e61e3b27c00
> > > [717543.177253][T897981] FS:  00007f769a1fd6c0(0000) GS:ffff9e6bdfa80000(0000) knlGS:0000000000000000
> > > [717543.194511][T897981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [717543.205261][T897981] CR2: ffffffffff600c7d CR3: 0000003d21706005 CR4: 0000000000770ee0
> > > [717543.217411][T897981] PKRU: 55555554
> > > [717543.224999][T897981] Call Trace:
> > > [717543.232224][T897981]  <TASK>
> > > [717543.239016][T897981]  ? __die+0x20/0x70
> > > [717543.246661][T897981]  ? page_fault_oops+0x150/0x490
> > > [717543.255270][T897981]  ? __sk_dst_check+0x39/0xa0
> > > [717543.263548][T897981]  ? inet6_csk_route_socket+0x123/0x200
> > > [717543.272622][T897981]  ? exc_page_fault+0x67/0x140
> > > [717543.280831][T897981]  ? asm_exc_page_fault+0x22/0x30
> > > [717543.289230][T897981]  ? tcp_data_queue+0xc03/0xe20
> > > [717543.297374][T897981]  ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x326/0xada
> > > [717543.307555][T897981]  ? bpf_prog_2eca29f2a4f78ed1_drop_monitor+0x281/0xada
> > > [717543.317638][T897981]  ? tcp_data_queue+0xc03/0xe20
> > > [717543.325540][T897981]  bpf_trace_run3+0x92/0xc0
> > > [717543.333026][T897981]  ? tcp_data_queue+0xc03/0xe20
> > > [717543.340823][T897981]  kfree_skb_reason+0x7b/0xd0
> > > [717543.348427][T897981]  tcp_data_queue+0xc03/0xe20
> > > [717543.355985][T897981]  tcp_rcv_established+0x218/0x740
> > > [717543.363944][T897981]  tcp_v4_do_rcv+0x157/0x290
> > > [717543.371315][T897981]  tcp_v4_rcv+0xddd/0xf00
> > > [717543.378330][T897981]  ? raw_local_deliver+0xc0/0x230
> > > [717543.385973][T897981]  ip_protocol_deliver_rcu+0x32/0x200
> > > [717543.393880][T897981]  ip_local_deliver_finish+0x73/0xa0
> > > [717543.401616][T897981]  __netif_receive_skb_one_core+0x8b/0xa0
> > > [717543.409751][T897981]  netif_receive_skb+0x38/0x160
> > > [717543.416920][T897981]  tun_get_user+0xbe6/0x1080 [tun]
> > > [717543.424292][T897981]  ? mlx5e_handle_rx_dim+0x6b/0x80 [mlx5_core]
> > > [717543.432754][T897981]  ? mlx5e_napi_poll+0x710/0x720 [mlx5_core]
> > > [717543.441007][T897981]  ? tun_chr_write_iter+0x69/0xb0 [tun]
> > > [717543.448753][T897981]  tun_chr_write_iter+0x69/0xb0 [tun]
> > > [717543.456312][T897981]  vfs_write+0x2a3/0x3b0
> > > [717543.462722][T897981]  ksys_write+0x5f/0xe0
> > > [717543.469018][T897981]  do_syscall_64+0x3b/0x90
> > > [717543.475522][T897981]  entry_SYSCALL_64_after_hwframe+0x4c/0xb6
> > > [717543.483443][T897981] RIP: 0033:0x7f76b3b3027f
> > > [717543.489848][T897981] Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 39 d5 f8 ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 8c d5 f8 ff 48
> > > [717543.515551][T897981] RSP: 002b:00007f769a1f9870 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
> > > [717543.526219][T897981] RAX: ffffffffffffffda RBX: 0000000000000500 RCX: 00007f76b3b3027f
> > > [717543.536507][T897981] RDX: 0000000000000500 RSI: 00007f761a694a00 RDI: 00000000000015a4
> > > [717543.546815][T897981] RBP: 00007f75f53cf600 R08: 0000000000000000 R09: 00000000000272c8
> > > [717543.557136][T897981] R10: 00000000000075dc R11: 0000000000000293 R12: 00007f76b37b0198
> > > [717543.567447][T897981] R13: 0000000000000000 R14: 00007f76b37a4000 R15: 0000000000000004
> > > [717543.577777][T897981]  </TASK>
> > > [717543.583106][T897981] Modules linked in: mptcp_diag raw_diag unix_diag xt_LOG nf_log_syslog overlay nft_compat xt_hashlimit ip_set_hash_netport xt_length esp4 nf_conntrack_netlink nft_fwd_netdev nf_dup_netdev xfrm_interface xfrm6_tunnel nft_numgen nft_log nft_limit dummy xfrm_user xfrm_algo fou6 ip6_tunnel tunnel6 ipip mpls_gso mpls_iptunnel mpls_router sit tunnel4 fou nft_ct nf_tables cls_bpf ip_gre gre ip_tunnel geneve ip6_udp_tunnel udp_tunnel zstd zstd_compress zram zsmalloc sch_ingress tcp_diag veth tun udp_diag inet_diag dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6table_mangle ip6table_raw ip6table_security ip6table_nat ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpmss iptable_filter xt_TCPMSS xt_bpf xt_limit xt_multiport xt_NFLOG nfnetlink_log xt_connbytes xt_connlabel xt_statistic xt_mark xt_connmark xt_conntrack iptable_mangle xt_nat iptable_nat nf_nat xt_owner xt_set xt_comment xt_tcpudp xt_CT iptable_raw
> > > [717543.583186][T897981]  ip_set_hash_ip ip_set_hash_net ip_set nfnetlink tcp_bbr sch_fq nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 algif_skcipher af_alg raid0 md_mod essiv dm_crypt trusted asn1_encoder tee 8021q garp mrp stp llc nvme_fabrics ipmi_ssif amd64_edac kvm_amd kvm irqbypass crc32_pclmul crc32c_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 mlx5_core aesni_intel acpi_ipmi rapl ipmi_si mlxfw xhci_pci nvme tls ipmi_devintf tiny_power_button xhci_hcd nvme_core psample ccp i2c_piix4 ipmi_msghandler button fuse dm_mod dax efivarfs ip_tables x_tables bcmcrypt(O) crypto_simd cryptd [last unloaded: kheaders]
> > > [717543.774881][T897981] CR2: ffffffffff600c7d
> > >
> > > The panic happens as we inspect dropped out of order TCP packets in kfree_skb
> > > tracepoint with a tp_btf program, and try to read out the network namespace
> > > cookie via:
> > >
> > > skb->dev->nd_net.net->net_cookie
> > >
> > > Code generation looks fine on x86_64 with 4 layer pagetable, but the verifier
> > > placed boundary check is not sufficient to catch the issue: skb->dev is alised
> > > as skb->rbnode in the same union after packets entered TCP state machine, and
> > > the out of order queue is one of such rbnode users:
> > >
> > > ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
> > >  2bd:   movabs $0x800000000010,%r11
> > >  2c7:   cmp    %r11,%r15
> > >  2ca:   jb     0x000002d8
> > >  2cc:   mov    %r15,%r11
> > >  2cf:   add    $0x10,%r11
> > >  2d6:   jae    0x000002dc
> > >  2d8:   xor    %edi,%edi
> > >  2da:   jmp    0x000002e0
> > >  2dc:   mov    0x10(%r15),%rdi
> > > ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
> > >  2e0:   movabs $0x8000000004f8,%r11
> > >  2ea:   cmp    %r11,%rdi   <--- (1) rdi is a valid rbnode*, not net_device*
> > >  2ed:   jb     0x000002fb
> > >  2ef:   mov    %rdi,%r11
> > >  2f2:   add    $0x4f8,%r11
> > >  2f9:   jae    0x000002ff
> > >  2fb:   xor    %edi,%edi
> > >  2fd:   jmp    0x00000306
> > >  2ff:   mov    0x4f8(%rdi),%rdi
> > > ; uint64_t netns_cookie = skb->dev->nd_net.net->net_cookie;
> > >  306:   movabs $0x800000000e80,%r11
> > >  310:   cmp    %r11,%rdi  <--- (2) rdi is a wild ptr now
> > >  313:   jb     0x00000321
> > >  315:   mov    %rdi,%r11
> > >  318:   add    $0xe80,%r11
> > >  31f:   jae    0x00000326
> > >  321:   xor    %r13d,%r13d
> > >  324:   jmp    0x0000032d
> > >  326:   mov    0xe80(%rdi),%r13 <--- (3) fault
> > >  32d:   mov    %rbp,%rsi
> > >
> > > OOO happens a lot on our servers but this is the first time we noticed
> > > such panic since we had deployed the program for a while. For bpf list
> > > I think the question is mainly about what to do in this scenario:
> > > apparently it is a valid kernel pointer at step (1) above, but it's
> > > just not the type we assumed, which leads to a wild pointer at (2) and
> > > caused fault at (3). I am not aware of a way to determine such aliased
> > > pointer is good or not in general. Is it possible to PF safer in this
> > > case, like returning from PF handler to the end of tracing program?
> > >
> >
> > I think it is not supposed to panic, since exception handling for such
> > PROBE_MEM loads should handle such a case and mark the destination as
> > zero.
> > Something must be broken with that.

Just to clarify, by exception handling, do you mean there are some
special treatments in the page fault handler? Or do you mean page
fault should not happen in the first place because the BPF verifier
should catch it?

> >
> > Which kernel do you observe this problem with? And do you have a
> > reference version where you do not see it?
We saw this once today on the 6.1.74 kernel with some internally
vendor patches, but none of those ring a bell to this situation. Other
6.1.74 kernel does fine. To be clear, I did see netns cookies being
read as 0 when the skb->dev is NULL. But not sure if it also can deal
with a valid but aliased pointer this way.

> > Do you have a reduced reproducer for this that I could play with?
> > Just the part of the tp_btf program necessary to trigger this?
>
> We were able to reproduce this with a simple bpftrace (version 0.17.1):
>
> $ sudo bpftrace -kk -e 'BEGIN { print(*(uint8 *)0xffffffffff600c7d); exit(); }'
> WARNING: Addrspace is not set
> Attaching 1 probe...
> Killed
> [288931.216699][T109754] BUG: unable to handle page fault for address:
> ffffffffff600c7d
> [288931.217143][T109754] #PF: supervisor read access in kernel mode
> [288931.217143][T109754] #PF: error_code(0x0000) - not-present page
> [288931.217143][T109754] PGD fa5a1e067 P4D fa5a1e067 PUD fa5a20067 PMD
> fa5a22067 PTE 0
> [288931.217143][T109754] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [288931.217143][T109754] CPU: 4 PID: 109754 Comm: bpftrace Not tainted
> 6.6.16+ #10
> [288931.217143][T109754] Hardware name: KubeVirt None/RHEL, BIOS
> edk2-20221207gitfff6d81270b5-9.el9 12/07/2022
> [288931.217143][T109754] RIP: 0010:copy_from_kernel_nofault+0x89/0xe0
> [288931.217143][T109754] Code: 48 83 c3 04 48 83 c5 04 49 83 ec 04 49
> 83 fc 01 76 13 66 8b 03 66 89 45 00 48 83 c3 02 48 83 c5 02 49 83 ec
> 02 4d 85 e4 74 05 <8a> 03 88 45 00 65 48 8b 04 25 80 11 03 00 83 a8 54
> 1b 00 00 01 31
> [288931.217143][T109754] RSP: 0018:ffff93d787777d38 EFLAGS: 00010202
> [288931.217143][T109754] RAX: ffff910d4830a0c0 RBX: ffffffffff600c7d
> RCX: 0000000000000010
> [288931.217143][T109754] RDX: 0000000000000030 RSI: 0000000000000001
> RDI: ffffffffff600c7d
> [288931.217143][T109754] RBP: ffff93d787777d87 R08: 0101010101010101
> R09: 00000000fffffdf4
> [288931.217143][T109754] R10: 0000000000000000 R11: 0000000000000000
> R12: 0000000000000001
> [288931.217143][T109754] R13: ffff93d7807eb000 R14: 000000000000000a
> R15: 0000000000000000
> [288931.217143][T109754] FS:  00007f2818fd79c0(0000)
> GS:ffff911c7f800000(0000) knlGS:0000000000000000
> [288931.217143][T109754] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [288931.217143][T109754] CR2: ffffffffff600c7d CR3: 00000001174f4000
> CR4: 0000000000350ee0
> [288931.217143][T109754] Call Trace:
> [288931.217143][T109754]  <TASK>
> [288931.217143][T109754]  ? __die+0x1f/0x70
> [288931.217143][T109754]  ? page_fault_oops+0x151/0x480
> [288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
> [288931.217143][T109754]  ? alloc_empty_file+0x7a/0x120
> [288931.217143][T109754]  ? __d_instantiate+0x34/0xf0
> [288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
> [288931.217143][T109754]  ? alloc_file+0x9b/0x170
> [288931.217143][T109754]  ? exc_page_fault+0x68/0x140
> [288931.217143][T109754]  ? asm_exc_page_fault+0x22/0x30
> [288931.217143][T109754]  ? copy_from_kernel_nofault+0x89/0xe0
> [288931.217143][T109754]  ? copy_from_kernel_nofault+0x1d/0xe0
> [288931.217143][T109754]  bpf_probe_read_compat+0x6a/0x90
> [288931.217143][T109754]  bpf_prog_27620a19791a7c9c_BEGIN+0x2e/0xe7
> [288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
> [288931.217143][T109754]  __bpf_prog_test_run_raw_tp+0x2e/0x90
> [288931.217143][T109754]  bpf_prog_test_run_raw_tp+0xe6/0x1c0
> [288931.217143][T109754]  __sys_bpf+0x93a/0x26a0
> [288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
> [288931.217143][T109754]  ? __check_object_size+0x16a/0x2c0
> [288931.217143][T109754]  ? srso_return_thunk+0x5/0x10
> [288931.217143][T109754]  __x64_sys_bpf+0x1a/0x30
> [288931.217143][T109754]  do_syscall_64+0x3a/0x90
> [288931.217143][T109754]  entry_SYSCALL_64_after_hwframe+0x56/0xc0
> [288931.217143][T109754] RIP: 0033:0x7f281b320719
> [288931.217143][T109754] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00
> 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c
> 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7
> d8 64 89 01 48
> [288931.217143][T109754] RSP: 002b:00007ffdab7f0118 EFLAGS: 00000246
> ORIG_RAX: 0000000000000141
> [288931.217143][T109754] RAX: ffffffffffffffda RBX: 00007ffdab7f01e8
> RCX: 00007f281b320719
> [288931.217143][T109754] RDX: 0000000000000050 RSI: 00007ffdab7f0120
> RDI: 000000000000000a
> [288931.217143][T109754] RBP: 00000000024e27a0 R08: 00007f28100cf880
> R09: 0000000000000016
> [288931.217143][T109754] R10: 0000000000000007 R11: 0000000000000246
> R12: 00000000ffffffff
> [288931.217143][T109754] R13: 00007ffdab7f10c8 R14: 000000000000000d
> R15: 00000000024e27a0
> [288931.217143][T109754]  </TASK>
> [288931.217143][T109754] Modules linked in: xt_bpf xt_conntrack
> nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype
> nft_compat nf_tables br_netfilter bridge stp llc overlay kvm_amd ccp
> kvm irqbypass crc32_pclmul sha512_ssse3 sha256_ssse3 sha1_ssse3
> aesni_intel crypto_simd cryptd virtio_balloon virtio_console
> tiny_power_button button fuse dm_mod dax configfs nfnetlink efivarfs
> ip_tables x_tables virtio_net net_failover virtio_blk virtio_scsi
> failover crc32c_intel i2c_i801 virtio_pci i2c_smbus
> virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
> [288931.217143][T109754] CR2: ffffffffff600c7d
> [288931.217143][T109754] ---[ end trace 0000000000000000 ]---
> [288931.509063][T109754] RIP: 0010:copy_from_kernel_nofault+0x89/0xe0
> [288931.509063][T109754] Code: 48 83 c3 04 48 83 c5 04 49 83 ec 04 49
> 83 fc 01 76 13 66 8b 03 66 89 45 00 48 83 c3 02 48 83 c5 02 49 83 ec
> 02 4d 85 e4 74 05 <8a> 03 88 45 00 65 48 8b 04 25 80 11 03 00 83 a8 54
> 1b 00 00 01 31
> [288931.509063][T109754] RSP: 0018:ffff93d787777d38 EFLAGS: 00010202
> [288931.509063][T109754] RAX: ffff910d4830a0c0 RBX: ffffffffff600c7d
> RCX: 0000000000000010
> [288931.509063][T109754] RDX: 0000000000000030 RSI: 0000000000000001
> RDI: ffffffffff600c7d
> [288931.509063][T109754] RBP: ffff93d787777d87 R08: 0101010101010101
> R09: 00000000fffffdf4
> [288931.509063][T109754] R10: 0000000000000000 R11: 0000000000000000
> R12: 0000000000000001
> [288931.509063][T109754] R13: ffff93d7807eb000 R14: 000000000000000a
> R15: 0000000000000000
> [288931.509063][T109754] FS:  00007f2818fd79c0(0000)
> GS:ffff911c7f800000(0000) knlGS:0000000000000000
> [288931.509063][T109754] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [288931.509063][T109754] CR2: ffffffffff600c7d CR3: 00000001174f4000
> CR4: 0000000000350ee0
> [288931.509063][T109754] note: bpftrace[109754] exited with irqs disabled
> [288932.319062][T109754] note: bpftrace[109754] exited with preempt_count 1
>
> And Jakub CCed here did it for 6.8.0-rc2+
>
> > There were some changes made to the JIT code around the bounds
> > checking to reduce the instruction count.
> > That was in 90156f4bfa21 ("bpf, x86: Improve PROBE_MEM runtime load check").
> > Especially when src_reg == dst_reg, the case which happens in the
> > splat at 0x2ff.
> > Nothing else comes immediately to mind in terms of changes that could
> > affect this exception handling stuff.

Thanks, let me check if this is included or helps.

> >
> > > thanks
> > > Yan
> > >
>
> Ignat





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux