> On Tue, 28 Jun 2022 17:25:05 +0200 Julien Salleyron wrote:
>> This patch allows using KTLS on a socket where we apply sk_redirect
>> using a BPF verdict program.
>>
>> Without this patch, we see that the data received after the redirection
>> is decrypted, but with an incorrect offset and length. It seems to us
>> that the offset and length are correct in the stream-parser data but
>> are never applied to the skb. We simply apply those values to the skb.
>>
>> In the case of regular sockets, we saw a big performance improvement
>> from applying redirect. This is not the case with KTLS, which may be
>> related to the following point.
>
> It's because kTLS does a very expensive reallocation and copy for the
> non-zerocopy case (which currently means all of TLS 1.3). I have
> code almost ready to fix that (just needs to be reshuffled into
> upstreamable patches). Brings us up from 5.9 Gbps to 8.4 Gbps per CPU
> on my test box with 16k records. Probably much more than that with
> smaller records.
>
>> It is still necessary to perform a read operation (never triggered)
>> from user space despite the redirection. It makes no sense, since this
>> read operation is not necessary on regular sockets without KTLS.
>>
>> We do not see how to fix this problem without a change of architecture,
>> for example by performing TLS decrypt directly inside the BPF verdict
>> program.
>>
>> An example program can be found at
>> https://github.com/juliens/ktls-bpf_redirect-example/
>>
>> Co-authored-by: Marc Vertes <mvertes@xxxxxxx>
>> ---
>>  net/tls/tls_sw.c                           | 6 ++++++
>>  tools/testing/selftests/bpf/test_sockmap.c | 8 +++-----
>>  2 files changed, 9 insertions(+), 5 deletions(-)
>>
>> diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
>> index 0513f82b8537..a409f8a251db 100644
>> --- a/net/tls/tls_sw.c
>> +++ b/net/tls/tls_sw.c
>> @@ -1839,8 +1839,14 @@ int tls_sw_recvmsg(struct sock *sk,
>>  		if (bpf_strp_enabled) {
>>  			/* BPF may try to queue the skb */
>>  			__skb_unlink(skb, &ctx->rx_list);
>> +
>>  			err = sk_psock_tls_strp_read(psock, skb);
>> +
>>  			if (err != __SK_PASS) {
>> +				if (err == __SK_REDIRECT) {
>> +					skb->data += rxm->offset;
>> +					skb->len = rxm->full_len;
>> +				}
>

Reading the code for the ktls + ebpf redirect case along the following
call chain shows the problem with the above modification:

  sk_psock_tls_strp_read()
    -> sk_psock_tls_verdict_apply()
      -> sk_psock_skb_redirect()
        -> skb_queue_tail()                 // add skb to the ingress_skb queue
        -> schedule_work() [sk_psock_backlog]
    // sk_psock_tls_strp_read() then returns

sk_psock_backlog() walks the sk_psock->ingress_skb queue and, after
sk_psock_handle_skb(), calls kfree_skb() if the skb is not for ingress.
So I think the above modification has a use-after-free race on the skb.
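To make the ownership hand-off concrete, below is a toy userspace model
of the suspected race. This is NOT kernel code: the struct, the queue
and the thread merely stand in for the skb, psock->ingress_skb and the
backlog worker, and every name in it is invented for illustration.

/* Toy model of the suspected race. Once the "skb" is queued for the
 * worker (the skb_queue_tail() + schedule_work() step above), the
 * worker owns it and may free it at any time; adjusting skb->data /
 * skb->len afterwards races with that free. Build with -pthread;
 * ASAN or valgrind will flag the use-after-free.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct toy_skb {
	char *data;
	int len;
};

static struct toy_skb *ingress_queue;	/* stands in for psock->ingress_skb */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

/* Stands in for sk_psock_backlog(): takes the queued skb and frees it,
 * like kfree_skb() on the non-ingress path.
 */
static void *backlog_worker(void *arg)
{
	struct toy_skb *skb = NULL;

	while (!skb) {
		pthread_mutex_lock(&qlock);
		skb = ingress_queue;
		ingress_queue = NULL;
		pthread_mutex_unlock(&qlock);
	}
	free(skb->data);
	free(skb);
	return NULL;
}

int main(void)
{
	pthread_t worker;
	struct toy_skb *skb = malloc(sizeof(*skb));

	skb->data = malloc(4096);
	skb->len = 4096;

	pthread_create(&worker, NULL, backlog_worker, NULL);

	/* "__SK_REDIRECT": hand the skb over to the worker... */
	pthread_mutex_lock(&qlock);
	ingress_queue = skb;
	pthread_mutex_unlock(&qlock);

	/* ...and then keep adjusting it, as the patch does with
	 * skb->data += rxm->offset; skb->len = rxm->full_len;
	 * Depending on scheduling, this touches freed memory.
	 */
	usleep(1000);
	skb->data += 13;	/* potential use-after-free */
	skb->len -= 13;		/* potential use-after-free */

	pthread_join(worker, NULL);
	printf("done (run under ASAN to see the race reported)\n");
	return 0;
}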
I backported the above modification to linux-5.10 as follows:

--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1882,6 +1882,10 @@ int tls_sw_recvmsg(struct sock *sk,
 		if (bpf_strp_enabled) {
 			err = sk_psock_tls_strp_read(psock, skb);
 			if (err != __SK_PASS) {
+				if (err == __SK_REDIRECT) {
+					skb->data += rxm->offset;
+					skb->len = rxm->full_len;
+				}
 				rxm->offset = rxm->offset + rxm->full_len;

Running the following ktls + ebpf redirect model then panics the system:

      ktls                          ktls
  sender    proxy_recv    proxy_send    recv
    |           |             |          |
    |        verdict ---------+          |
    |    redirect_egress      |          |
    +-----------+             +----------+

The call trace is below. It may not be the skb race problem described
above, but either way the modification is not OK for the ktls + ebpf
redirect model.

[  816.687852] BUG: unable to handle page fault for address: ffff8f057ffdf000
[  816.687998] Oops: 0000 [#1] SMP PTI
[  816.688025] CPU: 2 PID: 619 Comm: kworker/2:3 Kdump: loaded Not tainted 5.10.0-ktls+ #4
[  816.688122] Workqueue: events sk_psock_backlog
[  816.688152] RIP: 0010:memcpy_erms+0x6/0x10
[  816.688263] RSP: 0018:ffff9c60c01cfc58 EFLAGS: 00010286
[  816.688292] RAX: ffff8f054f880000 RBX: 0000000000008000 RCX: 00000000000044bb
[  816.688330] RDX: 0000000000008000 RSI: ffff8f057ffdf000 RDI: ffff8f054f883b45
[  816.688376] RBP: 0000000000008000 R08: 0000000000000001 R09: 0000000000000000
[  816.688412] R10: 0000000000000001 R11: ffff9c60c01cfdd0 R12: ffff9c60c01cfdd0
[  816.688447] R13: ffff8f054f880000 R14: ffff9c60c01cfdb0 R15: 0000000000008000
[  816.688480] FS:  0000000000000000(0000) GS:ffff8f0775d00000(0000) knlGS:0000000000000000
[  816.688524] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  816.688564] CR2: ffff8f057ffdf000 CR3: 0000000120b16005 CR4: 0000000000370ee0
[  816.688600] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  816.688631] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  816.688661] Call Trace:
[  816.688681]  _copy_from_iter_full+0x1ee/0x270
[  816.688725]  tcp_sendmsg_locked+0x5aa/0xdc0
[  816.688749]  tcp_sendmsg+0x28/0x40
[  816.688769]  sock_sendmsg+0x57/0x60
[  816.688790]  ? sendmsg_unlocked+0x20/0x20
[  816.688811]  __skb_send_sock+0x29b/0x2f0
[  816.688847]  ? linear_to_page+0xf0/0xf0
[  816.688868]  sk_psock_backlog+0x85/0x1d0
[  816.688890]  process_one_work+0x1b0/0x350
[  816.688911]  worker_thread+0x49/0x300

In addition, I am trying to explain the performance degradation of the
ktls + ebpf redirect model compared to the following general proxy
model, both using TLS 1.3 (so both are non-zerocopy), even though the
redirect model saves two cross-kernel data copies. So far I have found
a memcpy_erms hotspot in _copy_from_iter_full() under
tcp_sendmsg_locked() in the redirect_egress step. I suspect that the
copy efficiency for the kvec is not good, but this is just my guess and
I still need to verify it (see the rough userspace sketch at the end of
this mail).

      ktls                          ktls
  sender    proxy_recv    proxy_send    recv
    |           |             |          |
    +-----------+             +----------+

If you have any ideas, I hope you can give me some advice. Thank you!

> IDK what this is trying to do but it certainly depends on the fact
> we run skb_cow_data() and is not "generally correct" :S
> .
>
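Below is the rough userspace sketch mentioned above for the kvec guess.
It is only an approximation, NOT kernel code: an AF_UNIX socketpair
stands in for TCP, and the 512-byte segment size and round count are
arbitrary. It sends the same 16k payload either as one large buffer or
as many small iovec segments, so the kernel-side copy iterates the same
iov_iter machinery with different segment counts.

/* Rough userspace approximation of the kvec guess, not a measurement
 * of the real redirect path. If the writev() loop is clearly slower,
 * per-segment iteration overhead is at least a plausible contributor;
 * it would still need to be confirmed with perf on the real path.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define TOTAL  (16 * 1024)	/* one 16k record worth of payload */
#define SEGSZ  512		/* arbitrary small segment size */
#define NSEG   (TOTAL / SEGSZ)
#define ROUNDS 100000

static double since(const struct timespec *t0)
{
	struct timespec t1;

	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0->tv_sec) + (t1.tv_nsec - t0->tv_nsec) / 1e9;
}

int main(void)
{
	static char buf[TOTAL];
	struct iovec iov[NSEG];
	struct timespec t0;
	int sv[2], i, r;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
		return 1;

	if (fork() == 0) {	/* child just drains the socket */
		char sink[TOTAL];

		close(sv[0]);
		while (read(sv[1], sink, sizeof(sink)) > 0)
			;
		_exit(0);
	}
	close(sv[1]);

	for (i = 0; i < NSEG; i++) {
		iov[i].iov_base = buf + i * SEGSZ;
		iov[i].iov_len = SEGSZ;
	}

	/* One big copy per send, like a userspace proxy's write(). */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (r = 0; r < ROUNDS; r++)
		if (write(sv[0], buf, TOTAL) != TOTAL)
			return 1;
	printf("1 x %5d bytes: %.2fs\n", TOTAL, since(&t0));

	/* Same payload as many small segments per send. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (r = 0; r < ROUNDS; r++)
		if (writev(sv[0], iov, NSEG) != TOTAL)
			return 1;
	printf("%d x %5d bytes: %.2fs\n", NSEG, SEGSZ, since(&t0));

	close(sv[0]);
	return 0;
}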