Re: [PATCH bpf-next v3 1/2] net: bpf: Always call BPF cgroup filters for egress.

Daniel Borkmann <daniel@xxxxxxxxxxxxx> · Fri, 23 Jun 2023 10:50:48 +0200

On 6/23/23 1:55 AM, Kui-Feng Lee wrote:
On 6/22/23 13:06, Daniel Borkmann wrote:
On 6/22/23 8:28 PM, Yonghong Song wrote:
On 6/22/23 10:15 AM, Kui-Feng Lee wrote:
On 6/21/23 20:37, Yonghong Song wrote:
On 6/20/23 10:14 AM, Kui-Feng Lee wrote:
Always call BPF filters if CGROUP BPF is enabled for EGRESS without
checking skb->sk against sk.

The filters were called only if skb is owned by the sock that the
skb is sent out through.  In another words, skb->sk should point to
the sock that it is sending through its egress.  However, the filters would
miss SYNACK skbs that they are owned by a request_sock but sent through
the listening sock, that is the socket listening incoming connections.
This is an unnecessary restrict.

The original patch which introduced 'sk == skb->sk' is
   3007098494be  cgroup: add support for eBPF programs
There are no mentioning in commit message why 'sk == skb->sk'
is needed. So it is possible that this is just restricted
for use cases at that moment. Now there are use cases
where 'sk != skb->sk' so removing this check can enable
the new use case. Maybe you can add this into your commit
message so people can understand the history of 'sk == skb->sk'.

After checking the code and the Alexei's comment[1] again, this check
may be different from what I thought. In another post[2],
Daniel Borkmann mentioned

     Wouldn't that mean however, when you go through stacked devices that
     you'd run the same eBPF cgroup program for skb->sk multiple times?

I read this paragraph several times.
This check ensures the filters are only called for the device on
the top of a stack.  So, I probably should change the check to

     sk == skb_to_full_sk(skb)

I think this should work. It exactly covers your use case:
   they are owned by a request_sock but sent through
   the listening sock, that is the socket listening incoming connections
and sk == skb->sk for non request_sock/listening_sock case.

Just a thought, should the test look like the below?

         int __ret = 0;                                                         \
         if (cgroup_bpf_enabled(CGROUP_INET_EGRESS) && sk) {                    \
                 typeof(sk) __sk = sk_to_full_sk(sk);                           \
                 if (sk_fullsock(__sk) && __sk == skb_to_full_sk(skb) &&        \
                     cgroup_bpf_sock_enabled(__sk, CGROUP_INET_EGRESS))         \
                         __ret = __cgroup_bpf_run_filter_skb(__sk, skb,         \
CGROUP_INET_EGRESS); \
}                                                                      \

Iow, we do already convert __sk to full sk, so we should then also use that
for the test with skb_to_full_sk(skb).

Agree!

It would also be useful to do an in-depth analysis for the commit msg in which
cases the sk == skb->sk matches and sk was not a full sock (but __sk is) given
the __sk = sk_to_full_sk(sk) exists in the code to document which situation this
is covering in the existing code (... perhaps it used to work back then for
synack just that later changes altered it without anyone noticing until now).

Thanks,
Daniel