Jakub Sitnicki wrote:
> On Wed, Jun 03, 2020 at 08:13 AM CEST, Eric Dumazet wrote:
> > On 3/10/20 9:41 AM, John Fastabend wrote:
> >> The bucket->lock is not needed in the sock_hash_free and sock_map_free
> >> calls, in fact it is causing a splat due to being inside rcu block.
> >> [...]
> >> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> >> index 085cef5..b70c844 100644
> >> --- a/net/core/sock_map.c
> >> +++ b/net/core/sock_map.c
> >> @@ -233,8 +233,11 @@ static void sock_map_free(struct bpf_map *map)
> >>  	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
> >>  	int i;
> >>
> >> +	/* After the sync no updates or deletes will be in-flight so it
> >> +	 * is safe to walk map and remove entries without risking a race
> >> +	 * in EEXIST update case.
> >
> > What prevents other cpus from deleting stuff in sock_hash_delete_elem() ?
> >
> > What state has been changed before the synchronize_rcu() call here,
> > that other cpus check before attempting a delete ?
> >
> > Typically, synchronize_rcu() only makes sense if readers cannot start
> > a new cycle.
> >
> > A possible fix would be to check in sock_hash_delete_elem() (and
> > possibly other methods) if map->refcnt is not zero.
> >
> > syzbot found (no repro yet):
> >
> > general protection fault, probably for non-canonical address 0xfbd59c0000000024: 0000 [#1] PREEMPT SMP KASAN
> > KASAN: maybe wild-memory-access in range [0xdead000000000120-0xdead000000000127]
> > CPU: 2 PID: 14305 Comm: kworker/2:3 Not tainted 5.7.0-syzkaller #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> > Workqueue: events bpf_map_free_deferred
> > RIP: 0010:__write_once_size include/linux/compiler.h:279 [inline]
> > RIP: 0010:__hlist_del include/linux/list.h:811 [inline]
> > RIP: 0010:hlist_del_rcu include/linux/rculist.h:485 [inline]
> > RIP: 0010:sock_hash_free+0x202/0x4a0 net/core/sock_map.c:1021
> > Code: 0f 85 15 02 00 00 4c 8d 7b 28 4c 8b 63 20 4c 89 f8 48 c1 e8 03 80 3c 28 00 0f 85 47 02 00 00 4c 8b 6b 28 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 25 02 00 00 4d 85 e4 4d 89 65 00 74 20 e8 f6 82
> > [...]

Thanks Eric.

> My initial reasoning behind the change was that sock_hash_delete_elem()
> callers hold a ref to the sockhash [0], either because there is an open
> FD for the map, or because the map is in use by a loaded BPF program.
> The same applies to updates.
>
> If that holds, map->refcnt is > 0, and we should not see the map being
> freed at the same time as sock_hash_delete_elem() happens.
>
> But then there is also sock_hash_delete_from_link(), which deletes from
> the sockhash when a sock/psock unlinks itself from the map. This
> operation happens without holding a ref to the map, so that sockets
> won't keep the map "alive". There is no corresponding
> *_update_from_link() for updates without holding a ref.

Yep, we missed this case :/

> Sadly, I can't spot anything preventing the list mutation,
> hlist_del_rcu(), from happening in both sock_hash_delete_elem() and
> sock_hash_free() concurrently, now that the bucket spin-lock doesn't
> protect it any more. That is what I understand syzbot is reporting.

Agreed.

> synchronize_rcu() before we walk the htable doesn't rule it out,
> because, as you point out, new readers can start a new cycle, and we
> don't change any state that would signal that the map is going away.
>
> I'm not sure that the check for map->refcnt when a sock is unlinking
> itself from the map will do it. I worry we will then have issues when
> the sockhash is unlinking itself from socks (so the other way around)
> in sock_hash_free(). We could no longer assume that the sock & psock
> exist.
>
> What comes to mind is to reintroduce the spin-lock protected critical
> section in sock_hash_free(), but delay the processing of sockets to be
> unlinked from the sockhash. We could grab a ref to the sk_psock while
> holding the spin-lock and unlink it while no longer in the atomic
> critical section.

It seems so. In sock_hash_free() we logically need,

	for (i = 0; i < htab->buckets_num; i++) {
		hlist_for_each_entry_safe(...) {
			hlist_del_rcu()   <- detached from bucket, no longer reachable
			synchronize_rcu() // now element can not be reached from unhash()
			...
			sock_map_unref(elem->sk, elem)
			...
		}
	}

We don't actually want to stick a synchronize_rcu() in that inner loop,
so I agree we need to collect the elements, do a single sync, and then
remove them.
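Something along these lines maybe. Completely untested sketch, and the
unlink_list local plus the per-bucket synchronize_rcu() placement are
just illustrative (we could also batch all buckets before a single
grace period), but it reuses the helpers we already have in
net/core/sock_map.c:

	static void sock_hash_free(struct bpf_map *map)
	{
		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
		struct bpf_htab_bucket *bucket;
		struct hlist_head unlink_list;
		struct bpf_htab_elem *elem;
		struct hlist_node *node;
		int i;

		for (i = 0; i < htab->buckets_num; i++) {
			bucket = sock_hash_select_bucket(htab, i);

			/* Detach the whole bucket under the bucket lock so
			 * we can't race with sock_hash_delete_from_link()
			 * mutating the list. While the link still exists the
			 * psock holds a sock ref, so it is safe to grab our
			 * own here.
			 */
			raw_spin_lock_bh(&bucket->lock);
			hlist_for_each_entry(elem, &bucket->head, node)
				sock_hold(elem->sk);
			hlist_move_list(&bucket->head, &unlink_list);
			raw_spin_unlock_bh(&bucket->lock);

			/* Elements are off the bucket list; after the grace
			 * period no deleter can reach them from unhash()
			 * either.
			 */
			synchronize_rcu();

			/* Unlink out of atomic context so we can block on
			 * the socket lock before dropping the psock link.
			 */
			hlist_for_each_entry_safe(elem, node, &unlink_list, node) {
				hlist_del(&elem->node);
				lock_sock(elem->sk);
				rcu_read_lock();
				sock_map_unref(elem->sk, elem);
				rcu_read_unlock();
				release_sock(elem->sk);
				sock_put(elem->sk);
				sock_hash_free_elem(htab, elem);
			}
		}

		/* rest of teardown unchanged */
		...
	}

Taking the sock refs inside the bucket lock is your point above: as long
as the element is still linked, the psock pins the socket, so the ref is
safe to take there.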
> Either way, Eric, thank you for the report and the pointers.

+1

> John, WDYT? Want to give it a try? Or I can draft something.
>
> [0] https://lore.kernel.org/netdev/8736boor55.fsf@xxxxxxxxxxxxxx/

[...]