Jakub Sitnicki wrote:
> On Wed, Jun 03, 2020 at 08:13 AM CEST, Eric Dumazet wrote:
> > On 3/10/20 9:41 AM, John Fastabend wrote:
> >> The bucket->lock is not needed in the sock_hash_free and sock_map_free
> >> calls, in fact it is causing a splat due to being inside rcu block.
> >> [...]
> >> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> >> index 085cef5..b70c844 100644
> >> --- a/net/core/sock_map.c
> >> +++ b/net/core/sock_map.c
> >> @@ -233,8 +233,11 @@ static void sock_map_free(struct bpf_map *map)
> >>  	struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
> >>  	int i;
> >>
> >> +	/* After the sync no updates or deletes will be in-flight so it
> >> +	 * is safe to walk map and remove entries without risking a race
> >> +	 * in EEXIST update case.
> >
> > What prevents other cpus from deleting stuff in sock_hash_delete_elem() ?
> >
> > What state has been changed before the synchronize_rcu() call here,
> > that other cpus check before attempting a delete ?
> >
> > Typically, synchronize_rcu() only makes sense if readers cannot start
> > a new cycle.
> >
> > A possible fix would be to check in sock_hash_delete_elem() (and
> > possibly other methods) if map->refcnt is not zero.
> >
> > syzbot found (no repro yet):
> >
> > general protection fault, probably for non-canonical address 0xfbd59c0000000024: 0000 [#1] PREEMPT SMP KASAN
> > KASAN: maybe wild-memory-access in range [0xdead000000000120-0xdead000000000127]
> > CPU: 2 PID: 14305 Comm: kworker/2:3 Not tainted 5.7.0-syzkaller #0
> > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> > Workqueue: events bpf_map_free_deferred
> > RIP: 0010:__write_once_size include/linux/compiler.h:279 [inline]
> > RIP: 0010:__hlist_del include/linux/list.h:811 [inline]
> > RIP: 0010:hlist_del_rcu include/linux/rculist.h:485 [inline]
> > RIP: 0010:sock_hash_free+0x202/0x4a0 net/core/sock_map.c:1021
> > Code: 0f 85 15 02 00 00 4c 8d 7b 28 4c 8b 63 20 4c 89 f8 48 c1 e8 03 80 3c 28 00 0f 85 47 02 00 00 4c 8b 6b 28 4c 89 e8 48 c1 e8 03 <80> 3c 28 00 0f 85 25 02 00 00 4d 85 e4 4d 89 65 00 74 20 e8 f6 82
> > [...]

Thanks Eric.

> My initial reasoning behind the change was that sock_hash_delete_elem()
> callers hold a ref to the sockhash [0], either because there is an open
> FD for the map, or because the map is in use by a loaded BPF program.
> The same applies to updates.
>
> If that holds, map->refcnt is > 0, and we should not see the map being
> freed at the same time as sock_hash_delete_elem() happens.
>
> But then there is also sock_hash_delete_from_link(), which deletes from
> the sockhash when a sock/psock unlinks itself from the map. This
> operation happens without holding a ref to the map, so that sockets
> won't keep the map "alive". There is no corresponding
> *_update_from_link() for updates without holding a ref.

Yep, we missed this case :/

> Sadly, I can't spot anything preventing the list mutation,
> hlist_del_rcu(), from happening in both sock_hash_delete_elem() and
> sock_hash_free() concurrently, now that the bucket spin-lock doesn't
> protect it any more. That is what I understand syzbot is reporting.

Agreed.

> synchronize_rcu() before we walk the htable doesn't rule it out,
> because, as you point out, new readers can start a new cycle, and we
> don't change any state that would signal that the map is going away.
>
> I'm not sure that the check for map->refcnt when a sock is unlinking
> itself from the map will do it. I worry we will then have issues when
> the sockhash is unlinking itself from socks (so the other way around)
> in sock_hash_free(). We could no longer assume that the sock & psock
> exist.
>
> What comes to mind is to reintroduce the spin-lock protected critical
> section in sock_hash_free(), but delay the processing of sockets to be
> unlinked from the sockhash. We could grab a ref to the sk_psock while
> holding the spin-lock and unlink it while no longer in the atomic
> critical section.

It seems so. In sock_hash_free() we logically need,

	for (i = 0; i < htab->buckets_num; i++) {
		hlist_for_each_entry_safe(...) {
			hlist_del_rcu()   <- detached from bucket, no longer reachable
			synchronize_rcu() // now element can not be reached from unhash()
			...
			sock_map_unref(elem->sk, elem)
			...
		}
	}

We don't actually want to stick a synchronize_rcu() in that inner loop,
so I agree we need to collect the elements, do a single sync, and then
remove them.
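Something along these lines maybe. Completely untested sketch, and the
unlink_list local plus the per-bucket synchronize_rcu() placement are
just illustrative (we could also batch all buckets before a single
grace period), but it reuses the helpers we already have in
net/core/sock_map.c:

	static void sock_hash_free(struct bpf_map *map)
	{
		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
		struct bpf_htab_bucket *bucket;
		struct hlist_head unlink_list;
		struct bpf_htab_elem *elem;
		struct hlist_node *node;
		int i;

		for (i = 0; i < htab->buckets_num; i++) {
			bucket = sock_hash_select_bucket(htab, i);

			/* Detach the whole bucket under the bucket lock so
			 * we can't race with sock_hash_delete_from_link()
			 * mutating the list. While the link still exists the
			 * psock holds a sock ref, so it is safe to grab our
			 * own here.
			 */
			raw_spin_lock_bh(&bucket->lock);
			hlist_for_each_entry(elem, &bucket->head, node)
				sock_hold(elem->sk);
			hlist_move_list(&bucket->head, &unlink_list);
			raw_spin_unlock_bh(&bucket->lock);

			/* Elements are off the bucket list; after the grace
			 * period no deleter can reach them from unhash()
			 * either.
			 */
			synchronize_rcu();

			/* Unlink out of atomic context so we can block on
			 * the socket lock before dropping the psock link.
			 */
			hlist_for_each_entry_safe(elem, node, &unlink_list, node) {
				hlist_del(&elem->node);
				lock_sock(elem->sk);
				rcu_read_lock();
				sock_map_unref(elem->sk, elem);
				rcu_read_unlock();
				release_sock(elem->sk);
				sock_put(elem->sk);
				sock_hash_free_elem(htab, elem);
			}
		}

		/* rest of teardown unchanged */
		...
	}

Taking the sock refs inside the bucket lock is your point above: as long
as the element is still linked, the psock pins the socket, so the ref is
safe to take there.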
> Either way, Eric, thank you for the report and the pointers.

+1

> John, WDYT? Want to give it a try? Or I can draft something.
>
> [0] https://lore.kernel.org/netdev/8736boor55.fsf@xxxxxxxxxxxxxx/

[...]