On 2022/05/03 22:45, Eric Dumazet wrote: >> This looks equivalent to the fix presented here: >> >> https://lore.kernel.org/all/CANn89i+484ffqb93aQm1N-tjxxvb3WDKX0EbD7318RwRgsatjw@xxxxxxxxxxxxxx/ I retested the fix above using unshare -n sh -c ' ip link set lo up iptables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP ip6tables -A OUTPUT -p tcp --sport 16385 --tcp-flags SYN NONE -m state --state ESTABLISHED,RELATED -j DROP telnet 127.0.0.1 16385 dmesg -c netstat -tanpe' < /dev/null as a test case, but it seems racy; sometimes timer function is called again and crashes. [ 426.086565][ C2] general protection fault, probably for non-canonical address 0x6b6af3ebcc3b6bc3: 0000 [#1] PREEMPT SMP KASAN [ 426.096339][ C2] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.18.0-rc5-dirty #807 [ 426.103769][ C2] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 426.111851][ C2] RIP: 0010:__tcp_transmit_skb+0xe72/0x1b80 [ 426.117512][ C2] Code: e8 b3 ea dc fd 48 8d 7d 30 45 0f b7 77 30 e8 95 ec dc fd 48 8b 5d 30 48 8d bb b8 02 00 00 e8 85 ec dc fd 48 8b 83 b8 02 00 00 <65> 4c 01 70 58 e9 67 fd ff ff e8 ef 56 ac fd 48 8d bd d0 09 00 00 [ 426.124692][ C2] RSP: 0018:ffff888060d09ac8 EFLAGS: 00010246 [ 426.126845][ C2] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8880145c8000 RCX: ffffffff838cc28b [ 426.129616][ C2] RDX: dffffc0000000000 RSI: 0000000000000001 RDI: ffff8880145c82b8 [ 426.132374][ C2] RBP: ffff8880129f8000 R08: 0000000000000000 R09: 0000000000000007 [ 426.135077][ C2] R10: ffffffff838cbfd4 R11: 0000000000000001 R12: ffff8880129f8760 [ 426.137793][ C2] R13: ffff88800f6e0118 R14: 0000000000000001 R15: ffff88800f6e00e8 [ 426.140489][ C2] FS: 0000000000000000(0000) GS:ffff888060d00000(0000) knlGS:0000000000000000 [ 426.143525][ C2] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 426.145792][ C2] CR2: 000055b5bb0adabc CR3: 000000000e003000 CR4: 00000000000506e0 [ 426.148509][ C2] Call Trace: [ 426.149442][ C2] <IRQ> [ 426.150183][ C2] ? __tcp_select_window+0x710/0x710 [ 426.151457][ C2] ? __sanitizer_cov_trace_cmp4+0x1c/0x70 [ 426.153007][ C2] ? tcp_current_mss+0x165/0x280 [ 426.154245][ C2] ? tcp_trim_head+0x300/0x300 [ 426.155396][ C2] ? find_held_lock+0x85/0xa0 [ 426.156734][ C2] ? mark_held_locks+0x65/0x90 [ 426.157967][ C2] tcp_write_wakeup+0x2e2/0x340 [ 426.159149][ C2] tcp_send_probe0+0x2a/0x2c0 [ 426.160368][ C2] tcp_write_timer_handler+0x5cb/0x670 [ 426.161740][ C2] tcp_write_timer+0x86/0x250 [ 426.162896][ C2] ? tcp_write_timer_handler+0x670/0x670 [ 426.164285][ C2] call_timer_fn+0x15d/0x5f0 [ 426.165481][ C2] ? add_timer_on+0x2e0/0x2e0 [ 426.166667][ C2] ? lock_downgrade+0x3c0/0x3c0 [ 426.167921][ C2] ? mark_held_locks+0x24/0x90 [ 426.169263][ C2] ? _raw_spin_unlock_irq+0x1f/0x40 [ 426.170564][ C2] ? tcp_write_timer_handler+0x670/0x670 [ 426.171920][ C2] __run_timers.part.0+0x523/0x740 [ 426.173181][ C2] ? call_timer_fn+0x5f0/0x5f0 [ 426.174321][ C2] ? pvclock_clocksource_read+0xdc/0x1a0 [ 426.175655][ C2] run_timer_softirq+0x66/0xe0 [ 426.176825][ C2] __do_softirq+0x1c2/0x670 [ 426.177944][ C2] __irq_exit_rcu+0xf8/0x140 [ 426.179120][ C2] irq_exit_rcu+0x5/0x20 [ 426.180150][ C2] sysvec_apic_timer_interrupt+0x8e/0xc0 [ 426.181486][ C2] </IRQ> [ 426.182180][ C2] <TASK> [ 426.182845][ C2] asm_sysvec_apic_timer_interrupt+0x12/0x20 > > I think this is still needed for layers (NFS ?) that dismantle their > TCP sockets whenever a netns > is dismantled. But RDS case was different, only the listener is a kernel socket. We can't apply the fix above. I think that the fundamental problem is that we use net->ns.count for both "avoiding use-after-free" purpose and "allowing dismantle from user event" purpose. Why not to use separated counters?