Re: Detailed report on SMB-build lockups [seems that it is locking problem in networking code] (2.4.0-test2-ac2 and later)

"Andi Kleen" <ak@suse.de> · Tue, 11 Jul 2000 13:32:04 +0200

[cc list trimmed a bit]

On Tue, Jul 11, 2000 at 12:50:32PM +0200, Alexander Demenshin wrote:
> 		- Traffic generator used on _local_ interface:
> 		
> 			> A lot of fragmented packets:
> 			
> 				ifconfig lo mtu 256
> 				ping -f -s 8192 127.0.0.1
> 				
> 			> A lot of TCP traffic (connect/transfer/disconnect);
> 			> MTU does not matter.
> 			
> 	In my tests I used the following rules for iptables:
> 	
> 		iptables -t mangle -A PREROUTING -j QUEUE
> 		iptables -t mangle -A OUTPUT     -j QUEUE
> 		
> 	I assume there are no other rules; but the problem occurs _only_
> 	when QUEUE target is in effect - other rules does not matter as long
> 	as there is no QUEUE targets or if packets are not accepted in userspace.

The only thing I can see in ipqueue is that it turns off local bottom halves
for a long time during packet receive. That could well expose other
races.

> 	In case if I use table 'filter' it also occurs (so nothing magical
> 	in 'mangle' table).
> 	
> 	So, once rules above are in effect, userspace module is running, and after
> 	certain period of time running traffic generator system lockup occurs
> 	(in my case - after processing of ca. 300K packets; but it depends - 
> 	be patient :).
> 	
> 	No OOPs, no other kernel messages, _nothing_ except SysRq is active.
> 	
> 	Examining of code under EIP shows, that lockup occurs at:
> 	
> 		- In case of TCP traffic:
> 		
> 			src/net/ipv4/tcp_timer.c:690
> 			
> --- src/net/ipv4/tcp_timer.c:690 tcp_synack_timer() ---
>                                 /* Drop this request */
>                                 write_lock(&tp->syn_wait_lock);		/* <<< AT THIS PLACE */

This one is strange. Any chance of getting a multi-CPU backtrace for this?
(Install kdb from oss.sgi.com:/projects/kdb/, press Pause during a hang,
enter bt, then switch to the other CPUs using the cpu command and backtrace
them too.)

>                                 *reqp = req->dl_next;
>                                 write_unlock(&tp->syn_wait_lock);
> 
> --- CUT ---
> 
> 		- In case of ICMP (fragmented) traffic:
> 		
> --- src/net/ipv4/ip_fragment:202 ip_expire ---
>         spin_lock(&ipfrag_lock);					/* <<< AT THIS PLACE */

The fragment locking is known to be buggy. It should be fixed in 2.4.0pre3.
Also, there was a NAT bug where ip_defrag was called without bottom halves
disabled, which could cause deadlocks too, but that should already be fixed
(all ip_defrag calls in netfilter/* should be guarded by
local_bh_disable()/local_bh_enable()).
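
To illustrate why the guard matters, here is a rough userspace model of a
single CPU. It is not kernel code: every model_* name is made up for this
sketch, and it assumes the deadlock comes from a timer bottom half (such as
ip_expire) firing on the same CPU while process context holds the same
non-recursive lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of ONE CPU. Hypothetical names, not the kernel API. */
struct cpu_model {
    int  bh_disabled;     /* nesting depth of "bottom halves disabled" */
    bool lock_held;       /* models a non-recursive spinlock like ipfrag_lock */
    bool softirq_pending; /* a bottom half was raised while disabled */
    bool deadlocked;      /* handler found its own CPU holding the lock */
};

/* Models a timer handler (ip_expire-like) that needs the same lock. */
static void model_softirq_handler(struct cpu_model *c)
{
    if (c->lock_held) {
        /* A real spin_lock() would spin here forever: the holder is the
         * code this handler interrupted, so it can never release. */
        c->deadlocked = true;
        return;
    }
    c->lock_held = true;   /* take the lock */
    c->lock_held = false;  /* and release it again */
}

/* The bottom half runs at once if enabled, otherwise it is deferred. */
static void model_raise_softirq(struct cpu_model *c)
{
    if (c->bh_disabled > 0) {
        c->softirq_pending = true;
        return;
    }
    model_softirq_handler(c);
}

static void model_bh_disable(struct cpu_model *c)
{
    c->bh_disabled++;
}

static void model_bh_enable(struct cpu_model *c)
{
    assert(c->bh_disabled > 0);
    if (--c->bh_disabled == 0 && c->softirq_pending) {
        c->softirq_pending = false;
        model_softirq_handler(c);  /* deferred work runs now */
    }
}

/* Process-context caller of a defrag-like routine.
 * Returns true if the model detected a self-deadlock. */
bool model_defrag_call(bool guarded)
{
    struct cpu_model c = {0};

    if (guarded)
        model_bh_disable(&c);
    c.lock_held = true;        /* lock taken in process context */
    model_raise_softirq(&c);   /* timer fires while the lock is held */
    c.lock_held = false;       /* lock released */
    if (guarded)
        model_bh_enable(&c);

    return c.deadlocked;
}
```

In this model, model_defrag_call(false) reports the self-deadlock, while
model_defrag_call(true) does not, because the handler is deferred until
after the lock has been dropped.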

-Andi