Re: Checksum problem on sparc64

Hello,

Some time ago, I sent a message about bad TCP checksums in some packets.
I have since found the free time to track down the bug. Here is the story:

The bug is triggered by tcp_fragment() in net/ipv4/tcp_output.c (I also
had problems with IPv6, but have not yet checked whether the origin of
the problem is the same; it probably is). The code involved is:

if (!skb_shinfo(skb)->nr_frags && skb->ip_summed != CHECKSUM_HW) {
	/* Copy and checksum data tail into the new buffer. */
	buff->csum = csum_partial_copy_nocheck(skb->data + len, skb_put(buff, nsize), nsize, 0);

	skb_trim(skb, len);

	skb->csum = csum_block_sub(skb->csum, buff->csum, len);
}

As many Sun machines have NICs that support hardware TCP/UDP
checksumming, you need to force Linux to use the software
implementation in order to reproduce this. One way to trigger the bug is
to transfer a large amount of data between two hosts on a LAN and, in
the middle of the transfer, lower the MTU of the sparc interface, so
that tcp_fragment() gets called. Depending on the content of the
packets, you may or may not get a bad checksum (in practice it looks
pseudo-random). When capturing packets with tcpdump, don't forget to use
-s 0, so that tcpdump captures the full content of each packet, which it
needs in order to compute and check the TCP checksum.

csum_block_sub() in turn calls csum_sub() and csum_add(). In csum_sub(),
the second parameter passed to csum_block_sub() is complemented before
being passed to csum_add() (in one's complement arithmetic, subtracting
is adding the complement). But on sparc64, the csum_partial_xxx
functions return 16-bit values (held in 32-bit words so that there is
room for a carry bit). Complementing a word whose upper 16 bits are
clear sets them, so the value returned by csum_block_sub() may have the
16 MSBs set. Later, it is used in tcp_v4_send_check()
(net/ipv4/tcp_ipv4.c), as the initial sum passed to csum_partial():
th->check = tcp_v4_check(th, len, inet->saddr, inet->daddr,
                         csum_partial((char *)th,
                                      th->doff << 2,
                                      skb->csum));
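
To make the complement step concrete, here is a minimal user-space
sketch. csum_add() and csum_sub() below are stand-ins modeled on the
helpers named above, not the kernel source, and the operand values in
check_csum_sub_example() are made up for illustration. It only shows
that subtracting a 16-bit partial sum can leave the upper 16 bits of
the result set.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit one's complement add: a carry out of bit 31 wraps to bit 0. */
static uint32_t csum_add(uint32_t csum, uint32_t addend)
{
	csum += addend;
	return csum + (csum < addend);
}

/* In one's complement arithmetic, subtracting is adding the complement. */
static uint32_t csum_sub(uint32_t csum, uint32_t addend)
{
	return csum_add(csum, ~addend);
}

/* Illustrative values only: both inputs fit in 16 bits. */
static void check_csum_sub_example(void)
{
	uint32_t r = csum_sub(0x1000, 0x6432);

	/* ~0x6432 is 0xffff9bcd; no carry occurs, so the high half stays set. */
	assert(r == 0xffffabcd);
	assert((r >> 16) == 0xffff);
}
```

Such a value is still a valid partial sum as long as every later
addition wraps its carry around.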

After lots of checks, it seems that the value returned by csum_partial()
is causing the problem. When skb->csum has the 16 MSBs set,
csum_partial() fails to add back a carry bit. For example:
skb->csum = 0xffffabcd
tcphdr_csum = 0xdcba
skb->csum + tcphdr_csum = 0x100008887

The most significant bit of that sum is the 33rd bit, and it is
apparently silently dropped. The partial checksum is then 1 less than it
should be. Since the final checksum is the complement of that partial
checksum, the value placed in the packet is 1 more than it should be.
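
The arithmetic can be checked with a small user-space sketch. This is
not the sparc64 code; the function names are hypothetical, and the
"lossy" variant simply assumes the carry out of bit 31 is discarded, as
described above. The values are the ones from the example.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit partial sum to 16 bits, wrapping carries around. */
static uint16_t fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Assumed buggy behaviour: plain 32-bit add, the 33rd bit is lost. */
static uint16_t add_then_fold_lossy(uint32_t csum, uint32_t hdr)
{
	return fold32(csum + hdr);
}

/* Add in 64 bits and wrap the carry back in before folding. */
static uint16_t add_then_fold_carry(uint32_t csum, uint32_t hdr)
{
	uint64_t wide = (uint64_t)csum + hdr;
	uint32_t sum = (uint32_t)wide + (uint32_t)(wide >> 32);

	return fold32(sum);
}

static void check_lost_carry_example(void)
{
	uint16_t lossy = add_then_fold_lossy(0xffffabcd, 0xdcba);
	uint16_t good = add_then_fold_carry(0xffffabcd, 0xdcba);

	assert(lossy == 0x8887);	/* 1 less than it should be */
	assert(good == 0x8888);
	/* After complementing, the lossy value is 1 more. */
	assert((uint16_t)~lossy == (uint16_t)(~good + 1));
}
```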

Here is the tcpdump output from my last message:

13:14:30.397874 > 0800 1416: IP (tos 0x8, ttl  63, id 26859, offset 0, flags
[DF], length: 1400) 10.0.0.2.22 > 84.96.34.158.59002: . [bad tcp cksum 696d
(->696c)!] 3016044:3017392(1348) ack 7729 win 6788 <nop,nop,timestamp 34000990
1795720>

You can clearly see that the packet checksum (696d) is 1 more than the
checksum calculated by tcpdump (696c).

While studying the sparc64-specific csum_partial() assembly function
(arch/sparc64/lib/checksum.S), I noticed that a 64-bit register is used
to compute the checksum of the given data. But when adding the sum
parameter to that computed value, the code folds directly from 32 bits
to 16 bits. This is where the carry bit is lost. Unfortunately, I'm not
skilled enough in sparc64 assembly language to provide a working patch
(my tests used a C version of csum_partial()), but I guess it won't
be difficult for any hacker around (say David Miller :-)) to fix this.
After all, it's only about a carry bit. Just a little bit...
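
For reference, here is a rough C sketch of a csum_partial()-like
function that keeps the carry. It is neither the version used in my
tests nor a proposed patch; csum_partial_c is a hypothetical name, and
the sketch assumes it is enough to seed a 64-bit accumulator with the
sum parameter and fold from 64 to 32 bits with end-around carry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space stand-in for csum_partial(). */
static uint32_t csum_partial_c(const void *buf, size_t len, uint32_t sum)
{
	const uint8_t *p = buf;
	uint64_t acc = sum;	/* seed in 64 bits so no carry is lost */
	uint16_t w;

	while (len >= 2) {
		memcpy(&w, p, sizeof(w));	/* native-order 16-bit word */
		acc += w;
		p += 2;
		len -= 2;
	}
	if (len) {
		/* Trailing odd byte, padded with a zero byte. */
		uint8_t tail[2] = { p[0], 0 };

		memcpy(&w, tail, sizeof(w));
		acc += w;
	}

	/* Fold 64 -> 32 bits, wrapping carries around. */
	acc = (acc & 0xffffffffu) + (acc >> 32);
	acc = (acc & 0xffffffffu) + (acc >> 32);
	return (uint32_t)acc;
}

static void check_csum_partial_example(void)
{
	/* One 16-bit word standing in for the summed TCP header. */
	uint16_t hdr[1] = { 0xdcba };

	/* 0xffffabcd + 0xdcba = 0x100008887; the 33rd bit wraps to bit 0. */
	assert(csum_partial_c(hdr, sizeof(hdr), 0xffffabcd) == 0x8888);
}
```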

Hope it helps :-).

-- 
Richard Braun
