Thanks to your helpful replies, I think I have now reached a rather
consistent understanding of how the checksum-offloading and
scatter-gather features of the Linux networking stack operate. What
follows is the scatter-gather description I sent a couple of days ago,
polished up a bit and extended to also cover the checksum-offloading
issues. Thanks again, and sorry for my---sometimes silly---questions.

Please excuse my linguistic ... atrocities (English is not my native
tongue) and feel free to ask questions, send corrections and
suggestions, comment, or rant about it.

Here it goes...

**
** Features of a Networking Driver / Device
**

The "net_device" structure (defined in "include/linux/netdevice.h"),
which is filled in by a net driver at initialization time, includes a
field called "features". By setting certain bits in this field the
driver can inform the networking stack of its capabilities. As of
2.4.20 the following feature masks are defined (in
"include/linux/netdevice.h") and can be declared by the driver:

  NETIF_F_SG               Scatter/gather IO.
  NETIF_F_IP_CSUM          Can checksum only TCP/UDP over IPv4.
  NETIF_F_NO_CSUM          Does not require checksum. F.e. loopback.
  NETIF_F_HW_CSUM          Can checksum all the packets.
  NETIF_F_DYNALLOC         Self-destructible device.
  NETIF_F_HIGHDMA          Can DMA to high memory.
  NETIF_F_FRAGLIST         Scatter/gather IO.
  NETIF_F_HW_VLAN_TX       Transmit VLAN hw acceleration
  NETIF_F_HW_VLAN_RX       Receive VLAN hw acceleration
  NETIF_F_HW_VLAN_FILTER   Receive filtering on VLAN
  NETIF_F_VLAN_CHALLENGED  Device cannot handle VLAN packets

What follows is a rather detailed description of what the following
features imply for device-driver writers:

  NETIF_F_SG
  NETIF_F_NO_CSUM
  NETIF_F_IP_CSUM
  NETIF_F_HW_CSUM

In summary: NETIF_F_SG must be enabled by drivers that are willing and
able to handle "skb"s whose data are not physically contiguous (i.e.
that are fragmented).
NETIF_F_NO_CSUM must be enabled by drivers servicing communication
paths that are by nature reliable, so checksums are not needed to
protect the data from transmission errors. NETIF_F_IP_CSUM is for
drivers and devices that can perform (presumably hardware-assisted)
checksum calculations *only* for TCP and UDP packets over IPv4.
Finally, NETIF_F_HW_CSUM is for drivers and devices that can perform
hardware-assisted checksum calculations for all kinds of packets.

**
** Scatter-Gather: the "NETIF_F_SG" feature
**

Among the feature bits shown above, "NETIF_F_SG" is the one the driver
sets to indicate that it can do scatter-gather packet processing. If
"NETIF_F_SG" is not set, then the networking stack will make sure that
the "skb"s hold *physically contiguous* data before passing them to
the driver. This is taken care of in "net/core/dev.c:dev_queue_xmit()"
like this:

    if (skb_shinfo(skb)->frag_list &&
        !(dev->features&NETIF_F_FRAGLIST) &&
        skb_linearize(skb, GFP_ATOMIC) != 0) {
            kfree_skb(skb);
            return -ENOMEM;
    }

    /* Fragmented skb is linearized if device does not support SG,
     * or if at least one of fragments is in highmem and device
     * does not support DMA from it.
     */
    if (skb_shinfo(skb)->nr_frags &&
        (!(dev->features&NETIF_F_SG) || illegal_highdma(dev, skb)) &&
        skb_linearize(skb, GFP_ATOMIC) != 0) {
            kfree_skb(skb);
            return -ENOMEM;
    }

As a result, when a driver's "hard_start_xmit()" function receives an
skb, it knows that the data to be transmitted start at "skb->data",
that their length is "skb->len", and that they are virtually and
physically contiguous. The driver can therefore pass the "skb->data"
pointer directly to the device's DMA controller, after converting it
to a physical address and synchronizing the relevant cache entries (by
calling something like "pci_map_single()").
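The decision made by the two tests quoted above can be modelled in
user space as a small predicate. This is only a sketch, not kernel
code: the MODEL_F_* bits, "struct model_skb" and
"model_needs_linearize()" are names made up for the illustration, and
the highmem test assumes that "illegal_highdma()" reports a problem
only when the device lacks NETIF_F_HIGHDMA and some fragment page lies
in high memory.

```c
#include <stdbool.h>

/* Stand-in feature bits for this model; the real values live in
 * include/linux/netdevice.h and are deliberately not reproduced. */
enum {
    MODEL_F_SG       = 1u << 0,
    MODEL_F_HIGHDMA  = 1u << 1,
    MODEL_F_FRAGLIST = 1u << 2
};

/* Hypothetical summary of the skb properties the tests look at. */
struct model_skb {
    unsigned int nr_frags;  /* like skb_shinfo(skb)->nr_frags         */
    bool has_frag_list;     /* like skb_shinfo(skb)->frag_list != 0   */
    bool has_highmem_frag;  /* some fragment page is in high memory   */
};

/* Returns true when the stack would have to linearize the skb
 * before handing it to a driver advertising "features". */
bool model_needs_linearize(unsigned int features,
                           const struct model_skb *skb)
{
    /* A chain of skbs needs NETIF_F_FRAGLIST support. */
    if (skb->has_frag_list && !(features & MODEL_F_FRAGLIST))
        return true;

    /* Page fragments need NETIF_F_SG, plus NETIF_F_HIGHDMA if any
     * of the pages cannot be reached by the device otherwise. */
    if (skb->nr_frags &&
        (!(features & MODEL_F_SG) ||
         (skb->has_highmem_frag && !(features & MODEL_F_HIGHDMA))))
        return true;

    return false;
}
```

For instance, an skb with three page fragments needs linearizing for a
driver that declares no features, but not for one that declares the
scatter-gather bit, unless one of the pages is in high memory and the
high-DMA bit is missing.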
If---on the other hand---the driver sets the "NETIF_F_SG" bit in the
"features" field of the "net_device" structure (declaring that it
*can* do scatter-gather DMA), then any skb passed to it might very
well hold data that are not physically contiguous (and sometimes not
even virtually contiguous). In this case, for every "skb" passed to
the driver, the networking stack also fills in a "skb_shared_info"
structure, defined in "include/linux/skbuff.h" like this:

    struct skb_shared_info {
        atomic_t        dataref;
        unsigned int    nr_frags;
        struct sk_buff  *frag_list;
        skb_frag_t      frags[MAX_SKB_FRAGS];
    };

This structure is pointed to by the "end" field of the "sk_buff"
structure, so it can be accessed by the driver as:

    (struct skb_shared_info *)skb->end

or, even better, using the macro "skb_shinfo", which is essentially
the same:

    skb_shinfo(skb)

In the scatter-gather case, the frame to be transmitted consists of a
sequence of fragments (parts), each of which holds a virtually and
physically contiguous subset of the data. The start of the first
fragment is pointed to by "skb->data" (as in the non-SG case), but its
length (in bytes) is "skb->len - skb->data_len" (which can also be
obtained using the macro "skb_headlen()" defined in
"include/linux/skbuff.h"). "skb->len" is still the length of the
*full* frame (the sum of the lengths of all the fragments), and
"skb->data_len" is the total length of all the data fragments, not
counting the first "header" fragment pointed to by "skb->data".
Actually, a way to check whether an skb is *not* physically contiguous
(i.e. whether it is non-linear) is to test whether "skb->data_len" is
non-zero; there is even a macro for this ("skb_is_nonlinear()")
defined in "include/linux/skbuff.h".

After the initial "header" fragment, exactly
"skb_shinfo(skb)->nr_frags" fragments follow. Each of these fragments
is described by a "skb_frag_t" structure defined (in
"include/linux/skbuff.h") as:

    struct skb_frag_struct
    {
        struct page *page;
        __u16 page_offset;
        __u16 size;
    };
    ...
    typedef struct skb_frag_struct skb_frag_t;

The "skb_frag_struct" structure corresponding to the I'th fragment can
be accessed as:

    skb_shinfo(skb)->frags[I];

So the data of a non-linear "sk_buff" "skb" consist of the following
parts, which are themselves linear (virtually and physically
contiguous):

              virtual addr of         virtual addr of
    part #  : first byte          ... last byte
    ------------------------------------------------------------------
    0       : skb->data           ... skb->data
                                        + skb->len - skb->data_len - 1
    1       : fr_adr(0)           ... fr_adr(0) + fr_sz(0) - 1
      .
      .
    nfrags  : fr_adr(nfrags - 1)  ... fr_adr(nfrags - 1)
                                        + fr_sz(nfrags - 1) - 1

where:

    "nfrags" is "skb_shinfo(skb)->nr_frags"

and

    "fr_adr(i)"    is "fr_pg_adr(i) + fr_pg_ofs(i)"
    "fr_pg_adr(i)" is "page_address(skb_shinfo(skb)->frags[i].page)"
    "fr_pg_ofs(i)" is "skb_shinfo(skb)->frags[i].page_offset"
    "fr_sz(i)"     is "skb_shinfo(skb)->frags[i].size"

NOTICE: "fr_adr", "fr_sz", "fr_pg_adr", and "fr_pg_ofs" are just
shorthand introduced to make the discussion easier; they are not
actually defined as macros in the kernel. "page_address", on the other
hand, is a real macro defined in "include/linux/mm.h".

It also holds that:

    skb->data_len == fr_sz(0) + ... + fr_sz(nfrags - 1)

For an example of how these are used in a real driver see
"e100_main.c", and especially the function "e100_prepare_xmit_buff()",
which contains all the details of handling the fragment sequence.

**
** Checksum offloading: the "NETIF_F_??_CSUM" features
**

Checksum offloading is used to relieve the kernel (and thus the CPU)
of the burden of calculating transport-PDU checksums (TCP and UDP
checksums) when the NIC hardware can perform these calculations
itself. It can be used both for downstream (transmitted) packets,
where the NIC calculates the checksum and embeds it in the PDU, and
for upstream (received) packets, where the NIC pre-calculates the
transport-PDU checksum of the received packet and passes it to the
networking stack.
The checksums of lower-layer PDUs do not come into the
checksum-offloading process, since the kernel always calculates the
network-layer checksums (IP header checksums) itself. The two
directions are treated differently by the kernel (and it would be
possible for a device to support checksum offloading in only one of
them), so they are described separately here, starting with the
downstream direction.

** Downstream (transmit-path) checksum offloading

In order for the networking stack to know that a device can, and is
willing to, provide checksum-offloading services downstream, the
driver must---at initialization time---set either the
"NETIF_F_IP_CSUM" or the "NETIF_F_HW_CSUM" flag in the "features"
field of the "net_device" structure.

If the "NETIF_F_HW_CSUM" flag is set, then the kernel will suppress
the calculation of transport-PDU checksums for all kinds of packets,
and will set the "skb->ip_summed" field of the "skb"s delivered to the
driver to "CHECKSUM_HW". Furthermore, the "skb->csum" field will be
set to the byte offset of the checksum field in the transport-PDU
header. In order to help the hardware checksum-calculation unit, the
kernel will also compute the pseudo-header transport-PDU checksum and
store it in the checksum field inside the PDU.

The "NETIF_F_IP_CSUM" case is almost identical; the only difference is
that, for any packet other than the ones carrying TCP and UDP PDUs,
the kernel will instead fully calculate the transport-PDU checksum and
set "skb->ip_summed" to "CHECKSUM_NONE".

Another, more ... radical, checksum-offloading mode can be indicated
by setting the "NETIF_F_NO_CSUM" flag in "dev->features". In this case
the kernel will completely suppress checksum calculation, under the
assumption that the physical medium used by the device is totally
reliable and no checksum protection is required. Furthermore, the
kernel will set the "skb->ip_summed" field of the skbs delivered to
the driver to "CHECKSUM_UNNECESSARY".
This mode must only be used for devices that do not actually transmit
packets to real-world media (e.g. the loopback device).

In summary (from the kernel's viewpoint):

  NETIF_F_NO_CSUM set:
      No checksum is calculated; the medium is considered reliable.

  NETIF_F_IP_CSUM set:
      Only the pseudo-header checksum is calculated for the
      transport PDU of TCP and UDP packets. The full transport-PDU
      checksum is calculated for any other packet.

  NETIF_F_HW_CSUM set:
      Only the pseudo-header transport-PDU checksum is calculated,
      for all packets (where it makes sense).

From a driver's point of view, the following are possible for an "skb"
delivered to it by the stack:

  skb->ip_summed == CHECKSUM_UNNECESSARY:
      skb->h.th->check <-- invalid
      skb->csum        <-- invalid
      ( the device is considered reliable, no checksum should be
        calculated. Transmit the packet as-is. )

  skb->ip_summed == CHECKSUM_HW:
      skb->h.th->check <-- pseudo-header checksum of the
                           transport PDU
      skb->csum        <-- offset of the transport-PDU checksum
                           field from the beginning of the
                           transport-PDU header. That is:
                             skb->csum ==
                               (unsigned char *)&skb->h.th->check
                                 - skb->h.raw
      ( the driver / device must complete the PDU checksum
        calculation, and store the result at
        skb->h.raw + skb->csum, before transmitting the packet )

  skb->ip_summed == CHECKSUM_NONE:
      skb->h.th->check <-- fully calculated transport-PDU checksum
      skb->csum        <-- invalid
      ( all checksum calculations have been performed by the
        kernel. Just transmit the packet. )

What types of packets are actually delivered to a driver (and
therefore what types of packets the driver must be prepared to handle)
depends on the features it has declared. Namely:

  NETIF_F_NO_CSUM:
      Only CHECKSUM_UNNECESSARY packets.

  NETIF_F_IP_CSUM:
      CHECKSUM_NONE or CHECKSUM_HW packets. Furthermore,
      CHECKSUM_HW packets will only contain TCP or UDP transport
      PDUs.

  NETIF_F_HW_CSUM:
      CHECKSUM_NONE or CHECKSUM_HW packets.
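To see why the CHECKSUM_HW hand-off works, here is a small user-space
sketch (not kernel code) of the arithmetic. The stack seeds the
checksum field with the folded, non-complemented pseudo-header sum;
the "device" then sums the whole PDU, seed included, complements the
result and stores it at the given offset. All the "tx_" names are
invented for this illustration, bytes are summed as big-endian 16-bit
words, and the checksum offset is assumed to be even.

```c
#include <stddef.h>
#include <stdint.h>

/* Add len bytes, read as big-endian 16-bit words, to a running sum. */
uint32_t tx_csum_add(const uint8_t *buf, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8;  /* pad odd byte with 0 */
    return sum;
}

/* Fold the carries back in (ones-complement addition). */
uint16_t tx_csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Sum of the IPv4 pseudo-header: addresses, protocol, PDU length. */
uint32_t tx_pseudo_sum(const uint8_t saddr[4], const uint8_t daddr[4],
                       uint8_t proto, uint16_t pdu_len)
{
    uint32_t sum = tx_csum_add(saddr, 4, 0);

    sum = tx_csum_add(daddr, 4, sum);
    return sum + proto + pdu_len;
}

/* Stack side (CHECKSUM_HW): seed the checksum field with the folded,
 * NOT complemented, pseudo-header sum. csum_off must be even. */
void tx_stack_seed(uint8_t *pdu, size_t pdu_len, size_t csum_off,
                   const uint8_t saddr[4], const uint8_t daddr[4],
                   uint8_t proto)
{
    uint16_t seed = tx_csum_fold(tx_pseudo_sum(saddr, daddr, proto,
                                               (uint16_t)pdu_len));

    pdu[csum_off] = (uint8_t)(seed >> 8);
    pdu[csum_off + 1] = (uint8_t)(seed & 0xff);
}

/* Device side: sum the whole PDU (seed included), complement, store. */
void tx_device_complete(uint8_t *pdu, size_t pdu_len, size_t csum_off)
{
    uint16_t csum = (uint16_t)~tx_csum_fold(tx_csum_add(pdu, pdu_len, 0));

    pdu[csum_off] = (uint8_t)(csum >> 8);
    pdu[csum_off + 1] = (uint8_t)(csum & 0xff);
}

/* Reference (the CHECKSUM_NONE case): the full checksum computed in
 * one go, treating the checksum field as zero. */
uint16_t tx_full_checksum(const uint8_t *pdu, size_t pdu_len,
                          size_t csum_off, const uint8_t saddr[4],
                          const uint8_t daddr[4], uint8_t proto)
{
    uint32_t sum = tx_pseudo_sum(saddr, daddr, proto, (uint16_t)pdu_len);

    sum = tx_csum_add(pdu, csum_off, sum);
    sum = tx_csum_add(pdu + csum_off + 2, pdu_len - csum_off - 2, sum);
    return (uint16_t)~tx_csum_fold(sum);
}
```

Calling "tx_stack_seed()" followed by "tx_device_complete()" on a PDU
should leave the same value in the checksum field that
"tx_full_checksum()" returns for it. That is the whole point of the
seeding: the device does not need to know anything about the
pseudo-header.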
An example of a real-world driver that advertises, and handles, the
NETIF_F_HW_CSUM capability is the e100 driver found in
"drivers/net/e100/". The downstream checksum-offloading features are
treated in "drivers/net/e100/e100_main.c:e100_prepare_xmit_buff()".

** Upstream (receive-path) checksum offloading

Now let's turn our attention to upstream (receive-path) checksum
offloading. It is essential to point out that the capabilities
declared through "dev->features" have nothing to do with the upstream
handling of checksum offloading; those are only relevant for the
downstream path.

In the eyes of the networking stack, and as far as upstream checksum
offloading is concerned, networking devices belong to three classes,
according to their capabilities and operational mode.

The first class is "dumb" devices, which can do no hardware checksum
calculation at all. For every packet received by such a device the
respective driver must set the "skb->ip_summed" field to
"CHECKSUM_NONE" before delivering the skb to the stack. The network
stack will then calculate and verify (try to match) the transport-PDU
checksum, and act accordingly.

The second class is devices that can "opaquely" calculate and verify
any checksums in the incoming packets and report success or failure to
the driver (without actually reporting the values of any of the
calculated checksums---hence "opaquely"). For such devices, before
passing the skb to the network stack, the driver must set
"skb->ip_summed" to "CHECKSUM_UNNECESSARY" if the device reported
success ("packet is good"), and to "CHECKSUM_NONE" if it reported
failure ("packet is bad", or "I can't parse this"). It is interesting
to note that in this latter case the transport-PDU checksum will
actually be re-calculated by the kernel (although the packet is a
priori known to be bad), and only then will the packet be dropped.
The disadvantage of this approach is that packets flagged as "bogus"
by the hardware are still considered by the kernel, and waste some
resources. The advantage is robustness: devices that claim to be able
to calculate the checksums of all possible packets (of past, present
and future transport-layer protocols) are probably lying.
Additionally, it is not impossible to imagine a device that confuses
the "packet is bad" verdict with the "can't parse this" one. In this
light, giving bad-looking packets a second chance may not be a very
bad idea.

The third class is "smart" devices that are able to calculate, and
make available to the CPU, a partial checksum of the transport-layer
PDU. Specifically, the partial checksum calculated by the device must
cover (for TCP and UDP transports) the PDU header and the PDU text,
but *not* the pseudo-header. For these devices the driver must obtain
the hardware-calculated partial checksum and store it in "skb->csum",
then set "skb->ip_summed" to "CHECKSUM_HW", and finally push the
packet upstream. The network stack will then complete the checksum in
"skb->csum" by calculating---on top of it---the pseudo-header
checksum, and, if everything turns out ok, accept the packet.
Otherwise the stack will recalculate the PDU checksum from scratch,
and if it still fails, drop the packet.
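The division of labour for the "smart" class can be sketched in user
space as follows. This is only a model under stated assumptions, not
what any particular NIC or the kernel literally does: the "rx_" names
are invented for the illustration, bytes are summed as big-endian
16-bit words, and "good" is expressed as the folded sum being all
ones (the kernel's helpers return the complement of that, so they test
for zero instead).

```c
#include <stddef.h>
#include <stdint.h>

/* Add len bytes, read as big-endian 16-bit words, to a running sum. */
uint32_t rx_csum_add(const uint8_t *buf, size_t len, uint32_t sum)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (len & 1)
        sum += (uint32_t)buf[len - 1] << 8;  /* pad odd byte with 0 */
    return sum;
}

/* Fold the carries back in (ones-complement addition). */
uint16_t rx_csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Device side: partial sum over the PDU header and text only.
 * The driver would store this value in skb->csum. */
uint16_t rx_device_partial(const uint8_t *pdu, size_t pdu_len)
{
    return rx_csum_fold(rx_csum_add(pdu, pdu_len, 0));
}

/* Stack side: add the pseudo-header on top of the partial sum.
 * Returns 1 if the packet checks out, 0 otherwise. */
int rx_stack_accepts(uint16_t partial, const uint8_t saddr[4],
                     const uint8_t daddr[4], uint8_t proto,
                     uint16_t pdu_len)
{
    uint32_t sum = partial;

    sum = rx_csum_add(saddr, 4, sum);
    sum = rx_csum_add(daddr, 4, sum);
    sum += proto;
    sum += pdu_len;
    return rx_csum_fold(sum) == 0xffff;
}
```

In this model a single flipped bit in the PDU changes the partial sum
reported by the device, so "rx_stack_accepts()" returns 0 and the
stack would fall back to recalculating the checksum from scratch.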
Summarizing, the operations that need to be performed by a driver for
upstream checksum offloading are:

  Dumb devices:
    - set "skb->ip_summed" to "CHECKSUM_NONE"
    - push the packet upstream

  Opaque devices:
    - if the device reported success ("good packet"),
      set "skb->ip_summed" to "CHECKSUM_UNNECESSARY";
      else, if the device reported failure ("bad packet" or
      "I can't parse this"), set "skb->ip_summed" to "CHECKSUM_NONE"
    - push the packet upstream

  Smart devices:
    - obtain the partial transport-PDU checksum from the device
      (it must be the checksum of the PDU header and the PDU data,
      without taking into account the pseudo-header)
    - set "skb->csum" to the partial checksum calculated by the
      NIC hardware
    - set "skb->ip_summed" to "CHECKSUM_HW"
    - push the packet upstream

The driver for the "e100" NIC included in the kernel sources handles
upstream checksum offloading for "opaque" and "smart" variants of the
e100 adapter. Examples can be found in
"drivers/net/e100/e100_main.c", specifically in the function
"e100_rx_srv()". See also the "forensic evidence" section just below.

**
** Forensic evidence (collected from the crime-scene of 2.4.20)
**

* Downstream (transmit-path) checksum offloading:

If the driver does not advertise compatible checksum-offloading
features, then the packet is checksummed by the kernel, as can be seen
in "net/core/dev.c:dev_queue_xmit()":

    /* If packet is not checksummed and device does not support
     * checksumming for this protocol, complete checksumming here.
     */
    if (skb->ip_summed == CHECKSUM_HW &&
        (!(dev->features&(NETIF_F_HW_CSUM|NETIF_F_NO_CSUM)) &&
         (!(dev->features&NETIF_F_IP_CSUM) ||
          skb->protocol != htons(ETH_P_IP)))) {
            if ((skb = skb_checksum_help(skb)) == NULL)
                    return -ENOMEM;
    }

An observant reader who traces this code down the
"skb_checksum_help()" execution path will see that only the TCP header
and TCP text are actually checksummed there, but *not* the
pseudo-header prescribed by RFC-793 (TCP) and RFC-768 (UDP). This is
because the pseudo-header is always checksummed higher up in the
network stack.
For TCP packets this happens in the function
"net/ipv4/tcp_ipv4.c:tcp_v4_send_check()":

    <reformatted>
    /* This routine computes an IPv4 TCP checksum. */
    void tcp_v4_send_check(struct sock *sk, struct tcphdr *th,
                           int len, struct sk_buff *skb)
    {
        if (skb->ip_summed == CHECKSUM_HW) {
            th->check = ~tcp_v4_check(th, len, sk->saddr,
                                      sk->daddr, 0);
            skb->csum = offsetof(struct tcphdr, check);
        } else {
            th->check = tcp_v4_check(th, len, sk->saddr, sk->daddr,
                                     csum_partial((char *)th,
                                                  th->doff<<2,
                                                  skb->csum));
        }
    }

As you can see, if ip_summed == CHECKSUM_HW, then the pseudo-header
checksum is calculated, and "skb->csum" is set to the offset of the
checksum field from the beginning of the TCP header. Otherwise, the
checksums of the pseudo-header and of the TCP header are calculated on
top of "skb->csum", which presumably holds the text checksum computed
earlier.

"tcp_v4_send_check()" is called through the jump-table structure
"ipv4_specific", which is initialized in "net/ipv4/tcp_ipv4.c" like
this:

    struct tcp_func ipv4_specific = {
        ip_queue_xmit,
        tcp_v4_send_check,
        tcp_v4_rebuild_header,
        ...

It is called from "net/ipv4/tcp_output.c", function
"tcp_transmit_skb()", like this:

        ...
        TCP_ECN_send(sk, tp, skb, tcp_header_size);
    }
    tp->af_specific->send_check(sk, th, skb->len, skb);

    if (tcb->flags & TCPCB_FLAG_ACK)
        tcp_event_ack_sent(sk);
    ...
The "e100" driver, which declares the capability NETIF_F_HW_CSUM, used
to handle downstream checksum offloading like this in the version
shipped with the 2.4.20 kernel ("drivers/net/e100/e100_main.c",
function "e100_prepare_xmit_buff()"):

    if (skb->ip_summed == CHECKSUM_HW) {
        const struct iphdr *ip = skb->nh.iph;

        if ((ip->protocol == IPPROTO_TCP) ||
            (ip->protocol == IPPROTO_UDP)) {
            u16 *chksum;

            tcb->tcbu.ipcb.ip_activation_high =
                IPCB_HARDWAREPARSING_ENABLE;
            tcb->tcbu.ipcb.ip_schedule |=
                IPCB_TCPUDP_CHECKSUM_ENABLE;

            if (ip->protocol == IPPROTO_TCP) {
                struct tcphdr *tcp;

                tcp = (struct tcphdr *) ((u32 *) ip + ip->ihl);
                chksum = &(tcp->check);
                tcb->tcbu.ipcb.ip_schedule |= IPCB_TCP_PACKET;
            } else {
                struct udphdr *udp;

                udp = (struct udphdr *) ((u32 *) ip + ip->ihl);
                chksum = &(udp->check);
            }

            *chksum = e100_pseudo_hdr_csum(ip);
        }
    }

which **is wrong**, since it recalculates the transport-PDU
pseudo-header checksum that has already been calculated by the
networking stack. In later versions this was fixed.
* Upstream (receive-path) checksum offloading:

The transport-PDU checksum of an incoming TCP packet is checked by the
stack only if "skb->ip_summed" is not "CHECKSUM_UNNECESSARY". This
happens in "net/ipv4/tcp_ipv4.c", function "tcp_v4_rcv()", like this:

    if ((skb->ip_summed != CHECKSUM_UNNECESSARY &&
         tcp_v4_checksum_init(skb) < 0))
        goto bad_packet;

"tcp_v4_checksum_init()" is implemented in the same file and goes like
this:

    <reformatted>
    static int tcp_v4_checksum_init(struct sk_buff *skb)
    {
        if (skb->ip_summed == CHECKSUM_HW) {
            skb->ip_summed = CHECKSUM_UNNECESSARY;
            if (!tcp_v4_check(skb->h.th,skb->len,skb->nh.iph->saddr,
                              skb->nh.iph->daddr,skb->csum))
                return 0;

            NETDEBUG(if (net_ratelimit())
                     printk(KERN_DEBUG "hw tcp v4 csum failed\n"));
            skb->ip_summed = CHECKSUM_NONE;
        }
        if (skb->len <= 76) {
            if (tcp_v4_check(skb->h.th,skb->len,skb->nh.iph->saddr,
                             skb->nh.iph->daddr,
                             skb_checksum(skb, 0, skb->len, 0)))
                return -1;
            skb->ip_summed = CHECKSUM_UNNECESSARY;
        } else {
            skb->csum = ~tcp_v4_check(skb->h.th,skb->len,
                                      skb->nh.iph->saddr,
                                      skb->nh.iph->daddr,0);
        }
        return 0;
    }

Notice that this function fully resolves the checksum verification
issue only for packets that are checksum-offloaded, and for small
packets. In all other cases it just calculates the pseudo-header
checksum and stores it in "skb->csum"; the full checksum calculation
is deferred until later.
The "e100" driver handles upstream checksum offloading in
"drivers/net/e100/e100_main.c", function "e100_rx_srv()", like this:

    /* set the checksum info */
    if (bdp->flags & DF_CSUM_OFFLOAD) {
        if (bdp->rev_id >= D102_REV_ID) {
            skb->ip_summed = e100_D102_check_checksum(rfd);
        } else {
            skb->ip_summed = e100_D101M_checksum(bdp, skb);
        }
    } else {
        skb->ip_summed = CHECKSUM_NONE;
    }

The "e100_D102" variant operates in "opaque mode", like this:

    /* Check the D102 RFD flags to see if the checksum passed */
    static unsigned char
    e100_D102_check_checksum(rfd_t *rfd)
    {
        if (((le16_to_cpu(rfd->rfd_header.cb_status)) & RFD_PARSE_BIT)
            && (((rfd->rcvparserstatus & CHECKSUM_PROTOCOL_MASK) ==
                 RFD_TCP_PACKET)
                || ((rfd->rcvparserstatus & CHECKSUM_PROTOCOL_MASK) ==
                    RFD_UDP_PACKET))
            && (rfd->checksumstatus & TCPUDP_CHECKSUM_BIT_VALID)
            && (rfd->checksumstatus & TCPUDP_CHECKSUM_VALID)) {
            return CHECKSUM_UNNECESSARY;
        }
        return CHECKSUM_NONE;
    }

while the "e100_D101M" operates in "smart mode":

    static unsigned char
    e100_D101M_checksum(struct e100_private *bdp, struct sk_buff *skb)
    {
        unsigned short proto = (skb->protocol);

        if (proto == __constant_htons(ETH_P_IP)) {
            skb->csum = get_unaligned((u16 *) (skb->tail));
            return CHECKSUM_HW;
        }
        return CHECKSUM_NONE;
    }

Happy Hacking
/npat

-- 
When it is incorrect, it is, at least *authoritatively* incorrect
  -- Douglas Adams, The Hitchhiker's Guide to the Galaxy