net_device features (was: "Are "skb->data" physically continuous?")

Nick Patavalis <npat@inaccessnetworks.com> · Wed, 17 Sep 2003 19:50:44 +0300

Thanks to your helpful replies, I think I have now reached a rather
consistent understanding of how the checksum-offloading and
scatter-gather features of the Linux networking stack operate. What
follows is the scatter-gather description I sent a couple of days ago,
polished up a bit and extended to also cover the checksum-offloading
issues. Thanks again, and sorry for my (sometimes silly) questions.

Please excuse my linguistic ... atrocities (as English is not my
native tongue) and feel free to ask questions, send corrections and
suggestions, comment, or rant about it.

Here it goes...

**
** Features of a Networking Driver / Device
**

The "net_device" structure (defined in "include/linux/netdevice.h"),
which is filled-in by a net driver at initialization time, includes a
field called "features". By setting certain bits in this field the
driver can inform the networking stack of its capabilities. As of
2.4.20 the following feature masks are defined (in
"include/linux/netdevice.h"), and can be declared by the driver:

    NETIF_F_SG
        Scatter/gather IO.

    NETIF_F_IP_CSUM
        Can checksum only TCP/UDP over IPv4.

    NETIF_F_NO_CSUM
        Does not require checksum, e.g. loopback.

    NETIF_F_HW_CSUM
        Can checksum all the packets.

    NETIF_F_DYNALLOC
        Self-destructible device.

    NETIF_F_HIGHDMA
        Can DMA to high memory.

    NETIF_F_FRAGLIST
        Scatter/gather IO (for skbs chained through "frag_list").

    NETIF_F_HW_VLAN_TX
        Transmit VLAN hw acceleration

    NETIF_F_HW_VLAN_RX
        Receive VLAN hw acceleration

    NETIF_F_HW_VLAN_FILTER
        Receive filtering on VLAN

    NETIF_F_VLAN_CHALLENGED
        Device cannot handle VLAN packets

What follows is a rather detailed description of what the following
features imply for device-driver writers:

  NETIF_F_SG
  NETIF_F_NO_CSUM
  NETIF_F_IP_CSUM
  NETIF_F_HW_CSUM

In summary: NETIF_F_SG must be enabled by drivers that are willing and
able to handle "skb"s whose data are not physically continuous
(i.e. that are fragmented). NETIF_F_NO_CSUM must be enabled by drivers
servicing communication paths that are by-nature reliable, so
checksums are not needed to protect the data from transmission
errors. NETIF_F_IP_CSUM is for drivers and devices that can perform
(presumably hardware-assisted) checksum-calculations *only* for TCP
and UDP packets over IPv4. Finally NETIF_F_HW_CSUM is for drivers and
devices that can perform hardware-assisted checksum calculations for
all kinds of packets.
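
To make the idea of "declaring features" concrete, here is a minimal
user-space sketch of what a driver's initialization routine does with
the "features" field. It is only a model: "struct net_device_model",
"mynic_init()" and "declares_tx_csum()" are hypothetical names invented
for this illustration, and the numeric values of the masks are
stand-ins (a real driver takes them from "include/linux/netdevice.h").

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the masks from "include/linux/netdevice.h". The numeric
 * values are illustrative; real drivers must use the kernel header. */
#define NETIF_F_SG       0x01
#define NETIF_F_IP_CSUM  0x02
#define NETIF_F_NO_CSUM  0x04
#define NETIF_F_HW_CSUM  0x08

/* Hypothetical, cut-down model of "struct net_device". */
struct net_device_model {
        const char *name;
        int features;
};

/* What a driver's initialization routine would do: declare
 * scatter-gather plus TCP/UDP-over-IPv4 checksum offloading. */
void mynic_init(struct net_device_model *dev)
{
        assert(dev != NULL);
        dev->name = "mynic0";
        dev->features = NETIF_F_SG | NETIF_F_IP_CSUM;
}

/* Non-zero if the device declared any transmit-checksum feature. */
int declares_tx_csum(const struct net_device_model *dev)
{
        return (dev->features &
                (NETIF_F_IP_CSUM | NETIF_F_NO_CSUM | NETIF_F_HW_CSUM)) != 0;
}
```

After "mynic_init()" runs, the stack (here: any code inspecting
"dev->features") can test individual bits with a plain bitwise AND,
which is exactly how the kernel excerpts later in this text do it.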

**
** Scatter-Gather: the "NETIF_F_SG" feature
**

Among the feature bits, shown above, the "NETIF_F_SG" is the one the
driver sets to indicate that it can do scatter-gather
packet-processing. If "NETIF_F_SG" is not set, then the networking
stack will make sure that the "skb"s hold *physically-continuous* data
before passing them to the driver. This is taken care of in
"net/core/dev.c:dev_queue_xmit()" like this:

    if (skb_shinfo(skb)->frag_list &&
        !(dev->features&NETIF_F_FRAGLIST) &&
        skb_linearize(skb, GFP_ATOMIC) != 0) {
            kfree_skb(skb);
            return -ENOMEM;
    }

    /* Fragmented skb is linearized if device does not support SG,
     * or if at least one of fragments is in highmem and device
     * does not support DMA from it.
     */
    if (skb_shinfo(skb)->nr_frags &&
        (!(dev->features&NETIF_F_SG) || illegal_highdma(dev, skb)) &&
        skb_linearize(skb, GFP_ATOMIC) != 0) {
            kfree_skb(skb);
            return -ENOMEM;
    }

As a result, when a driver's "hard_start_xmit()" function receives an
skb, it knows that the data to be transmitted start at "skb->data",
that their length is "skb->len", and that they are virtually and
physically continuous. The driver can therefore pass the
"skb->data" pointer directly to the device's DMA controller, after converting
it to a physical address, and synchronizing the relevant cache entries
(by calling something like "pci_map_single()").
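
The two tests quoted above from "dev_queue_xmit()" can be condensed
into a single predicate. The following is a user-space sketch of that
decision only, not kernel code: "must_linearize()" is a hypothetical
helper, the mask values are stand-ins, and the "illegal_highdma"
argument simply stands for the result of the kernel's
"illegal_highdma(dev, skb)" check.

```c
#include <assert.h>

/* Stand-in values; the real masks live in "include/linux/netdevice.h". */
#define NETIF_F_SG        0x01
#define NETIF_F_FRAGLIST  0x40

/* Returns 1 if the stack would have to linearize the skb before
 * handing it to the driver, 0 if it can be passed as-is.
 *   has_frag_list   : skb_shinfo(skb)->frag_list is non-NULL
 *   nr_frags        : skb_shinfo(skb)->nr_frags
 *   illegal_highdma : a fragment sits in high memory the device
 *                     cannot DMA from */
int must_linearize(int features, int has_frag_list,
                   unsigned int nr_frags, int illegal_highdma)
{
        if (has_frag_list && !(features & NETIF_F_FRAGLIST))
                return 1;
        if (nr_frags && (!(features & NETIF_F_SG) || illegal_highdma))
                return 1;
        return 0;
}
```

For example, a paged skb with three fragments is linearized for a
device without "NETIF_F_SG", and passed through untouched for a device
that declares it (unless one of the pages is unreachable by DMA).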

If---on the other hand---the driver sets the "NETIF_F_SG" bit in the
"features" field of the "net_device" structure (declaring that it
*can* do scatter-gather DMA), then any skb passed to it, might
very-well hold data that are not physically continuous (and sometimes
not even virtually continuous). In this case for every "skb" passed to
the driver the networking stack also fills-in a "skb_shared_info"
structure, defined in "include/linux/skbuff.h", like this:

    struct skb_shared_info {
            atomic_t        dataref;
            unsigned int    nr_frags;
            struct sk_buff  *frag_list;
            skb_frag_t      frags[MAX_SKB_FRAGS];
    };

This structure is pointed-to by the "end" field of the "sk_buff"
structure, so it can be accessed by the driver as:

    (struct skb_shared_info *)skb->end

or even better using the macro "skb_shinfo", which is essentially the
same:

    skb_shinfo(skb)

It should be obvious that, in the scatter-gather case, the frame to be
transmitted consist of a sequence of fragments (parts), each of which
keeps a virtually and physically continuous subset of the data. The
start of the first fragment is pointed by "skb->data" (as in the
non-SG case), but its length (in bytes) is "skb->len - skb->data_len"
(which can also be accessed using the macro "skb_headlen()" defined in
"include/linux/skbuff.h"). "skb->len" is still the length of the
*full* frame (the sum of the lengths of all the fragments), and
"skb->data_len" is the total length of all the data fragments not
counting the first "header" fragment pointed to by "skb->data". A way
to check whether an skb is *not* physically continuous (i.e. is
non-linear) is to test whether "skb->data_len" is non-zero; there is
even a macro for this ("skb_is_nonlinear()") defined in
"include/linux/skbuff.h". After the initial "header" fragment, there
are exactly
"skb_shinfo(skb)->nr_frags" fragments following. Each of these
fragments is described by a "skb_frag_t" structure defined (in
"include/linux/skbuff.h") as:

    struct skb_frag_struct
    {
            struct page *page;
            __u16 page_offset;
            __u16 size;
    };

    ...

    typedef struct skb_frag_struct skb_frag_t;

The "skb_frag_struct" structure corresponding to the I'th fragment can
be accessed as:

    skb_shinfo(skb)->frags[I];

So the data of a non-linear "sk_buff" "skb" consist of the following
parts, which are themselves linear (virtually and physically
continuous):

            virtual       virtual
            addr of       addr of
  part # :  first byte    last byte
  -----------------------------------------
  0      :  skb->data ... skb->data + skb->len - skb->data_len - 1
  1      :  fr_adr(0) ... fr_adr(0) + fr_sz(0) - 1
  .
  .
  nfrags :  fr_adr(nfrags - 1)
                      ... fr_adr(nfrags - 1) + fr_sz(nfrags - 1) - 1

where:

   "nfrags" is "skb_shinfo(skb)->nr_frags"

and

   "fr_adr(i)" is "fr_pg_adr(i) + fr_pg_ofs(i)"
     "fr_pg_adr(i)" is "page_address(skb_shinfo(skb)->frags[i].page)"
     "fr_pg_ofs(i)" is "skb_shinfo(skb)->frags[i].page_offset"
   "fr_sz(i)" is "skb_shinfo(skb)->frags[i].size"

   NOTICE: "fr_adr", "fr_sz", "fr_pg_adr", and "fr_pg_ofs" are just
     notation introduced for the convenience of this discussion; they
     are not actually defined as macros in the kernel. "page_address",
     on the other hand, is a real macro defined in "include/linux/mm.h".

It also holds that:

  skb->data_len == fr_sz(0) + ... + fr_sz(nfrags - 1)

For an example of how these are used in a real driver see
"e100_main.c", and especially the function "e100_prepare_xmit_buff()"
which contains all the details of handling the fragment-sequence.
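
The layout described above can also be tried out in user space. The
sketch below models an skb with a linear "header" part and a few page
fragments, and copies all parts, in order, into one contiguous buffer
(a real SG driver would instead build one DMA descriptor per part).
Everything here is hypothetical scaffolding: "struct skb_model",
"struct frag_model" and the two functions are not kernel names, and
the "page" member already holds a virtual address, playing the role of
"page_address(frag->page)".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FRAGS 6     /* stand-in for MAX_SKB_FRAGS */

/* Model of one page fragment (compare "skb_frag_t"). */
struct frag_model {
        const unsigned char *page;      /* page_address() equivalent */
        unsigned short page_offset;
        unsigned short size;
};

/* Model of the few sk_buff fields that matter for scatter-gather. */
struct skb_model {
        const unsigned char *data;      /* start of part 0 */
        unsigned int len;               /* length of the full frame */
        unsigned int data_len;          /* bytes held in frags[] */
        unsigned int nr_frags;
        struct frag_model frags[MAX_FRAGS];
};

/* Length of part 0, i.e. what skb_headlen() computes. */
unsigned int skb_model_headlen(const struct skb_model *skb)
{
        return skb->len - skb->data_len;
}

/* Copy part 0 and then every fragment into "out". Returns the number
 * of bytes copied, or 0 if "out" is too small or the skb is
 * inconsistent (fragment sizes do not add up to data_len). */
size_t skb_model_gather(const struct skb_model *skb,
                        unsigned char *out, size_t out_sz)
{
        size_t pos;
        unsigned int i;

        if (skb->data_len > skb->len || skb->len > out_sz ||
            skb->nr_frags > MAX_FRAGS)
                return 0;

        pos = skb_model_headlen(skb);
        memcpy(out, skb->data, pos);

        for (i = 0; i < skb->nr_frags; i++) {
                const struct frag_model *f = &skb->frags[i];

                if (pos + f->size > skb->len)
                        return 0;
                memcpy(out + pos, f->page + f->page_offset, f->size);
                pos += f->size;
        }
        return pos == skb->len ? pos : 0;
}
```

The final comparison of "pos" against "skb->len" is the identity stated
above: the head length plus the sum of all fragment sizes must equal
the full frame length.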

**
** Checksum offloading: the "NETIF_F_??_CSUM" features
**

Checksum offloading is used to relieve the kernel (and thus the CPU)
from the burden of calculating transport-PDU checksums (TCP and UDP
checksums) when the NIC hardware can perform these calculations
itself. It can be used for both: downstream (transmitted) packets,
where the NIC calculates and embeds the checksum in the PDU, and
upstream (received) packets where the NIC pre-calculates and passes to
the networking-stack the transport-PDU checksum of the received
packet. The checksums of lower-layer PDUs do not come into the
checksum-offloading process, since the kernel always calculates the
network-layer checksums (IP header checksums) itself.

The two directions are treated differently by the kernel (and it would
be possible for a device to support checksum-offloading in one of them
only), so they will be described separately here, starting with the
downstream direction.

** Downstream (transmit-path) checksum offloading

In order for the networking stack to know that a device can, and is
willing to, provide checksum offloading services downstream, the
driver must---at initialization time---set either the
"NETIF_F_IP_CSUM" or the "NETIF_F_HW_CSUM" flag in the "features"
field of the "net_device" structure. If the "NETIF_F_HW_CSUM" flag is
set, then the kernel will suppress the calculation of transport-PDU
checksums for all kinds of packets, and will set the "skb->ip_summed"
field of the "skb"s delivered to the driver to
"CHECKSUM_HW". Furthermore the "skb->csum" field will be set to the
byte-offset of the checksum-field in the transport-PDU header. In order
to help the hardware checksum-calculation unit, the kernel will also
compute the pseudo-header transport-PDU checksum and store it in the
checksum field inside the PDU. The "NETIF_F_IP_CSUM" case is almost
identical, with the only difference that the kernel will instead fully
calculate transport-PDU checksums and set "skb->ip_summed" to
CHECKSUM_NONE, for any packet other than the ones carrying TCP and UDP
PDUs.

Another more ... radical checksum-offloading mode, can be indicated by
setting the "NETIF_F_NO_CSUM" flag in "dev->features". In this case
the kernel will completely suppress checksum calculation, under the
assumption that the physical medium used by the device is totally
reliable, and no checksum protection is required. Furthermore the
kernel will set the "skb->ip_summed" field of the skbs delivered to
the driver to "CHECKSUM_UNNECESSARY". This mode must only be used for
devices that do not actually transmit packets to real-world media
(e.g. the loopback device).

In summary (from the kernel's viewpoint):

  NETIF_F_NO_CSUM set:
      no checksum is calculated, the medium is considered reliable.

  NETIF_F_IP_CSUM set: Only pseudo-header checksum is calculated for
      the transport-PDU of TCP and UDP packets. Full transport-PDU
      checksum is calculated for any other packet.

  NETIF_F_HW_CSUM set: Only pseudo-header transport-PDU checksum is
      calculated for all packets (where it makes sense).

From a driver's point of view, the following are possible for an "skb"
delivered to it by the stack:

   skb->ip_summed == CHECKSUM_UNNECESSARY:
      skb->h.th->check <-- invalid
      skb->csum        <-- invalid
      ( the device is considered reliable, no checksum should be 
        calculated. transmit the packet as-is. )

   skb->ip_summed == CHECKSUM_HW:
      skb->h.th->check <-- pseudo-header checksum of 
                           the transport-PDU
      skb->csum        <-- offset of the transport-PDU checksum
                           field from the beginning of the
                           transport-PDU header. That is:
                           skb->csum ==
                             (u8 *)&skb->h.th->check - skb->h.raw
      ( the driver / device must complete the PDU checksum
        calculation, and store the result at
        skb->h.raw + skb->csum, before transmitting the packet )

   skb->ip_summed == CHECKSUM_NONE:
      skb->h.th->check <-- fully calculated transport-PDU checksum
      skb->csum        <-- invalid
      ( all checksum calculations have been performed by the
        kernel. Just transmit the packet. )
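
To see why the CHECKSUM_HW arrangement works, here is a user-space
sketch of the arithmetic. It assumes (as the "tcp_v4_send_check()"
excerpt further down suggests) that the stack seeds the checksum field
with the folded, non-inverted pseudo-header sum; the device then only
has to sum the whole transport PDU, invert the result, and store it at
the offset given by "skb->csum". All function names are hypothetical,
and 16-bit words are read explicitly in network byte order so the
sketch does not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add "len" bytes, taken as big-endian 16-bit words, to "sum". */
uint32_t csum_add_bytes(const uint8_t *p, size_t len, uint32_t sum)
{
        size_t i;

        for (i = 0; i + 1 < len; i += 2)
                sum += ((uint32_t)p[i] << 8) | p[i + 1];
        if (len & 1)
                sum += (uint32_t)p[len - 1] << 8;   /* pad odd byte */
        return sum;
}

/* Fold a 32-bit accumulator into 16 bits (one's-complement add). */
uint16_t csum_fold16(uint32_t sum)
{
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
}

/* Unfolded sum of the IPv4 pseudo-header. */
uint32_t pseudo_sum(uint32_t saddr, uint32_t daddr,
                    uint8_t proto, uint16_t len)
{
        return (saddr >> 16) + (saddr & 0xffff) +
               (daddr >> 16) + (daddr & 0xffff) + proto + len;
}

/* Stack side: seed the checksum field with the pseudo-header sum. */
void stack_seed_pseudo(uint8_t *pdu, size_t csum_off, uint32_t saddr,
                       uint32_t daddr, uint8_t proto, uint16_t len)
{
        uint16_t s = csum_fold16(pseudo_sum(saddr, daddr, proto, len));

        pdu[csum_off] = (uint8_t)(s >> 8);
        pdu[csum_off + 1] = (uint8_t)(s & 0xff);
}

/* Driver/device side: sum the whole PDU (seed included), invert,
 * and store the result back at the checksum offset. */
void device_complete_csum(uint8_t *pdu, size_t len, size_t csum_off)
{
        uint16_t c = (uint16_t)~csum_fold16(csum_add_bytes(pdu, len, 0));

        pdu[csum_off] = (uint8_t)(c >> 8);
        pdu[csum_off + 1] = (uint8_t)(c & 0xff);
}

/* Reference: the full checksum as software would compute it. Note
 * that it zeroes the checksum field of "pdu" as a side effect. */
uint16_t full_sw_csum(uint8_t *pdu, size_t len, size_t csum_off,
                      uint32_t saddr, uint32_t daddr, uint8_t proto)
{
        pdu[csum_off] = 0;
        pdu[csum_off + 1] = 0;
        return (uint16_t)~csum_fold16(csum_add_bytes(pdu, len,
                        pseudo_sum(saddr, daddr, proto, (uint16_t)len)));
}
```

Seeding and then completing yields the same 16 bits as
"full_sw_csum()", because the seed sits in a 16-bit-aligned slot of
the PDU and therefore simply takes part in the sum. This is also why a
driver that recomputes the pseudo-header checksum itself is doing
redundant work.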

What types of packets are actually delivered to a driver (and
therefore what types of packets the driver must be prepared to handle)
depends on the features it has declared. Namely:

    NETIF_F_NO_CSUM: only CHECKSUM_UNNECESSARY packets.

    NETIF_F_IP_CSUM: CHECKSUM_NONE or CHECKSUM_HW packets. Furthermore
      CHECKSUM_HW packets will only contain TCP or UDP transport PDUs.

    NETIF_F_HW_CSUM: CHECKSUM_NONE or CHECKSUM_HW packets.

An example of a real-world driver that advertises, and handles, the
NETIF_F_HW_CSUM capability is the e100 driver found in
"drivers/net/e100/". The downstream checksum-offloading features are
treated in "drivers/net/e100/e100_main.c:e100_prepare_xmit_buff()".

** Upstream (receive-path) checksum offloading

Now let's turn our attention to upstream (receive-path) checksum
offloading. It is essential to point-out that the capabilities
declared through "dev->features" have nothing to do with the upstream
handling of checksum offloading; those are only relevant for the
downstream path.

In the eyes of the networking stack, and as far as upstream checksum
offloading is concerned, networking devices belong in three classes,
according to their capabilities and operational mode.

The first class is "dumb" devices, that can do no hardware checksum
calculation at all. For every packet received by such devices the
respective driver must set the "skb->ip_summed" field to
"CHECKSUM_NONE" before delivering the skb to the stack. The network
stack will then calculate and verify (try to match) the transport-PDU
checksum, and act accordingly.

The second class is devices that can "opaquely" calculate and verify
any checksums in the incoming packets and report success-or-failure
to the driver (without actually reporting the values of any of the
calculated checksums---hence "opaquely"). For such devices, before
passing the skb to the network stack, the driver must set
"skb->ip_summed" to "CHECKSUM_UNNECESSARY" if the device reported
success ("packet is good"), and to "CHECKSUM_NONE" if it reported
failure ("packet is bad", or "I can't parse this"). It is interesting
to note that in this latter case the transport-PDU checksum will
actually be re-calculated by the kernel (although the packet is a-priori
known to be bad), and only then will the packet be dropped.  The
disadvantage of this approach is that packets flagged as "bogus" by the
hardware, are still considered by the kernel, and waste some
resources. The advantage is that devices that claim to be able to
calculate the checksums of all possible packets (of past, present and
future transport-layer protocols) are probably lying. Additionally it
is not impossible to imagine a device that confuses the "packet is
bad" with the "can't parse this" verdicts. Under this light, giving
bad-looking packets a second-chance may not be a very bad idea.

The third class is "smart" devices that are able to calculate, and
make available to the CPU, a partial-checksum of the transport-layer
PDU. Specifically the partial-checksum that must be calculated by the
device must cover (for TCP and UDP transports) the PDU-header and the
PDU-text, but *not* the pseudo-header. For these devices the driver
must obtain the hardware-calculated partial-checksum and store it in
"skb->csum", then set "skb->ip_summed" to CHECKSUM_HW, and finally push
the packet upstream. The network stack will then complete the checksum
in "skb->csum" by calculating---on top of it---the pseudo-header
checksum, and if everything turns-out ok, accept the packet. Otherwise
the stack will recalculate the PDU checksum from scratch, and if it
still fails, drop the packet.
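
The division of labour for the third ("smart") class can be sketched in
user space as follows. The device sums only the PDU header and text;
the stack adds the pseudo-header on top and accepts the packet if the
folded total is all-ones. The function names are hypothetical, the
pseudo-header is the IPv4 one, and 16-bit words are read explicitly in
network byte order; this is a model of the arithmetic, not of the
kernel's actual helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add "len" bytes, taken as big-endian 16-bit words, to "sum". */
uint32_t csum_add_bytes(const uint8_t *p, size_t len, uint32_t sum)
{
        size_t i;

        for (i = 0; i + 1 < len; i += 2)
                sum += ((uint32_t)p[i] << 8) | p[i + 1];
        if (len & 1)
                sum += (uint32_t)p[len - 1] << 8;   /* pad odd byte */
        return sum;
}

/* Fold a 32-bit accumulator into 16 bits (one's-complement add). */
uint16_t csum_fold16(uint32_t sum)
{
        while (sum >> 16)
                sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
}

/* Unfolded sum of the IPv4 pseudo-header. */
uint32_t pseudo_sum(uint32_t saddr, uint32_t daddr,
                    uint8_t proto, uint16_t len)
{
        return (saddr >> 16) + (saddr & 0xffff) +
               (daddr >> 16) + (daddr & 0xffff) + proto + len;
}

/* Device side: partial checksum over PDU header and text only.
 * The driver would store this value in "skb->csum". */
uint32_t device_partial_csum(const uint8_t *pdu, size_t len)
{
        return csum_add_bytes(pdu, len, 0);
}

/* Stack side: complete the partial sum with the pseudo-header.
 * Returns 1 if the packet's checksum is good, 0 otherwise. */
int stack_csum_ok(uint32_t hw_partial, uint32_t saddr, uint32_t daddr,
                  uint8_t proto, uint16_t len)
{
        return csum_fold16(hw_partial +
                           pseudo_sum(saddr, daddr, proto, len)) == 0xffff;
}
```

If "stack_csum_ok()" fails for a packet, the text above says that the
real stack does not drop it immediately but recomputes the checksum
from scratch first; that fallback is not modelled here.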

Summarizing, the operations that need to be performed by a driver for
upstream checksum offloading are:

  Dumb devices:

      - set "skb->ip_summed" to "CHECKSUM_NONE"

      - push the packet upstream

  Opaque devices:

      - If the device reported success ("good packet")

           set "skb->ip_summed" to "CHECKSUM_UNNECESSARY"

        else, if the device reported failure ("bad packet" or "I can't
        parse this")

           set "skb->ip_summed" to "CHECKSUM_NONE"

      - push the packet upstream

  Smart devices:

      - Obtain the partial transport-PDU checksum from the device (it
        must be the checksum of the PDU-header and the PDU-data,
        without taking into account the pseudo-header)

      - set "skb->csum" to the partial checksum calculated by the NIC
        hardware

      - set "skb->ip_summed" to "CHECKSUM_HW"

      - push the packet upstream
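
The summary above maps naturally onto a small switch statement. Here
is a user-space sketch of it; "enum rx_csum_class",
"struct rx_skb_model" and "rx_set_csum_info()" are hypothetical names
made up for this illustration, and the CHECKSUM_* values are
stand-ins for the definitions in "include/linux/skbuff.h".

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in values for the kernel's CHECKSUM_* constants. */
enum { CHECKSUM_NONE = 0, CHECKSUM_HW = 1, CHECKSUM_UNNECESSARY = 2 };

/* The three device classes described in the text. */
enum rx_csum_class { RX_CSUM_DUMB, RX_CSUM_OPAQUE, RX_CSUM_SMART };

/* Only the two sk_buff fields the receive path has to fill in. */
struct rx_skb_model {
        unsigned char ip_summed;
        unsigned int csum;
};

/* Fill in the checksum info of a received packet before it is pushed
 * upstream.
 *   hw_says_good : verdict of an "opaque" device (ignored otherwise)
 *   hw_partial   : partial checksum from a "smart" device
 *                  (ignored otherwise) */
void rx_set_csum_info(struct rx_skb_model *skb, enum rx_csum_class cls,
                      int hw_says_good, unsigned int hw_partial)
{
        assert(skb != NULL);
        skb->csum = 0;

        switch (cls) {
        case RX_CSUM_OPAQUE:
                skb->ip_summed = hw_says_good ? CHECKSUM_UNNECESSARY
                                              : CHECKSUM_NONE;
                break;
        case RX_CSUM_SMART:
                skb->csum = hw_partial;
                skb->ip_summed = CHECKSUM_HW;
                break;
        case RX_CSUM_DUMB:
        default:
                skb->ip_summed = CHECKSUM_NONE;
                break;
        }
}
```

In a real driver the call that follows would be the one that hands the
skb to the stack; that step is left out of the model.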

The driver for the "e100" NIC included in the kernel sources, handles
upstream checksum offloading for "opaque" and "smart" variants of the
e100 adapter. Examples can be found in
"drivers/net/e100/e100_main.c", specifically in the function
"e100_rx_srv()". See also the "forensic evidence" section just below.

**
** Forensic evidence (collected from the crime-scene of 2.4.20)
**

* Downstream (transmit-path) checksum offloading:

If the driver does not advertise compatible checksum-offloading
features, then the packet is checksummed by the kernel, as can be seen
in "net/core/dev.c:dev_queue_xmit()":

    /* If packet is not checksummed and device does not support
     * checksumming for this protocol, complete checksumming here.
     */
    if (skb->ip_summed == CHECKSUM_HW &&
        (!(dev->features&(NETIF_F_HW_CSUM|NETIF_F_NO_CSUM)) &&
         (!(dev->features&NETIF_F_IP_CSUM) ||
          skb->protocol != htons(ETH_P_IP)))) {
            if ((skb = skb_checksum_help(skb)) == NULL)
                    return -ENOMEM;
    }
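
Since the nested negations in that condition are easy to misread, here
is the same decision written as a user-space predicate with early
returns. It is only a restatement of the excerpt above:
"needs_sw_checksum()" is a hypothetical helper, the mask and
CHECKSUM_* values are stand-ins, and the protocol is passed in host
byte order to keep the sketch free of "htons()".

```c
#include <assert.h>

/* Stand-in values; the real ones come from the kernel headers. */
#define NETIF_F_IP_CSUM  0x02
#define NETIF_F_NO_CSUM  0x04
#define NETIF_F_HW_CSUM  0x08

#define CHECKSUM_NONE    0
#define CHECKSUM_HW      1

#define ETH_P_IP         0x0800
#define ETH_P_IPV6       0x86DD

/* Returns 1 if the stack itself must complete the checksum (the
 * skb_checksum_help() path), 0 if the packet can go to the driver
 * as-is. "proto" is the ethernet protocol in host byte order. */
int needs_sw_checksum(int features, int ip_summed, unsigned int proto)
{
        if (ip_summed != CHECKSUM_HW)
                return 0;       /* nothing was left to the device */
        if (features & (NETIF_F_HW_CSUM | NETIF_F_NO_CSUM))
                return 0;       /* device handles (or needs) no csum */
        if ((features & NETIF_F_IP_CSUM) && proto == ETH_P_IP)
                return 0;       /* IPv4 and device can do TCP/UDP */
        return 1;
}
```

So a device that only declares "NETIF_F_IP_CSUM" still gets its
non-IPv4 packets checksummed in software, while a "NETIF_F_HW_CSUM"
device never takes this path.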

An observant reader who traces this code down the
"skb_checksum_help()" execution path, will see that actually only the
TCP header and TCP text are checksummed, but *not* the pseudo-header as
prescribed by RFC-793 and RFC-768. This is because the pseudo-header
is always checksummed higher up in the networking stack. For TCP packets
this happens in the function "net/ipv4/tcp_ipv4.c:tcp_v4_send_check()":

    <reformatted>

    /* This routine computes an IPv4 TCP checksum. */
    void tcp_v4_send_check(struct sock *sk, struct tcphdr *th, int len, 
                           struct sk_buff *skb)
    {
        if (skb->ip_summed == CHECKSUM_HW) {
            th->check = ~tcp_v4_check(th, len, sk->saddr, sk->daddr, 0);
            skb->csum = offsetof(struct tcphdr, check);
        } else {
            th->check = tcp_v4_check(th, len, sk->saddr, sk->daddr,
                                     csum_partial((char *)th, 
                                                  th->doff<<2, 
                                                  skb->csum));
        }
    }

As you can see, if ip_summed == CHECKSUM_HW, then the pseudo-header
checksum is calculated, and "skb->csum" is set to the offset of the
checksum-field from the beginning of the TCP header. Otherwise, the
checksums of the pseudo-header and of the TCP header are calculated
on top of "skb->csum", which presumably holds the text-checksum
computed earlier. "tcp_v4_send_check()" is called
through the jump-table structure "ipv4_specific" which is initialized
in "net/ipv4/tcp_ipv4.c", like this:

    struct tcp_func ipv4_specific = {
        ip_queue_xmit,
        tcp_v4_send_check,
        tcp_v4_rebuild_header,

it is called from "net/ipv4/tcp_output.c", function
"tcp_transmit_skb()", like this:

                     ...
            TCP_ECN_send(sk, tp, skb, tcp_header_size);
    }
    tp->af_specific->send_check(sk, th, skb->len, skb);

    if (tcb->flags & TCPCB_FLAG_ACK)
            tcp_event_ack_sent(sk);
                     ...

The "e100" driver, which declares the capability NETIF_F_HW_CSUM, in
the version shipped with the 2.4.20 kernel used to handle downstream
checksum offloading like this ("drivers/net/e100/e100_main.c",
function "e100_prepare_xmit_buff()"):

    if (skb->ip_summed == CHECKSUM_HW) {
        const struct iphdr *ip = skb->nh.iph;

        if ((ip->protocol == IPPROTO_TCP) ||
            (ip->protocol == IPPROTO_UDP)) {
                u16 *chksum;

                tcb->tcbu.ipcb.ip_activation_high =
                    IPCB_HARDWAREPARSING_ENABLE;
                tcb->tcbu.ipcb.ip_schedule |=
                    IPCB_TCPUDP_CHECKSUM_ENABLE;

                if (ip->protocol == IPPROTO_TCP) {
                    struct tcphdr *tcp;

                    tcp = (struct tcphdr *) ((u32 *) ip + ip->ihl);
                    chksum = &(tcp->check);
                    tcb->tcbu.ipcb.ip_schedule |= IPCB_TCP_PACKET;
                } else {
                    struct udphdr *udp;

                    udp = (struct udphdr *) ((u32 *) ip + ip->ihl);
                    chksum = &(udp->check);
                }

                *chksum = e100_pseudo_hdr_csum(ip);
        }
    }

which **is wrong** since it recalculates the transport-PDU
pseudo-header checksum that has already been calculated by the
networking stack. In later versions this was fixed.

* Upstream (receive-path) checksum offloading:

The transport-PDU checksum of an incoming TCP-carrying
checksum-offloaded packet is checked only if "skb->ip_summed" is not
"CHECKSUM_UNNECESSARY", in "net/ipv4/tcp_ipv4.c", function
"tcp_v4_rcv()" like this:

    if ((skb->ip_summed != CHECKSUM_UNNECESSARY &&
         tcp_v4_checksum_init(skb) < 0))
            goto bad_packet;

"tcp_v4_checksum_init()" is implemented in the same file and goes like
this:

    <reformatted>

    static int tcp_v4_checksum_init(struct sk_buff *skb)
    {
        if (skb->ip_summed == CHECKSUM_HW) {
            skb->ip_summed = CHECKSUM_UNNECESSARY;
            if (!tcp_v4_check(skb->h.th,skb->len,skb->nh.iph->saddr,
                              skb->nh.iph->daddr,skb->csum))
                    return 0;

            NETDEBUG(if (net_ratelimit()) 
                printk(KERN_DEBUG "hw tcp v4 csum failed\n"));
            skb->ip_summed = CHECKSUM_NONE;
        }
        if (skb->len <= 76) {
            if (tcp_v4_check(skb->h.th,skb->len,skb->nh.iph->saddr,
                             skb->nh.iph->daddr,
                             skb_checksum(skb, 0, skb->len, 0)))
                    return -1;
            skb->ip_summed = CHECKSUM_UNNECESSARY;
        } else {
            skb->csum = ~tcp_v4_check(skb->h.th,skb->len,
                                      skb->nh.iph->saddr,
                                      skb->nh.iph->daddr,0);
        }
        return 0;
    }

Notice that this function fully resolves the checksum verification
issue only for packets that are checksum-offloaded, and for small
packets. In all other cases it just calculates and stores in
"skb->csum" the pseudo-header checksum; full checksum calculation is
deferred for later.

The "e100" driver, handles upstream checksum offloading in
"drivers/net/e100/e100_main.c" function "e100_rx_srv()", like this:

     /* set the checksum info */
     if (bdp->flags & DF_CSUM_OFFLOAD) {
             if (bdp->rev_id >= D102_REV_ID) {
                     skb->ip_summed = e100_D102_check_checksum(rfd);
             } else {
                     skb->ip_summed = e100_D101M_checksum(bdp, skb);
             }
     } else {
             skb->ip_summed = CHECKSUM_NONE;
     }

the "e100_D102" variant operates in "opaque mode" like this:

    /* Check the D102 RFD flags to see if the checksum passed */
    static unsigned char
    e100_D102_check_checksum(rfd_t *rfd)
    {
        if (((le16_to_cpu(rfd->rfd_header.cb_status)) & RFD_PARSE_BIT)
            && (((rfd->rcvparserstatus & CHECKSUM_PROTOCOL_MASK) ==
                 RFD_TCP_PACKET)
                || ((rfd->rcvparserstatus & CHECKSUM_PROTOCOL_MASK) ==
                    RFD_UDP_PACKET))
            && (rfd->checksumstatus & TCPUDP_CHECKSUM_BIT_VALID)
            && (rfd->checksumstatus & TCPUDP_CHECKSUM_VALID)) {
                return CHECKSUM_UNNECESSARY;
        }
        return CHECKSUM_NONE;
    }

while the "e100_D101M", operates in "smart mode":

    static unsigned char
    e100_D101M_checksum(struct e100_private *bdp, struct sk_buff *skb)
    {
        unsigned short proto = (skb->protocol);

        if (proto == __constant_htons(ETH_P_IP)) {

                skb->csum = get_unaligned((u16 *) (skb->tail));
                return CHECKSUM_HW;
        }
        return CHECKSUM_NONE;
    }

Happy Hacking
/npat

-- 
When it is incorrect, it is, at least *authoritatively* incorrect
  -- Douglas Adams, The Hitchhiker's Guide to the Galaxy