Re: Are "skb->data" physically continuous?

Nick Patavalis <npat@inaccessnetworks.com> · Mon, 15 Sep 2003 13:26:47 +0300

On Sun, Sep 14, 2003 at 02:40:54PM +0100, Jamie Lokier wrote:
> Shmulik Hen wrote:
> > On Sunday 14 September 2003 01:41 am, Nick Patavalis wrote:
> > > this assumption, but I have also heard that "zero-copy" networking
> > > was added to the kernel at some point. Zero-copy indicates that
> > > data come directly for user-space and, hence, they might be
> > > non-continuous.
> > 
> > You may want to take a look at e100_main.c in one of the latest 2.4.x 
> > kernels. There you should be able to see how to deal with 
> > dev->features and the flags NETIF_F_SG for scatter-gather 
> > capabilities and NETIF_F_*_CSUM for checksum offloading capabilities.
> > Zero-copy was added in 2.4.4, and is a combination of the above. Also, 
> > take a look at skbuff.h for MAX_SKB_FRAGS and struct skb_shared_info 
> > and their use in the kernel code.
> 
> In case it wasn't clear, if you _don't_ set those NETIF_* flags, then
> your driver is always passed contiguous data.
> performance penalty.
> 

Hen and Jamie, 

Thanks a lot for your very helpful replies. I took a look at the
places you suggested in order to find out how a driver supporting
scatter-gather should be coded. In the hope that others might find
this useful, I'm sending a rather long description of what I found
out. I hope there are not too many misconceptions, and that someone
will point them out if there are.

** Features of a Networking Driver / Device.

The "net_device" structure (defined in "include/linux/netdevice.h"),
which is filled-in by a net driver at initialization time, includes a
field called "features". By setting certain bits in this field the
driver can inform the networking stack of its capabilities. As of
2.4.20 the following feature masks are defined (in
"include/linux/netdevice.h"), and can be declared by the driver:

    NETIF_F_SG
        Scatter/gather IO.

    NETIF_F_IP_CSUM
        Can checksum only TCP/UDP over IPv4.

    NETIF_F_NO_CSUM
        Does not require checksum, e.g. loopback.

    NETIF_F_HW_CSUM
        Can checksum all the packets.

    NETIF_F_DYNALLOC
        Self-destructible device.

    NETIF_F_HIGHDMA
        Can DMA to high memory.

    NETIF_F_FRAGLIST   <------------------- ??? WHAT IS THIS ???
        Scatter/gather IO.

    NETIF_F_HW_VLAN_TX
        Transmit VLAN hw acceleration

    NETIF_F_HW_VLAN_RX
        Receive VLAN hw acceleration

    NETIF_F_HW_VLAN_FILTER
        Receive filtering on VLAN

    NETIF_F_VLAN_CHALLENGED
        Device cannot handle VLAN packets
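
As an illustration of how a driver might declare such bits, here is a
minimal user-space sketch. It is not kernel code: "mock_net_device",
"mock_probe()" and "mock_can_sg()" are hypothetical names made up for
this example, and the mask values are stand-ins (a real driver takes
them from "include/linux/netdevice.h").

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in mask values for illustration only; a real driver uses the
 * definitions from include/linux/netdevice.h. */
#define NETIF_F_SG       0x01
#define NETIF_F_IP_CSUM  0x02

/* Hypothetical, stripped-down stand-in for struct net_device. */
struct mock_net_device {
        int features;
};

/* Hypothetical init hook: advertise only what the hardware can do. */
void mock_probe(struct mock_net_device *dev, int hw_has_sg, int hw_has_csum)
{
        assert(dev != NULL);
        dev->features = 0;
        if (hw_has_sg)
                dev->features |= NETIF_F_SG;
        if (hw_has_csum)
                dev->features |= NETIF_F_IP_CSUM;
}

/* Test a single feature bit the same way the stack does. */
int mock_can_sg(const struct mock_net_device *dev)
{
        assert(dev != NULL);
        return (dev->features & NETIF_F_SG) != 0;
}
```

The point of the sketch is only that "features" is a plain bit mask:
bits are OR-ed in at initialization time and tested with "&" later.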

** Scatter-Gather DMA

Among the feature bits shown above, "NETIF_F_SG" is the one the
driver sets to indicate that it can do scatter-gather DMA. If
"NETIF_F_SG" is not set, then the networking stack will make sure that
the "skb"s hold *physically-continuous* data before passing them to
the driver. This is taken care of in "net/core/dev.c:dev_queue_xmit()"
like this:

    if (skb_shinfo(skb)->frag_list &&
        !(dev->features&NETIF_F_FRAGLIST) &&
        skb_linearize(skb, GFP_ATOMIC) != 0) {
            kfree_skb(skb);
            return -ENOMEM;
    }

    /* Fragmented skb is linearized if device does not support SG,
     * or if at least one of fragments is in highmem and device
     * does not support DMA from it.
     */
    if (skb_shinfo(skb)->nr_frags &&
        (!(dev->features&NETIF_F_SG) || illegal_highdma(dev, skb)) &&
        skb_linearize(skb, GFP_ATOMIC) != 0) {
            kfree_skb(skb);
            return -ENOMEM;
    }

As a result, when the "hard_start_xmit()" function of a driver that
has not set "NETIF_F_SG" receives an skb, it knows that the data to be
transmitted start at "skb->data", that their length is "skb->len", and
that they are virtually and physically continuous. The driver can
therefore pass the "skb->data" pointer directly to the device's DMA
controller, after converting it to a bus (DMA) address and
synchronizing the relevant cache entries (by calling something like
"pci_map_single()").
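
The two checks quoted above can be modelled in user space as a single
predicate. This is only a sketch under stated assumptions:
"mock_skb" and "mock_needs_linearize()" are hypothetical names, the
mask values are stand-ins, and the "has_highmem_frag" flag replaces
the per-fragment scan that "illegal_highdma()" performs in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in mask values for illustration only. */
#define NETIF_F_SG        0x01
#define NETIF_F_HIGHDMA   0x20
#define NETIF_F_FRAGLIST  0x40

/* Hypothetical summary of the skb properties the checks look at. */
struct mock_skb {
        unsigned int nr_frags;   /* number of page fragments */
        int has_frag_list;       /* non-zero if frag_list is set */
        int has_highmem_frag;    /* non-zero if any fragment is in highmem */
};

/* Returns 1 if the stack would have to linearize the skb before
 * handing it to a device with the given feature bits, else 0. */
int mock_needs_linearize(int features, const struct mock_skb *skb)
{
        assert(skb != NULL);

        /* First check: chained skbs need NETIF_F_FRAGLIST. */
        if (skb->has_frag_list && !(features & NETIF_F_FRAGLIST))
                return 1;

        /* Second check: page fragments need NETIF_F_SG, and highmem
         * fragments additionally need NETIF_F_HIGHDMA. */
        if (skb->nr_frags) {
                if (!(features & NETIF_F_SG))
                        return 1;
                if (skb->has_highmem_frag && !(features & NETIF_F_HIGHDMA))
                        return 1;
        }
        return 0;
}
```

Note that a driver which sets "NETIF_F_SG" but not "NETIF_F_HIGHDMA"
still gets linear data whenever a fragment lives in high memory.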

If, on the other hand, the driver sets the "NETIF_F_SG" bit in the
"features" field of the "net_device" structure (declaring that it
*can* do scatter-gather DMA), then any skb passed to it might very
well hold data that are not physically continuous (and sometimes not
even virtually continuous). In this case, for every "skb" passed to
the driver, the networking stack also fills in a "skb_shared_info"
structure, defined in "include/linux/skbuff.h", like this:

    struct skb_shared_info {
            atomic_t        dataref;
            unsigned int    nr_frags;
            struct sk_buff  *frag_list;
            skb_frag_t      frags[MAX_SKB_FRAGS];
    };

This structure is pointed to by the "end" field of the "sk_buff"
structure, so it can be accessed by the driver as:

    (struct skb_shared_info *)skb->end

or even better using the macro "skb_shinfo", which is essentially the
same:

    skb_shinfo(skb)

It should be obvious that, in the scatter-gather case, the frame to be
transmitted consists of a sequence of fragments (parts), each of which
holds a virtually and physically continuous subset of the data. The
start of the first fragment is pointed to by "skb->data" (as in the
non-SG case), but its length (in bytes) is "skb->len - skb->data_len"
(which can also be obtained using the helper "skb_headlen()" defined in
"include/linux/skbuff.h"). "skb->len" is still the length of the
*full* frame (the sum of the lengths of all the fragments), and
"skb->data_len" is the total length of all the data fragments, not
counting the first "header" fragment pointed to by "skb->data". So a
way to check whether an skb is *not* physically continuous (i.e. is
non-linear) is to test if "skb->data_len" is non-zero; there is even a
helper for this ("skb_is_nonlinear()") defined in
"include/linux/skbuff.h". After the initial "header" fragment, there
are exactly "skb_shinfo(skb)->nr_frags" fragments following. Each of
these fragments is described by a "skb_frag_t" structure defined (in
"include/linux/skbuff.h") as:

    struct skb_frag_struct
    {
            struct page *page;
            __u16 page_offset;
            __u16 size;
    };

    ...

    typedef struct skb_frag_struct skb_frag_t;

The "skb_frag_struct" structure corresponding to the I'th fragment can
be accessed as:

    skb_shinfo(skb)->frags[I];
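
The relation between "len", "data_len" and the header length can be
sketched in user space as follows. The structure and function names
("mock_skb", "mock_skb_headlen()", "mock_skb_is_nonlinear()") are
hypothetical stand-ins that only mirror the arithmetic described
above; they are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in holding only the two length fields. */
struct mock_skb {
        unsigned int len;       /* length of the full frame */
        unsigned int data_len;  /* bytes held in the page fragments */
};

/* Length of the first ("header") part that starts at skb->data. */
unsigned int mock_skb_headlen(const struct mock_skb *skb)
{
        assert(skb != NULL && skb->data_len <= skb->len);
        return skb->len - skb->data_len;
}

/* Non-zero data_len means some bytes live outside the header part,
 * so the skb is NOT continuous. */
int mock_skb_is_nonlinear(const struct mock_skb *skb)
{
        assert(skb != NULL);
        return skb->data_len != 0;
}
```

For instance, a 1514-byte frame with 1460 bytes in fragments has a
54-byte header part and is non-linear.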

So the data of a non-linear "sk_buff" "skb" consist of the following
parts, which are themselves linear (virtually and physically
continuous):

            addr of                  addr of
  part # :  first byte               last byte
  ------------------------------------------------------------------
  0      :  skb->data          ...  skb->data + (skb->len - skb->data_len) - 1
  1      :  fr_adr(0)          ...  fr_adr(0) + fr_sz(0) - 1
  .
  .
  nfrags :  fr_adr(nfrags - 1) ...  fr_adr(nfrags - 1) + fr_sz(nfrags - 1) - 1

where:

   "nfrags" is "skb_shinfo(skb)->nr_frags"

and

   "fr_adr(i)" is "fr_pg_adr(i) + fr_pg_ofs(i)"
     "fr_pg_adr(i)" is "page_address(skb_shinfo(skb)->frags[i].page)"
     "fr_pg_ofs(i)" is "skb_shinfo(skb)->frags[i].page_offset"
   "fr_sz(i)" is "skb_shinfo(skb)->frags[i].size"

   NOTICE: "fr_adr", "fr_sz", "fr_pg_adr", and "fr_pg_ofs" are just
     symbols introduced for the convenience of our discussion; they
     are not actually defined as macros in the kernel. "page_address",
     on the other hand, is a real macro defined in "include/linux/mm.h".

It also holds that:

  skb->data_len == fr_sz(0) + ... + fr_sz(nfrags - 1)
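
The part table and the identity above can be exercised with a small
user-space sketch. Everything here is a hypothetical model:
"mock_frag", "mock_skb", "mock_seg", "MOCK_MAX_FRAGS" and
"mock_build_gather_list()" are made-up names, and "page_addr" stands
in for the result of "page_address()" on the fragment's page. A real
driver would hand each part to its DMA mapping calls instead of
storing plain pointers.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_MAX_FRAGS 6  /* hypothetical limit for this sketch */

struct mock_frag {
        unsigned char *page_addr;    /* models page_address(frag.page) */
        unsigned short page_offset;
        unsigned short size;
};

struct mock_skb {
        unsigned char *data;
        unsigned int len;            /* full frame length */
        unsigned int data_len;       /* bytes in frags[] */
        unsigned int nr_frags;
        struct mock_frag frags[MOCK_MAX_FRAGS];
};

struct mock_seg {
        const unsigned char *addr;   /* first byte of the part */
        unsigned int len;            /* length of the part in bytes */
};

/* Fill segs[] with part 0 (the header) followed by one entry per
 * fragment. Returns the number of parts, or -1 if the list does not
 * fit or the lengths are inconsistent. */
int mock_build_gather_list(const struct mock_skb *skb,
                           struct mock_seg *segs, unsigned int max_segs)
{
        unsigned int i, n = 0, frag_bytes = 0;

        assert(skb != NULL && segs != NULL);
        if (skb->nr_frags > MOCK_MAX_FRAGS ||
            skb->nr_frags + 1 > max_segs ||
            skb->data_len > skb->len)
                return -1;

        /* Part 0: starts at skb->data, length len - data_len. */
        segs[n].addr = skb->data;
        segs[n].len = skb->len - skb->data_len;
        n++;

        /* Parts 1..nr_frags: fr_adr(i) = page address + page offset. */
        for (i = 0; i < skb->nr_frags; i++) {
                const struct mock_frag *f = &skb->frags[i];

                segs[n].addr = f->page_addr + f->page_offset;
                segs[n].len = f->size;
                frag_bytes += f->size;
                n++;
        }

        /* data_len must equal the sum of the fragment sizes. */
        if (frag_bytes != skb->data_len)
                return -1;
        return (int)n;
}
```

The final check is just the identity stated above, used here as a
sanity test on the model.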

For an example of how these are used in a real driver see
"e100_main.c", and especially the function "e100_prepare_xmit_buff()"
which contains all the details of handling the fragment-sequence.

/npat

-- 
But the delight and pride of Aule is in the deed of making, and in the
thing made, and neither in possession nor in his own mastery;
wherefore he gives and hoards not, and is free from care, passing ever
on to some new work."
  -- J.R.R. Tolkien, Ainulindale (Silmarillion)
