On Sun, Sep 14, 2003 at 02:40:54PM +0100, Jamie Lokier wrote:
> Shmulik Hen wrote:
> > On Sunday 14 September 2003 01:41 am, Nick Patavalis wrote:
> > > this assumption, but I have also heard that "zero-copy" networking
> > > was added to the kernel at some point. Zero-copy indicates that
> > > data come directly for user-space and, hence, they might be
> > > non-continuous.
> >
> > You may want to take a look at e100_main.c in one of the latest 2.4.x
> > kernels. There you should be able to see how to deal with
> > dev->features and the flags NETIF_F_SG for scatter-gather
> > capabilities and NETIF_F_*_CSUM for checksum offloading capabilities.
> > Zero-copy was added in 2.4.4, and is a combination of the above. Also,
> > take a look at skbuff.h for MAX_SKB_FRAGS and struct skb_shared_info
> > and their use in the kernel code.
>
> In case it wasn't clear, if you _don't_ set those NETIF_* flags, then
> your driver is always passed contiguous data.
> performance penalty.
>

Hen and Jamie,

Thanks a lot for your very helpful replies. I took a look at the
places you suggested in order to find out how a driver supporting
scatter-gather should be coded. In the hope that others might find
this useful, I'm sending a rather longish description of what I found
out. I hope that there are not too many misconceptions, or that
someone will point them out if there are.

** Features of a Networking Driver / Device.

The "net_device" structure (defined in "include/linux/netdevice.h"),
which is filled in by a net driver at initialization time, includes a
field called "features". By setting certain bits in this field the
driver can inform the networking stack of its capabilities. As of
2.4.20 the following feature masks are defined (in
"include/linux/netdevice.h") and can be declared by the driver:

    NETIF_F_SG               Scatter/gather IO.
    NETIF_F_IP_CSUM          Can checksum only TCP/UDP over IPv4.
    NETIF_F_NO_CSUM          Does not require checksum. E.g. loopback.
    NETIF_F_HW_CSUM          Can checksum all the packets.
    NETIF_F_DYNALLOC         Self-destructing device.
    NETIF_F_HIGHDMA          Can DMA to high memory.
    NETIF_F_FRAGLIST         Scatter/gather IO.
                             <--- ??? WHAT IS THIS ??? (Same comment as
                             NETIF_F_SG; judging from the code quoted
                             below it seems to concern skbs chained
                             through "frag_list".)
    NETIF_F_HW_VLAN_TX       Transmit VLAN hw acceleration
    NETIF_F_HW_VLAN_RX       Receive VLAN hw acceleration
    NETIF_F_HW_VLAN_FILTER   Receive filtering on VLAN
    NETIF_F_VLAN_CHALLENGED  Device cannot handle VLAN packets

** Scatter-Gather DMA

Among the feature bits shown above, "NETIF_F_SG" is the one the driver
sets to indicate that it can do scatter-gather DMA. If "NETIF_F_SG" is
not set (and "NETIF_F_FRAGLIST" is not set either), then the
networking stack will make sure that the "skb"s hold
*physically-contiguous* data before passing them to the driver. This
is taken care of in "net/core/dev.c:dev_queue_xmit()" like this:

    if (skb_shinfo(skb)->frag_list &&
        !(dev->features&NETIF_F_FRAGLIST) &&
        skb_linearize(skb, GFP_ATOMIC) != 0) {
            kfree_skb(skb);
            return -ENOMEM;
    }

    /* Fragmented skb is linearized if device does not support SG,
     * or if at least one of fragments is in highmem and device
     * does not support DMA from it.
     */
    if (skb_shinfo(skb)->nr_frags &&
        (!(dev->features&NETIF_F_SG) || illegal_highdma(dev, skb)) &&
        skb_linearize(skb, GFP_ATOMIC) != 0) {
            kfree_skb(skb);
            return -ENOMEM;
    }

As a result, when the "hard_start_xmit()" function of such a driver
receives an skb, it knows that the data to be transmitted start at
"skb->data", that their length is "skb->len", and that they are
virtually and physically contiguous. The driver can therefore pass the
"skb->data" pointer directly to the device's DMA controller, after
converting it to a physical address and synchronizing the relevant
cache entries (by calling something like "pci_map_single()").
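To make the two checks quoted above easier to follow, here is a small
user-space model of the decision they encode. It is only a sketch, not
kernel code: "struct mock_skb", the "mock_*" functions and the
"MOCK_NETIF_F_*" bit values are hypothetical stand-ins invented for
this example (a real driver uses the definitions in
"include/linux/netdevice.h"), and the highmem test is reduced to a
single boolean on the assumption that "illegal_highdma()" only
complains when the device lacks "NETIF_F_HIGHDMA" and some fragment
page lives in high memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative feature bits for this mock only; the real values come
 * from include/linux/netdevice.h. */
#define MOCK_NETIF_F_SG        0x01u
#define MOCK_NETIF_F_HIGHDMA   0x20u
#define MOCK_NETIF_F_FRAGLIST  0x40u

/* Hypothetical stand-in for the few skb properties the checks read. */
struct mock_skb {
    unsigned int nr_frags;   /* models skb_shinfo(skb)->nr_frags        */
    bool has_frag_list;      /* models skb_shinfo(skb)->frag_list != 0  */
    bool has_highmem_frag;   /* some fragment page is in high memory    */
};

/* Assumed behaviour of illegal_highdma(): a highmem fragment is only a
 * problem if the device cannot DMA from high memory. */
static bool mock_illegal_highdma(unsigned int features,
                                 const struct mock_skb *skb)
{
    return skb->has_highmem_frag && !(features & MOCK_NETIF_F_HIGHDMA);
}

/* Returns true when the stack would have to linearize the skb before
 * handing it to a device with the given feature bits. */
static bool mock_must_linearize(unsigned int features,
                                const struct mock_skb *skb)
{
    assert(skb != NULL);

    /* First check: chained skbs need NETIF_F_FRAGLIST. */
    if (skb->has_frag_list && !(features & MOCK_NETIF_F_FRAGLIST))
        return true;

    /* Second check: page fragments need NETIF_F_SG, and every
     * fragment must be reachable by the device's DMA engine. */
    if (skb->nr_frags &&
        (!(features & MOCK_NETIF_F_SG) ||
         mock_illegal_highdma(features, skb)))
        return true;

    return false;
}
```

With this model, a device that declares no features never sees
fragments (the function returns true for any skb that has them), while
a device that declares only the SG bit still gets a linearized skb
whenever one of the fragments sits in high memory.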

If, on the other hand, the driver sets the "NETIF_F_SG" bit in the
"features" field of the "net_device" structure (declaring that it
*can* do scatter-gather DMA), then any skb passed to it might very
well hold data that are not physically contiguous (and sometimes not
even virtually contiguous). In this case, for every "skb" passed to
the driver the networking stack also fills in a "skb_shared_info"
structure, defined in "include/linux/skbuff.h" like this:

    struct skb_shared_info {
            atomic_t        dataref;
            unsigned int    nr_frags;
            struct sk_buff  *frag_list;
            skb_frag_t      frags[MAX_SKB_FRAGS];
    };

This structure is pointed to by the "end" field of the "sk_buff"
structure, so it can be accessed by the driver as:

    (struct skb_shared_info *)skb->end

or, even better, using the macro "skb_shinfo", which is essentially
the same:

    skb_shinfo(skb)

It should be obvious that, in the scatter-gather case, the frame to be
transmitted consists of a sequence of fragments (parts), each of which
keeps a virtually and physically contiguous subset of the data. The
start of the first fragment is pointed to by "skb->data" (as in the
non-SG case), but its length (in bytes) is "skb->len - skb->data_len"
(which can also be obtained using the macro "skb_headlen()" defined in
"include/linux/skbuff.h"). "skb->len" is still the length of the
*full* frame (the sum of the lengths of all the fragments), and
"skb->data_len" is the total length of all the data fragments, not
counting the first "header" fragment pointed to by "skb->data".
Actually, a way to check whether an skb is non-linear (that is, *not*
contiguous) is to test if "skb->data_len" is non-zero; there is even a
macro for this ("skb_is_nonlinear()") defined in
"include/linux/skbuff.h".

After the initial "header" fragment, there are exactly
"skb_shinfo(skb)->nr_frags" fragments following. Each of these
fragments is described by a "skb_frag_t" structure defined (in
"include/linux/skbuff.h") as:

    struct skb_frag_struct {
            struct page *page;
            __u16 page_offset;
            __u16 size;
    };
    ...
    typedef struct skb_frag_struct skb_frag_t;

The "skb_frag_struct" structure corresponding to the I'th fragment can
be accessed as:

    skb_shinfo(skb)->frags[I];

So the data of a non-linear "sk_buff" "skb" consist of the following
parts, which are themselves linear (virtually and physically
contiguous):

    part #   addr of first byte   addr of last byte
    ----------------------------------------------------------------------
    0        skb->data            skb->data + (skb->len - skb->data_len) - 1
    1        fr_adr(0)            fr_adr(0) + fr_sz(0) - 1
    .
    .
    nfrags   fr_adr(nfrags - 1)   fr_adr(nfrags - 1) + fr_sz(nfrags - 1) - 1

where:

    "nfrags"       is "skb_shinfo(skb)->nr_frags"
    "fr_adr(i)"    is "fr_pg_adr(i) + fr_pg_ofs(i)"
    "fr_pg_adr(i)" is "page_address(skb_shinfo(skb)->frags[i].page)"
    "fr_pg_ofs(i)" is "skb_shinfo(skb)->frags[i].page_offset"
    "fr_sz(i)"     is "skb_shinfo(skb)->frags[i].size"

NOTICE: "fr_adr", "fr_sz", "fr_pg_adr", and "fr_pg_ofs" are just
notation introduced for the convenience of our discussion; they are
not actually defined as macros in the kernel. "page_address", on the
other hand, is a real macro defined in "include/linux/mm.h".

It also holds that:

    skb->data_len == fr_sz(0) + ... + fr_sz(nfrags - 1)

For an example of how these are used in a real driver see
"e100_main.c", and especially the function "e100_prepare_xmit_buff()",
which contains all the details of handling the fragment sequence.

/npat

--
"But the delight and pride of Aule is in the deed of making, and in
the thing made, and neither in possession nor in his own mastery;
wherefore he gives and hoards not, and is free from care, passing
ever on to some new work."
  -- J.R.R. Tolkien, Ainulindale (Silmarillion)