By way of introduction, I work for Network Elements, Inc., and we're working on a highly accelerated NIC architecture for which our first drivers will be for Linux. It is supposed to have 4 gigabit ports and implement complete TCP offloading.

Personally, I'm only mildly in favor of "complete" TCP offloading, but that is the path I've been ordered down.

So, to be clear, I'm going to ask the list about 2 things:

#1:

I want to talk about minimizing the impact on, and getting the most interoperable architecture with, the standard Linux TCP/IP stack.

The main AF_INET socket switch seems simple enough to do with no core modifications, but keeping the ARP and route info up-to-date, not to mention things like packet filtering, gets murky quite quickly.

Have no fear, I have already been told by one Linux dude that he didn't think it was a good idea, so your flames will be in good standing. ;-)

However, the problem is that we see a gigabit card having trouble hitting wire speed even on a fast system, and we were originally contemplating a 10 Gbit card. Even the TCP window copy gets kinda prohibitive.

So we asked how fast you can go when you copy the buffers directly out of user space on a system with I/O like PCI-X (1 GB/s total) or PCI-X 2 (2-4 GB/s total), and on reception copy directly back into user space.

Some tests revealed that it's quite a bit faster. I can provide hard numbers comparing against a few gigabit cards in a few days. Unfortunately I don't have the really fast SysKonnect cards on hand to repeat the testing with; I'll be getting some in about a week or so.

Obvious cons:

  -- Lacks good integration with Linux TCP stack features (such as IP filtering).
  -- Keeping ARP and route info up-to-date is kinda murky.

#2:

TCP (& UDP) processing overhead seems to consist of the following (on some test systems that I have available, running 2.4.18 and using several different gigabit cards):

Transmission:

  -- TCP (not UDP) window copy (one copy at least).
  -- Per-packet header/etc. overhead (segmentation can take care of much of this).
  -- Packet copy to card (checksums can be done in hardware or with the copy).

Reception:

  -- Packet storage prior to reassembly.
  -- Per-packet matching to figure out which stream/port it's on and, for TCP, what position it has in the stream.

As a compromise, I have been mulling over proposing a mechanism that would let us put the TCP window on the card, plus a way to use TCP segmentation with it, to get higher performance while still cooperating with the host stack.

For reception, I have no great ideas yet other than having a similar buffer on the card for the incoming TCP window, and looking into ways to do hardware-assisted classification and reassembly directly into user space without per-packet and extra copy overhead.

--
Erich Stefan Boleyn <erich@uruk.org> http://www.uruk.org/
"Reality is truly stranger than fiction; Probably why fiction is so popular"
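
P.S. To make the reception side a bit more concrete, here is a rough Python sketch of the per-packet matching and in-place reassembly described under #2. Everything in it is an assumption for illustration: the Flow record, classify(), place_segment(), the dict used as a flow table, and the bytearray standing in for a user-space buffer are all hypothetical names, not anything from the Linux stack or from any real card.

```python
from dataclasses import dataclass, field

SEQ_MOD = 1 << 32  # TCP sequence numbers wrap at 2**32


@dataclass
class Flow:
    """Hypothetical per-connection state for one incoming TCP stream."""
    base_seq: int        # sequence number of the first byte of the window
    window: bytearray    # stands in for the receive buffer in user space
    received: list = field(default_factory=list)  # (offset, length) placed so far


def classify(flows, src_ip, src_port, dst_ip, dst_port):
    """Per-packet matching: look the 4-tuple up in the flow table."""
    return flows.get((src_ip, src_port, dst_ip, dst_port))


def place_segment(flow, seq, payload):
    """Copy a segment straight to its position in the stream.

    Returns the offset used, or None if the segment does not fit the window.
    """
    offset = (seq - flow.base_seq) % SEQ_MOD  # position in stream, wrap-safe
    if offset + len(payload) > len(flow.window):
        return None  # outside the window: not placed by this fast path
    flow.window[offset:offset + len(payload)] = payload
    flow.received.append((offset, len(payload)))
    return offset


# Usage: two segments arrive out of order, with the sequence number wrapping.
key = ("10.0.0.1", 40000, "10.0.0.2", 80)
flows = {key: Flow(base_seq=0xFFFFFFF0, window=bytearray(32))}
flow = classify(flows, *key)
place_segment(flow, 0x00000000, b"BBBB")      # later segment first, lands at offset 16
place_segment(flow, 0xFFFFFFF0, b"A" * 16)    # earlier segment, lands at offset 0
```

The modulo on the sequence difference keeps the offset correct across sequence-number wraparound. A segment that does not fit the window is simply reported back with None, which is presumably where the host stack would have to take over.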