From: "Jordi Ros" <jros@ece.uci.edu>
Date: Thu, 17 Oct 2002 23:08:35 -0700

> One of the biggest high tech manufacturers defined to me the "TCP end-node bottleneck" as a $1B business. That in itself may be worth spending some time on research. In fact, the technology already has a name, TOE. Companies working on it? Many, many.

I welcome these companies to spend the money to look into this, and when facts show that their solution can go head to head with what's currently out there, then I'll be convinced.

Until then, I think it's a $1B business purely from a marketing perspective.

Right now bus speeds and networking speeds limit networking processing throughput and latency. And if Moore's law is correct, the cpu will catch up when we hit 10gbit for the _EXTREMELY LIMITED_ amount of processing that is needed in the stack right.

Once you've offloaded the checksumming and the segmentation, as we do right now, there simply isn't much else to do except basic socket management and process wakeup. The card isn't going to eliminate things like the process wakeup.

> > * At 1Gb/10Gb speeds, you must overcome problems like PCI bus
> > throughput. This problem exists completely independent of where the TCP
> > stack is.
>
> Actually that is one of the strong arguments of the offload approach. With a tcp running in the kernel, there is a pci bus access on a per-packet basis. In an offload tcp architecture, there is a pci bus access on a per-buffer-of-data basis.

What you might not understand is that with TCP segmentation offload, which we support, you effectively get EXACTLY this. Only one set of headers goes out over the bus for a 64K chunk of data.

This is old hat, nothing new, and nothing that requires TOE.

> What is the difference between a piece of code running on a board and a kernel running in a cpu which is only one pci bus away from the board?

Because the vendor has the code and I on the other hand have the code to the Linux kernel TCP implementation.
You think these companies in this "$1B industry" are going to publish their fancy TOE firmware publicly so I can fix bugs in it? I really doubt that.

Nobody, and I mean not one, of these TOE folks has approached me and said "and we'll GPL our TOE firmware etc. of course". All of them want to do binary-only firmware.

I fully support experiments doing TOE with complete GPL'd implementations, including the firmware.

> Let's see, how much is the size of a socket? i forgot, let me run the code, be right back... in freebsd the size is 208 bytes. Now, how many connections are we looking at? let's say 10,000 (it is in fact less). This is about 2MB of RAM, not an issue for today's technology. One may also add the retx queue if we are dealing with TCP. Let's say worst case the retx queue is full, 32 KB (default). Then you need 320 MB, not a problem either, we are talking about 10,000 connections, a server of such dimensions should have at least a few GB of RAM. Why not putting just 500MB on board?

You've forgotten details like the hash table for fast socket demux. You need a lot more memory than what you describe.

Also, the fact still remains that the "logic" of packet loss handling we have in the Linux kernel right now will be lost when we go to some vendor's proprietary TCP implementation.

In fact that's half the damn value of Linux's TCP, our superior packet loss/reordering detection written by Alexey. I've had engineers who have worked on TCP stacks for 10+ years email me privately saying "wow I have to admit that is hot stuff."

And I want to reiterate again, where are these wonderful TOE cards being used to produce specweb99 numbers competitive with normal currently supported offloading mechanisms?

I do not see it. And I predict you will not see it.
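The memory arithmetic traded in the exchange above can be checked with a short sketch. The connection count, the 208-byte socket, and the 32 KB retransmit queue come from the quoted mail; 1 KB is treated as 1000 bytes so the totals match the quoted 2 MB and 320 MB. The demux hash table term is only a placeholder: the mail gives no size for it, so `hash_buckets` and `bucket_bytes` are invented values that show where the extra term goes, not an estimate.

```python
def on_card_memory(connections, socket_bytes, retx_bytes,
                   hash_buckets=0, bucket_bytes=8):
    """Rough per-card memory budget in bytes, broken down by component."""
    sockets = connections * socket_bytes   # per-connection socket state
    retx = connections * retx_bytes        # worst case: every retx queue full
    demux = hash_buckets * bucket_bytes    # hypothetical socket demux hash table
    return {
        "sockets": sockets,
        "retx": retx,
        "demux": demux,
        "total": sockets + retx + demux,
    }


# Figures from the quoted mail; the hash table size is an invented placeholder.
estimate = on_card_memory(10_000, 208, 32_000, hash_buckets=65_536)
for name, nbytes in estimate.items():
    print(f"{name:8s} {nbytes / 1e6:10.2f} MB")
```

Changing `hash_buckets`, or adding further terms for state the card would also have to hold, shows how quickly the budget moves away from the two-term estimate.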

> Some people that are very intimate with tcp and that are engineers, not marketing people (by that i mean people driven by engineering passions, not market ones), have arrived at the conclusion that a general-purpose environment is not the best place for tcp to live. As the ietf moves forward with the vision of the sand clock layering (the layers in the middle, ie. tcp/ip, become more static and they are candidates to be moved to silicon), offloading tcp may be a solution for the edge-node problem. Just like math coprocessors have been designed in the past, or graphics acceleration cards were created closer to the main cpu, you may think now of the concept of network coprocessor. The goal, to improve the number of bits per cycle. Why? to have a more cost effective system.

What do you think we're doing right now with TCP segmentation and checksum offloading? We're eliminating the truly CPU intensive portions of socket I/O handling.

Just like a block I/O layer merely submits 'I/O tags' plus data pointers to storage devices, TCP is merely providing a header template (which acts as the I/O tag) and a data pointer and telling the card "have at it".

It can be done on the receive side too. People have even implemented cards that do clever enough receive buffer processing that coalesces streams of receive packets to the same flow into contiguous page aligned buffers that may be flipped directly into the filesystem or the user address space. All of this without TOE.

I've actually seen some troubling mails that say the people who are working on things like the receive packet flow data coalescing might be told to stop working on that technology specifically because it shows how unnecessary TOE really is.

So make no mistake that there are people in the hardware side of this who actually side with us and want to work on what we believe is proper offloading, but can't because the people who make the decisions are telling them to do otherwise.
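To put a number on the header-template point, here is a minimal sketch that counts how many sets of headers the host has to push across the bus for one chunk of data, with and without segmentation offload. The 64K chunk is the figure used in this thread; the MSS of 1460 bytes and the 54-byte header size (Ethernet + IPv4 + TCP, no options) are assumptions, and `header_sets_over_bus` is a hypothetical helper, not a kernel interface.

```python
import math


def header_sets_over_bus(chunk_bytes, mss, tso):
    """Number of header sets the host sends over the bus for one chunk."""
    if tso:
        # One header template plus a data pointer; the card replicates
        # and fixes up the header for each segment it cuts.
        return 1
    # Without offload the host builds and sends one header per segment.
    return math.ceil(chunk_bytes / mss)


CHUNK = 64 * 1024   # the 64K chunk mentioned in the thread
MSS = 1460          # assumption: common Ethernet MSS
HDR = 54            # assumption: Ethernet + IPv4 + TCP headers, no options

for tso in (False, True):
    n = header_sets_over_bus(CHUNK, MSS, tso)
    print(f"tso={tso}: {n} header set(s), {n * HDR} header bytes over the bus")
```

Under these assumptions the payload crosses the bus once in both cases; only the per-segment header traffic and the host-side work of building those headers differ.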

So where is the cpu time saving the TOE gives us? All I've seen from your statements is basically TCP segmentation offload, and we fully support this already, it's old hat in fact. I've shown that it can be done on the receive side too. We're not checksumming anything either, so where is the cpu intensive part in all this?

And note that we haven't even begun to discuss the _costs_ of TOE, the negative bits. Such as:

1) Socket identity information has to be transferred to/from kernel to the card, either via DMA or PIO. This can be expensive if the machine is handling lots of fast, short-lived connections, for example a transactional system such as database queries.

2) Once the stack is in the card, we lose control over things like interrupt mitigation. If the hardware guys internally in their firmware don't do something like Linux's NAPI sw based interrupt mitigation, we have no way to apply this technique to their cards.

Nobody is exploring these kinds of avenues when they discuss this stuff.

> These are the technologies that can actually prove the value of the open source

Actually it's a wonderful opportunity to compromise the value of open source, by pushing the important parts of the TCP technology into proprietary firmware.

No, let's not go there thanks.

I'd believe that TOE could be "marketed" as necessary to consumers, sure. But whether it is actually indeed "necessary" is yet another story.

And let's stop talking about it, someone post some reproducible kick-ass specweb99 results that are due to TOE and then we can have a serious discussion. Because everyone is postulating at this point.