Re: Complete TCP offload (or it's likeness)...

"David S. Miller" <davem@redhat.com> · Thu, 17 Oct 2002 23:39:46 -0700 (PDT)

   From: "Jordi Ros" <jros@ece.uci.edu>
   Date: Thu, 17 Oct 2002 23:08:35 -0700

   One of the biggest high tech manufacturers defined to me the "TCP
   end-node bottleneck" as a $1B business. That alone may make it worth
   spending some time on research. In fact, the technology already has
   a name, TOE. Companies working on it? Many, many.

I welcome these companies to spend the money to look into this,
and when facts show that their solution can go head to head with
what's currently out there, then I'll be convinced.

Until then, I think it's a $1B business purely from a marketing
perspective.

Right now bus speeds and networking speeds limit networking processing
throughput and latency.  And if Moore's law is correct, the cpu will
catch up when we hit 10gbit for the _EXTREMELY LIMITED_ amount of
processing that is needed in the stack.

Once you've offloaded the checksumming and the segmentation, as we do
right now, there simply isn't much else to do except basic socket
management and process wakeup.

The card isn't going to eliminate things like the process wakeup.

   > * At 1Gb/10Gb speeds, you must overcome problems like PCI bus
   > throughput.  This problem exists completely independent of where the TCP
   > stack is.

   Actually that is one of the strong arguments of the offload approach. With
   tcp running in the kernel, there is a pci bus access on a per-packet basis.
   In an offload tcp architecture, there is a pci bus access on a per-buffer
   basis.

What you might not understand is that with TCP segmentation offload,
which we support, you effectively get EXACTLY this.

Only one set of headers goes out over the bus for a 64K chunk of data.

This is old hat, nothing new, and nothing that requires TOE.
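
To put rough numbers on that, here is a minimal Python sketch of the
header traffic crossing the bus for one 64K send, with and without
segmentation offload.  The 1500-byte MTU and the option-free
14/20/20-byte Ethernet/IP/TCP headers are assumptions for
illustration, not figures from this thread.

```python
# Rough sketch: header bytes crossing the bus for one 64K send,
# with and without TCP segmentation offload (TSO).
# Assumed values (not from the original mail): 1500-byte MTU,
# 14-byte Ethernet + 20-byte IP + 20-byte TCP headers, no options.

ETH_HDR = 14
IP_HDR = 20
TCP_HDR = 20
MTU = 1500

HEADERS = ETH_HDR + IP_HDR + TCP_HDR   # bytes in one full header set
MSS = MTU - IP_HDR - TCP_HDR           # payload bytes per segment


def header_sets_per_chunk(chunk_bytes, tso):
    """Number of header sets the host pushes over the bus for one chunk."""
    if tso:
        return 1                       # one template, the card replicates it
    return -(-chunk_bytes // MSS)      # ceiling division: one set per packet


def header_bytes_on_bus(chunk_bytes, tso):
    """Total header bytes the host sends to the card for one chunk."""
    return header_sets_per_chunk(chunk_bytes, tso) * HEADERS


CHUNK = 64 * 1024
no_tso = header_bytes_on_bus(CHUNK, tso=False)
with_tso = header_bytes_on_bus(CHUNK, tso=True)
print(f"without TSO: {no_tso} header bytes, with TSO: {with_tso} header bytes")
```

Under these assumptions that is 45 header sets (2430 bytes) without
offload versus a single set (54 bytes) with it.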

   What is the difference between a piece of code running on a board and a
   kernel running in a cpu which is only one pci bus away from the
   board?

Because the vendor has the code and I on the other hand have the code
to the Linux kernel TCP implementation.

You think these companies in this "$1B industry" are going to publish
their fancy TOE firmware publicly so I can fix bugs in it?  I really
doubt that.

Nobody, and I mean not one, of these TOE folks has approached me and
said "and we'll GPL our TOE firmware etc. of course".  All of them
want to do binary-only firmware.

I fully support experiments doing TOE with complete GPL'd
implementations, including the firmware.

   Let's see, how much is the size of a socket? i forgot, let me run the code,
   be right back... in freebsd the size is 208 bytes. Now, how many connections
   are we looking at? let's say 10,000 (it is in fact less). This is about 2MB
   of RAM, not an issue for today's technology. One may also add the retx queue
   if we are dealing with TCP. Let's say worst case the retx queue is full, 32
   KB (default). Then you need 320 MB, not a problem either, we are talking
   about 10,000 connections, a server of such dimensions should have at least a
   few GB of RAM. Why not just put 500MB on board?

You've forgotten details like the hash table for fast socket demux.
You need a lot more memory than what you describe.
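
For reference, the quoted arithmetic can be written out as a small
Python sketch.  It uses only the figures quoted above (208 bytes per
socket, 10,000 connections, a full 32 KB retransmit queue each).  The
extra_per_conn parameter is a hypothetical placeholder for the state
the estimate leaves out, such as the demux hash table; no value for
it is given in this thread, so it defaults to zero.

```python
# Sketch of the on-card memory estimate quoted above.
# SOCKET_BYTES, CONNECTIONS and RETX_QUEUE_BYTES come from the quoted mail.
# extra_per_conn is a hypothetical knob for state the estimate omits
# (demux hash table, etc.); its real value is not given in the thread.

SOCKET_BYTES = 208              # per-socket size quoted for FreeBSD
CONNECTIONS = 10_000
RETX_QUEUE_BYTES = 32 * 1024    # 32 KB retransmit queue, worst case full


def on_card_memory(connections, extra_per_conn=0):
    """Return (socket bytes, retransmit queue bytes, total bytes)."""
    socket_mem = connections * SOCKET_BYTES
    retx_mem = connections * RETX_QUEUE_BYTES
    extra = connections * extra_per_conn
    return socket_mem, retx_mem, socket_mem + retx_mem + extra


sock, retx, total = on_card_memory(CONNECTIONS)
print(f"sockets:     {sock / 1e6:.2f} MB")
print(f"retx queues: {retx / 1e6:.2f} MB")
print(f"total:       {total / 1e6:.2f} MB (before any omitted state)")
```

Any nonzero extra_per_conn pushes the total above the roughly 330 MB
that the quoted figures alone account for.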

Also, the fact still remains that the "logic" of packet loss handling
we have in the Linux kernel right now will be lost when we go to some
vendor's proprietary TCP implementation.

In fact that's half the damn value of Linux's TCP, our superior packet
loss/reordering detection written by Alexey.  I've had engineers who
have worked on TCP stacks for 10+ years email me privately saying "wow
I have to admit that is hot stuff."

And I want to reiterate again, where are these wonderful TOE cards
being used to produce specweb99 numbers competitive with normal
currently supported offloading mechanisms?  I do not see it.  And
I predict you will not see it.

   Some people that are very intimate with tcp and that are engineers, not
   marketing people (by that i mean people driven by engineering passions, not
   market ones), have arrived at the conclusion that a general-purpose
   environment is not the best place for tcp to live. As the ietf moves forward
   with the vision of the sand clock layering (the layers in the middle, ie.
   tcp/ip, become more static and they are candidates to be moved to silicon),
   offloading tcp may be a solution for the edge-node problem. Just like math
   coprocessors have been designed in the past, or graphics acceleration cards
   were created closer to the main cpu, you may think now of the concept of
   network coprocessor. The goal, to improve the number of bits per cycle. Why?
   to have a more cost effective system.

What do you think we're doing right now with TCP segmentation and
checksum offloading?  We're eliminating the truly CPU intensive
portions of socket I/O handling.

Just like a block I/O layer merely submits 'I/O tags' plus data
pointers to storage devices, TCP is merely providing a header
template (which acts as the I/O tag) and a data pointer and telling
the card "have at it".
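
A minimal Python sketch of that division of labor follows.  The
HeaderTemplate fields and the segment() helper are hypothetical names
made up for illustration; a real card works on raw header bytes and
also fixes up lengths, flags and checksums, which this leaves out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HeaderTemplate:
    """Hypothetical stand-in for the header template handed to the card."""
    src_port: int
    dst_port: int
    seq: int            # sequence number of the first payload byte


def segment(template, data, mss):
    """Slice one large buffer into MSS-sized packets, card-style.

    Each packet gets a copy of the template with only the sequence
    number advanced; the host never builds the per-packet headers.
    """
    for offset in range(0, len(data), mss):
        payload = data[offset:offset + mss]
        header = replace(template, seq=(template.seq + offset) % 2**32)
        yield header, payload


# The host submits one template plus one data pointer ...
template = HeaderTemplate(src_port=80, dst_port=40000, seq=1000)
segments = list(segment(template, bytes(4000), mss=1460))

# ... and the "card" emits the individual packets.
for header, payload in segments:
    print(header.seq, len(payload))
```

The host-side cost is the same whether the buffer is 4000 bytes or
64K: one template and one pointer.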

It can also be done on the receive side too.  People have even
implemented cards that do clever enough receive buffer processing
that coalesces streams of receive packets to the same flow into
contiguous page aligned buffers that may be flipped directly into
the filesystem or the user address space.
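
As a rough sketch of what such receive coalescing amounts to, in
Python: in-order payloads for the same flow are appended to one
buffer, and full pages are handed up as they fill.  The 4096-byte
page size, the (flow, seq, payload) tuple layout and the coalesce()
name are assumptions for illustration; sequence number wraparound and
reordering recovery are ignored.

```python
PAGE_SIZE = 4096    # assumed page size


def coalesce(packets):
    """Merge in-order payloads per flow into page-sized buffers.

    packets: iterable of (flow_id, seq, payload) tuples.
    Returns a list of (flow_id, start_seq, data), len(data) <= PAGE_SIZE.
    """
    pending = {}    # flow_id -> (start_seq, bytearray of contiguous data)
    out = []
    for flow, seq, payload in packets:
        start, buf = pending.get(flow, (seq, bytearray()))
        if start + len(buf) != seq:
            # Gap or out-of-order packet: flush and restart at this seq.
            if buf:
                out.append((flow, start, bytes(buf)))
            start, buf = seq, bytearray()
        buf += payload
        while len(buf) >= PAGE_SIZE:
            # A full page is ready to be flipped to its consumer.
            out.append((flow, start, bytes(buf[:PAGE_SIZE])))
            start += PAGE_SIZE
            del buf[:PAGE_SIZE]
        pending[flow] = (start, buf)
    # Flush whatever partial buffers remain.
    for flow, (start, buf) in pending.items():
        if buf:
            out.append((flow, start, bytes(buf)))
    return out


packets = [
    ("A", 0, bytes(2048)),
    ("B", 100, bytes(1000)),
    ("A", 2048, bytes(2048)),
    ("A", 4096, bytes(100)),
]
for flow, start, data in coalesce(packets):
    print(flow, start, len(data))
```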

All of this without TOE.

I've actually seen some troubling mails that say the people who are
working on things like the receive packet flow data coalescing might
be told to stop working on that technology specifically because it
shows how unnecessary TOE really is.

So make no mistake that there are people on the hardware side of this
who actually side with us and want to work on what we believe is
proper offloading, but can't because the people who make the decisions
are telling them to do otherwise.

So where is the cpu time saving that TOE gives us?  All I've seen from
your statements is basically TCP segmentation offload, and we fully
support this already, it's old hat in fact.  I've shown that it can
be done on the receive side too.  We're not checksumming anything
either, so where is the cpu intensive part in all this?

And note that we haven't even begun to discuss the _costs_ of TOE,
the negative bits.  Such as:

1) Socket identity information has to be transferred between the
   kernel and the card, either via DMA or PIO.  This can be
   expensive if the machine is handling lots of fast, short-lived
   connections, for example a transactional system such
   as database queries.

2) Once the stack is in the card, we lose control over things like
   interrupt mitigation.  If the hardware guys internally in their
   firmware don't do something like Linux's NAPI sw based interrupt
   mitigation, we have no way to apply this technique to their cards.
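
To make point 2 concrete, here is a toy Python model of software
interrupt mitigation in the spirit of NAPI: the first packet raises
an interrupt, further receive interrupts are masked, and the host
polls the ring in bounded batches until it is drained.  The PolledNic
class, its fields and the budget of 64 are hypothetical; this is not
the kernel's actual interface.

```python
from collections import deque


class PolledNic:
    """Toy model of interrupt-then-poll receive processing."""

    def __init__(self, budget=64):
        self.rx_ring = deque()
        self.irq_enabled = True
        self.interrupts = 0
        self.budget = budget        # max packets handled per poll pass
        self.delivered = []

    def packet_arrives(self, pkt):
        self.rx_ring.append(pkt)
        if self.irq_enabled:
            self.interrupts += 1
            self.irq_enabled = False    # mask rx interrupts, switch to polling

    def poll(self):
        """One poll pass: process up to `budget` packets from the ring."""
        done = 0
        while self.rx_ring and done < self.budget:
            self.delivered.append(self.rx_ring.popleft())
            done += 1
        if not self.rx_ring:
            self.irq_enabled = True     # ring drained: unmask interrupts
        return done


# A burst of 1000 packets lands before the host gets to poll.
nic = PolledNic(budget=64)
for i in range(1000):
    nic.packet_arrives(i)
while nic.rx_ring:
    nic.poll()
print(f"{len(nic.delivered)} packets, {nic.interrupts} interrupt(s)")
```

In this model the burst costs one interrupt instead of a thousand.
That policy lives in host software, which is exactly what is no
longer under our control once the stack moves into firmware.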

Nobody is exploring these kinds of avenues when they discuss this
stuff.

   These are the technologies that can actually
   prove the value of the open source

Actually it's a wonderful opportunity to compromise the value
of open source, by pushing the important parts of the TCP
technology into proprietary firmware.

No, let's not go there thanks.

I'd believe that TOE could be "marketed" as necessary to consumers,
sure.  But whether it is actually "necessary" is another story.

And let's stop talking about it, someone post some reproducible
kick-ass specweb99 results that are due to TOE and then we can
have a serious discussion.  Because everyone is postulating at
this point.