RE: Complete TCP offload (or it's likeness)...

"Jordi Ros" <jros@ece.uci.edu> · Sat, 19 Oct 2002 00:03:32 -0700

> someone post some reproducible
> kick-ass specweb99 results that are due to TOE and then we can
> have a serious discussion.

I agree with you, let's wait a few months. If we have not seen any true TOE
yet, it is because people are still working on it seriously in a few garages.
Technologies don't show up overnight.

>>   Actually that is one of the strong arguments of the offload approach.
>>   With a tcp running in the kernel, there is a pci bus access in a per
>>   packet basis. In an offload tcp architecture, there is a pci bus access
>>   in a per buffer of data basis.

> What you might not understand is that with TCP segmentation offload,
> which we support, you effectively get EXACTLY this.
> Only one set of headers go out over the bus for a 64K chunk of data.
> This is old hat, nothing new, and nothing that requires TOE.

The burden is with the control packets. In the case of TCP, you still need
one ACK for every two MTU-sized segments of that 64 KB chunk. You could try
to buffer and coalesce them in the HW, but then you change the dynamics of
the system, because you have to add timers in the HW for the case of packet
loss. That would change your RTT computations and upset the flow control.
Again, you can probably get around that too... but the picture is bigger
than what we have been talking about so far. Let me give some insight into
this bigger picture using one of the statements previously posted:
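
As a rough check on the ACK arithmetic above, here is a minimal sketch of how
many data segments and returning ACKs one 64 KB chunk implies. The 1460-byte
MSS (1500-byte Ethernet MTU minus 40 bytes of IPv4 and TCP headers, no TCP
options) and the function names are assumptions made for illustration; they
are not taken from the thread.

```python
import math

CHUNK_BYTES = 64 * 1024  # the 64 KB chunk handed down with segmentation offload
MSS_BYTES = 1460         # assumed: 1500-byte MTU minus 40 bytes of IPv4 + TCP headers
SEGMENTS_PER_ACK = 2     # "1 ack for every two MTU units of data"


def segments_per_chunk(chunk_bytes=CHUNK_BYTES, mss_bytes=MSS_BYTES):
    """Number of MSS-sized segments needed to carry one chunk."""
    return math.ceil(chunk_bytes / mss_bytes)


def acks_per_chunk(chunk_bytes=CHUNK_BYTES, mss_bytes=MSS_BYTES,
                   segments_per_ack=SEGMENTS_PER_ACK):
    """ACKs coming back for one chunk, counting a trailing odd segment as one ACK."""
    segments = segments_per_chunk(chunk_bytes, mss_bytes)
    return math.ceil(segments / segments_per_ack)


if __name__ == "__main__":
    print("segments per chunk:", segments_per_chunk())
    print("ACKs per chunk:", acks_per_chunk())
```

Under these assumptions, a single header set goes out over the bus per chunk,
but a couple of dozen ACKs still come back and have to be handled somewhere.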

> Right now bus speeds and networking speeds limit networking processing
> throughput and latency.  And if Moore's law is correct, the cpu will
> catch up when we hit 10gbit for the _EXTREMELY LIMITED_ amount of
> processing that is needed in the stack right.
(...)
> Once you've offloaded the checksumming and the segmentation, as we do
> right now, there simply isn't much else to do except basic socket
> management and process wakeup.

In fact, Moore's law also shows that electronics cannot catch up with optics.
In the core of the Internet, you will have routers with MEMS capable of
switching packets without even going to the electronic domain. That can be
done because routing can be achieved by just looking at the color of the
wavelength; it will be an all-optical network. That shift by itself enlarges
the communication pipes by 3 orders of magnitude. So the network is going
optical, yet the server stays electronic. We are not talking about small
percentage improvements in performance; we are talking about the need for a
communication system that is orders of magnitude faster.

With TCP segmentation and checksum offloading you can send 64 KB (if you are
lucky with the congestion window) at about 2000 cycles per chunk. We are
talking about building those 64 KB of data with 10s or 100s of cycles
instead. Then you can have a network card that processes your stack with a
system clock rate one or two orders of magnitude lower than the host CPU's.
These are just a few technical arguments that have been proved in the labs
(and will soon be disclosed in technical papers and at scientific
conferences), but I would also like to talk about the protocol meaning of
offloading TCP, which to me is the real need:

The reason to terminate TCP connections in a network processor is not only
to speed up TCP (which is all we have been talking about so far) but to have
a scalable protocol architecture. Let me give some insight here, on
something that is publicly well known yet has not been properly
communicated. TCP termination is needed if you want to further offload other
things. Example? iSCSI. And please trust me, there are a lot of people on
the storage side asking networking people to provide that solution.
Otherwise, how do you want to serve 10 Gbps iSCSI with all the data running
through the PCI bus and being copied multiple times? Let's think about it
for a sec: it is bulk data, so why should it go through the general purpose
kernel if that data is never touched? With the capability to terminate TCP
connections in the HW, you will be able to handle terabytes of data (that
is what an optical network will transport in the future) without even going
through the PCI bus (zero PCI access). We are talking about solutions that
will be needed long term. TCP segmentation and checksum offload can only
solve current issues; they will not scale.
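
To put the PCI question above into numbers, here is a minimal sketch that
compares the theoretical peak rate of a few PCI variants with the bus traffic
generated when every byte of a 10 Gbps stream crosses the bus more than once.
The bus list, the nominal clock figures and the choice of two crossings per
byte are illustrative assumptions, not figures from the thread, and real
sustained rates are lower than these peaks.

```python
# Theoretical peak rates only. Bus names and figures are illustrative
# assumptions, not numbers from the post: (width in bits, clock in Hz).
BUSES = {
    "PCI 32-bit/33 MHz": (32, 33.33e6),
    "PCI 64-bit/66 MHz": (64, 66.66e6),
    "PCI-X 64-bit/133 MHz": (64, 133.33e6),
}


def peak_gbps(width_bits, clock_hz):
    """Peak bus rate in Gbps: one transfer of width_bits per clock."""
    return width_bits * clock_hz / 1e9


def required_gbps(link_gbps, bus_crossings):
    """Bus traffic when each byte of the stream crosses the bus bus_crossings times."""
    return link_gbps * bus_crossings


if __name__ == "__main__":
    # Hypothetical case: NIC to host memory, then host memory to a storage adapter.
    need = required_gbps(link_gbps=10, bus_crossings=2)
    for name, (width, clock) in BUSES.items():
        print(f"{name}: peak {peak_gbps(width, clock):.2f} Gbps, needed {need} Gbps")
```

Even the fastest bus in this list stays below the 10 Gbps line rate at its
theoretical peak, before counting any second crossing.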

> I welcome these companies to spend the money to look into this,
> and when facts show that their solution can go head to head with
> what's currently out there, then I'll be convinced.
> Nobody, and mean not one, of these TOE folks have approached me and
> said "and we'll GPL our TOE firmware etc. of course".  All of them
> want to do binary-only firmware.

I hope that all of us here can understand the needs of future communications.
There are already garage solutions that have been shown to work and that
improve the bits per cycle 10 times. Within a year, you will see products
that can deliver up to 100 bits per cycle, much more than today's 1 bit per
cycle. But we do need the help of the community for an optimal product
(interoperability).
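
The bits-per-cycle figures above translate directly into the clock rate a
stack processor needs to keep up with a given link. The sketch below assumes
a 10 Gbps link and a simple model where required clock = line rate / bits
per cycle; the function name is hypothetical and chosen only for this
illustration.

```python
def required_clock_hz(line_rate_bps, bits_per_cycle):
    """Clock rate needed to sustain line_rate_bps at a given bits-per-cycle efficiency."""
    return line_rate_bps / bits_per_cycle


if __name__ == "__main__":
    line_rate = 10e9  # assumed 10 Gbps link
    for bits in (1, 10, 100):
        mhz = required_clock_hz(line_rate, bits) / 1e6
        print(f"{bits:>3} bits/cycle -> {mhz:,.0f} MHz")
```

In this model, going from 1 to 100 bits per cycle lowers the required clock
by two orders of magnitude, which is the range mentioned earlier for a
network card clocked well below the host CPU.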

If we don't get that help, though, people will do it anyway, because of the
community need and because of our engineering passion. The nice thing is
that open source allows people to scratch their heads and come up with
better solutions. In fact, I have worked with the Linux kernel and I myself
would like to see Linux as a leader in future communication architectures,
since open source is the way to achieve optimal solutions.

My team of engineers and I are open to discussing further how we can all
work together to do better on this particular bottleneck. But first we all
have to be convinced of the vision of the world that we would like to craft.
We have studied this very seriously for a long time, bringing together the
best talent from the TCP/IP and storage sides, because we really believe
something more is needed. We have been working in the labs, and the time is
now coming for us to share our results with the community.

Jordi
