On Fri, 2006-07-07 at 10:00 +0200, Patrick McHardy wrote: > Russell Stuart wrote: > > Unfortunately you do things in the wrong order for ATM. > > See: http://mailman.ds9a.nl/pipermail/lartc/2006q1/018314.html > > for an overview of the problem, and then the attached email for > > a detailed description of how the current patch addresses it. > > It is a trivial fix. > > Actually that was the part I didn't understand, you keep talking > (also in that comment in tc_core.c) about an "unknown overhead". > What is that and why would it be unknown? The mail you attached > is quite long, is there an simple example that shows what you > mean? The "unknown overhead" is just the overhead passed to tc using the "tc ... overhead xxx" option. It is probably what you intended to put into your addend attribute. It is "unknown" because the kernel currently doesn't use it. It is passed in the tc_ratespec, but is ignored by the kernel as are most fields in there. The easy way to fix the "ATM" problem described in the big comment is simply to add the "overhead" to the packet length before doing the RTAB lookup. (Identical comments apply to STAB). If you don't accept this or understand why, then go read the "long emails" which attempt to explain it in detail. Jesper's initial version of the patch did just that, BTW. However if you do that then you have to adjust RTAB for all cases (not just ATM) to reflect that the kernel is now adding the overhead. Thus the RTAB tc sends to the kernel now changes for different kernel versions, making modern versions of tc incompatible with older kernels, and visa versa. I didn't consider that acceptable. My solution to this to give the kernel the old format RTAB (ie the one that assumed the kernel didn't add the overhead) and a small adjustment. This small adjustment is called cell_align in the ATM patch. You do the same thing with cell_align as the previous solution did with the overhead - ie add it in just before looking up RTAB. This is in effect all the kernel part of the ATM patch does - make the kernel accept the cell_align option, and add it to skb->len before looking up RTAB. The difference between cell_align and overhead is that cell_align is always 0 when there is no packetisation, and even when non zero it is small (less than 1<<cell_log, ie less than 8 for typical MTU's). So for anything bar ATM it is zero which means old kernels are completely unaffected, and even for ATM not sending it produces a small error which means older kernels still benefit from the "ATM" user space patch. This makes the proposed "ATM" version of tc both forward and backward compatible. One other point arises here. The fields in "tc_ratespec" that "tc" fills and the kernel ignores are there so "tc show" will work. The essence of the problem is "tc" compiles the stuff you give it into a single "RTAB". That "RTAB" can't be reverse compiled into the original numbers the user provided. So if "tc show" is to work, "tc" has to save that information somewhere. I don't think the "tc_ratespec" was the best choice for two reasons. Firstly, having the fields show up in tc_ratespec makes it seem like the kernel can use them. It can't, as the "overhead" example above shows. Secondly, from tc's point of view it is inflexible. Over time new features have been be added to "tc", and each time a new way of encoding it in the existing "tc_ratespec" has to be invented. Thus we now have hacks like the storing the "overhead" in the upper bits of the MPU figure. A better solution would be to provide a TLV (ie a TCA_XXX constant) for TC's private use. From the kernels point of view it would be an opaque structure which just saves and echos back when asked. This would solve both problems. > > However, now you lot have made me go away and think, I have > > another idea on how to attack this. Perhaps it will be > > more palatable to you. It would replace RTAB and STAB with > > a 28 byte structure for most protocol stacks - well all I can > > think of off the top of my head, anyway. RTAB would have to > > remain for backwards compatibility, of course. > > Can you describe in more detail? OK, but first I want to make the point that the only reason I suggest this is to get some sort of ATM patch into the kernel, as the current patch on the table is having a rough time. Alan Cox made the point earlier (if I understood him correctly) that this tabling lookup probably isn't a big win on modern CPU's - we may be better off moving it all into the kernel. Thinking about this, I tried to come up with a way of describing the mapping between skb->len and the on the wire packet length for every protocol I know. This is what I came up with. Assume we have a packet length L, which is to be transported by some protocol. For now we consider one protocol only, ie: TCP, PPP, ATM, Ethernet or whatever. I will generalise it to multiple protocols later. I think a generalised transformation can be made using using 5 numbers which are applied in this order: Overhead - A fixed overhead that is added to L. Mpu - Minimum packet size. If the result of (Overhead+L) is smaller that this, then the new result becomes this size. Round - The result is then rounded up to this many bytes. For protocols that always transmit single bytes this figure would be 1. If there were some protocol that transmitted data as 4 byte chunks then this would be 4. For ATM it is 48. CellPay - If the packet is broken down into smaller packets when sent, then this is the amount of data that will fit into each chunk. CallOver - This is the additional overhead each cell carries. The idea is the kernel would do this calculation on the fly for each packet. If you represent this set of number numbers as a comma separated list in the order they were presented above, then here are some examples: IP: 20 Ethernet: 18,64 PPP: 2 ATM: 0,0,48,48,5 It may be that 5 numbers are a overkill. It is for all protocols I am aware of - for those you could get away with 4. But I am no expert. The next step is to generalise for many protocols. As the protocols are stacked the length output by one protocol becoming the input length for the downstream one. So we just need to apply the same transformation serially. I will use '+' to indicate the stacking. For a typical ATM stack, PPPoE over LLC, we have: ppp:2+pppoe:6+ethernet:14,64+llc:8+all5:4+atm:0,0,48,48,5 If this were implemented naively, then the kernel would have to apply the above calculation 6 times, like this: Protocol InputLength OutputLength --------- ------------ ---------------- ppp skb->len skb->len+2 pppoe: skb->len+2 skb->len+2+6 ethernet: skb->len+2+6 skb->len+2+6+14 ... and so on. But it can be optimised. In this particular case we can combine those six operations into 1: adsl_pppoe_llc:34,64,48,48,5 The five numbers have the same meaning as before. It it not difficult to come up with a generalised rule that allows you to do this for most cases. For the remainder (if they exist - I can't think of any) the kernel would have to apply the transformation iteratively. Before going on, it is worth while comparing this to the current RTAB solution (and by implication STAB): 1. Oddly, the number of steps and hence speed for common protocols is probably the same. Compare: RTAB - You have to add an OverHead in the general case. - You have to scale by cell_log. - You have to ensure the overhead+skb->len doesn't overflow / underflow the RTAB. - You have to do the lookup. New - You have to add overhead. - You have to check the MPU. - You have to check if you have to apply Round,CellPay,CellOver - but you won't have to for any protocol except ATM. 2. Because of the cell_log, RTAB gives an 100% accurate answer 1 time in every (1<<cell_log) packet lengths. The new version is always 100% accurate. 3. The new version isn't as flexible as RTAB, as least from the kernels point of view. Conceivably there are protocols that could be handled by RTAB that are not handled by the new one. Since RTAB is computed in user space, this implies these new protocols might be handled by a user space change only. The new version would always require a kernel change. Note however that even RTAB required a kernel change for ATM, however. So far we have what your STAB would provide us with - a way to calculate the packet length. It takes 5 int's for every protocol stack I can think of. It probably runs faster for most protocols but is less robust to the introduction of new protocols. But some qdisc's need to know how long it takes to send a packet. This is what RTAB provides us with, in fact. So if we were to do away with RTAB completely, then we need a way for the kernel to covert packet lengths into the time it takes to send a packet. This is what I discuss next. The comments apply to both STAB and the new algorithm above, as they both compute packet lengths. If J is the time it takes to send one byte over a link, then we can compute RTAB from STAB like this: for (int i = 0; i < array_length(STAB); i += 1) RTAB[i] = STAB[i] * J; This is exactly the operation "tc" performs now in user space. It is possibly in user space because J is usually less than 1, and thus is most conveniently represented as a floating point number. Floating point operations in the kernel are verboten. I can think of two ways to move this operation into the kernel. The straight forward way is to represent J as a scale and a division. Ie the kernel does: RTAB[i] = (STAB[i] << scale) / divisor. The second way depends on that fact that most CPU's can multiply two 32 uint's to produce a 64 bit result in a single operation. I don't know whether this operation is available to kernel code. But if it is, J can be represented as a 32 bit fixed point number, with the implied decimal point after the most significant 8 bits. Then this operation would suffice: extern long long mul64(unsigned long, a, unsigned long b); RTAB[i] = (unsigned long)(mul64(STAB[i], J) >> 24); This method doesn't use division, and is probably faster on lower end CPU's. It would handle 100G Ethernet on a machine with Hz == 1000, and 1200 bits/sec on a machine with Hz == 10000. _______________________________________________ LARTC mailing list LARTC@xxxxxxxxxxxxxxx http://mailman.ds9a.nl/cgi-bin/mailman/listinfo/lartc