As promised, I sat down and wrote up conditions and current problems with some of the calculations in the CCID 3 module. Changes are required only for the sender. In a nutshell, the strategy solves * the problem that dividing-by-seconds is not handled correctly * defines a sufficient resolution for tracking sending rates * gets rid of the overflow problem The cost is that * at least 1 and ideally 3 structure elements of tfrc_tx_info make it to __u64 ===> Arnaldo are you OK with this? but this has the advantage of tackling both overflow and the missing resolution of lower send rates. The detailed derivation is below, I am putting this into patches at the moment. It would be good to have this codified before Ian starts testing. Reasonable assumptions for and a corresponding scheme for the calculation of CCID 3 rates ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Problem: ======== The CCID 3 specification uses bytes, seconds, and bytes/second as units. Additionally, the timestamp options (Timestamp, Timestamp Echo, and Elapsed Time) require a representation of time in units of 10 * microseconds. The kernel internally uses microseconds to represent time. This results in many divisions of the type `s * 1E6 / t' where s is a packet size in bytes and t is a time unit in microseconds. (a) Problem 1: Overflow ------------------- Since currently receive/sending rates are stored as u32, this leads to overflow. There are several ways of tackling this: * the patch by Nishida-san implements a subset of rational arithmetic where all numbers are expressed as a struct fix-point { long long num; long long denom; }; The problem with this is that it has to be carried through for each single elementary operation (+,-,*,==,>,<,/,%), which is messy. * the current Linux implementation uses a function usecs_div(u32 a, u32 b) = a * 1E6 / b which computes a divisor in such a way that u32 overflow is avoided for a. While this goes a some way towards avoiding overflow, it comes at a price: since a piecewise-linear function is used, a constant is substituted for many different values. The result is a substantial computation error - this can be as high as 100% (I have a proof and test program). (b) Problem 2: What to do with non-microsecond time units ----------------------------------------------------- There are a few computations and constants in RFC 3448 and RFC 4342 which are not based on microsecond time units. The following table lists these s/t_mbi 4.3 bytes/second 0 if s < 64 s/(2*t_mbi) 4.4 bytes/second 0 if s < 32 X_recv = X_calc/4 4.4 bytes/second 0 if X_calc < 4 bytes/second X/2 4.4 bytes second These expressions are currently not handled correctly - this is a BUG in CCID 3. Solution: ========= A brute-force idea is to scaled everything by 1E6 and use bytes/microsecond (or 1e6 bytes/second) as basis for sending rates. But this introduces overflow for many calculations - even on 64 bit. i) Minimal granularity for sending/receiving rate -------------------------------------------------- A high granularity is not necessary: * [RFC 3448, 4.3] specifies the minimum sending rate as 1 packet in 64 seconds * the packet size `s' is never smaller than 1 byte and never larger than 2^32 - 1 (4 GByte), since this is (a) the largest size of an IPv6 jumbogram (RFC 2765) and (b) the type limit of the current representation * hence, 1/64 <= X <= 2^32 - 1 It is therefore fully sufficient to store X (X_recv, X_calc) in units of 64bytes/second. Internet speeds are growing fast, having a resolution of speeds higher than 1/64 is thus a truly wasted effort. ii) Implications for storage size ----------------------------------- Currently X, X_calc, X_recv are all stored as u32. Due to the above this is no longer sufficient: we are now at least 6 bits short. * the receiver does not necessarily need a modification since it sends X_recv as u32 anyway * at the sender it is however better to scale all X quantities. It is like enlarging a picture - everything needs to keep its proportions. * hence the changes will look like: struct tfrc_tx_info { __u64 tfrctx_x; /* this is really required */ __u64 tfrctx_x_recv; /* would be better to have as 64-bit */ __u64 tfrctx_x_calc; /* ditto */ /* ... */ }; * the advantages are that (a) usecs_div is now much simpler (b) it would mean a headache and a mess (but it is possible) to keep the sending rate in different sizes (scaled/unscaled); but with the above we don't need to (c) u64 is sufficient even when scaling both by 64 and by 1E6 iii) Implications for algorithm ------------------------------- This is a clerical job but needs to be done systematically: at the sender, all operations and comparisons have to be changed with regard to the scaled format. This is now done below. For the receiver, no changes are required. * Initialisation [RFC 3448, 4.2] Set X to 64 * 1 packet / second (i.e. X = s << 6) * Sender behaviour when feedback is received [RFC 3448, 4.3 - step (4)] ==> When X_recv is computed set X_recv = hctx->packet_options_x_recv << 6; If (p > 0) { X_calc = calcX(R, s, p); /* we leave that as it is for the moment */ X_calc << 6; /* scale by 64 */ X = max(min(X_calc, 2*X_recv), /* X_recv is also scaled */ (s << 6) / t_mbi) ); /* now it is correct */ } Else If(t_now - tld >= R) { X = max(2 * min(X, X_recv), scaled_div(s << 6, R); /* now also works */ tld = t_now } Where the analogue of usecs_div now is renamed into: u64 scaled_div(u64 size, u32 time) { return (size * 1000000) / time; } ==> step (5): the nofeedback timer now expires in max(4 * R, 2 * scaled_div(s, X >> 6)) /* to obtain microseconds */ This use is why I would like to suggest a renaming into scaled_div - we are not dividing by microseconds here. * Expiration of nofeedback timer [RFC 3448, 4.4] (a) If the sender already has got some feedback: If (X_calc > 2 * X_recv) /* comparison is in proportion */ X_recv = max(X_recv/2, (s << 6)/(2*t_mbi)); /* now works as expected */ Else X_recv = X_calc/4 ==> expire the nofeedback timer after max(4 * R, 2 * scaled_div(s, X >> 6)) /* to obtain microseconds */ (b) If the sender does not yet have any feedback then X = max(X/2, (s << 6)/t_mbi) /* now also works as expected */ ==> expire the nofeedback timer after 2 seconds * Preventing oscillations [RFC 3448, 4.5] This is currently not supported, but the solution is forward-compatible, since the scaled X can be kept as it is. * Scheduling of Packet Transmissions [RFC 3448, 4.6] t_ipi = scaled_div(s, X >> 6) /* yes, microseconds */ * Larger initial windows [RFC 4342, sec. 5] w_init = min(4 * s, max(2 * s, 4380)) X = scaled_div(w_init << 6, R) That's all ! - To unsubscribe from this list: send the line "unsubscribe dccp" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html