When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
from being more or less destination-based to quasi-random per-packet
scheduling. This increases the risk of out-of-order packets and makes it
impossible to use multipath together with anycast services.

This patch series seeks to extend the multipath system to support both L3 and
L4-based multipath while still supporting per-packet multipath.

The multipath algorithm is set as a per-route attribute (RTA_MP_ALGO) with some
degree of binary compatibility with the old implementation (2.6.12 - 2.6.22),
but without source-level compatibility since the attributes have different
names:

RT_MP_ALG_L3_HASH:
  L3 hash-based distribution. This was IP_MP_ALG_NONE, which with the route
  cache behaved somewhat like L3-based distribution. This is the new default.

RT_MP_ALG_PER_PACKET:
  Per-packet distribution. Was IP_MP_ALG_RR. Uses round-robin.

RT_MP_ALG_DRR, RT_MP_ALG_RANDOM, RT_MP_ALG_WRANDOM:
  Unsupported values, but reserved because they existed in 2.6.12 - 2.6.22.

RT_MP_ALG_L4_HASH:
  L4 hash-based distribution. This is new.

The traditional modulo approach is replaced by a threshold-based approach,
described in RFC 2992. This reduces disruption in case of link failures or
route changes.

To better support anycast environments, where PMTU discovery usually breaks
with multipath, certain ICMP packets are hashed using the IP addresses within
the ICMP payload when using L3 hashing. This ensures that ICMP packets are
routed over the same path as the flow they belong to. It is not enabled with
L4 hashing, since L4 information can only be relied on consistently when PMTU
is used, and PMTU may be used in one direction while not being used in the
other.

As a side effect, the multipath spinlock was removed and the code got faster.
I measured ip_mkroute_input (excl. __mkroute_input) on a Xeon X3350 (4 cores,
2.66GHz) with two paths:

  Old per-packet: ~393.9 cycles(tsc)
  New per-packet:  ~75.2 cycles(tsc)
  New L3:         ~107.9 cycles(tsc)
  New L4:         ~129.1 cycles(tsc)

The timings are approximately the same with a single core, except for the old
per-packet, which gets faster (~199.8 cycles), most likely because there is no
contention on the spinlock.

If this patch is accepted, a follow-up patch to iproute2 will also be
submitted.

Changes in v2:
- Replaced 8-bit xor hash with 31-bit jenkins hash
- Don't scale weights (since 31-bit)
- Avoided unnecessary renaming of variables
- Rely on DF-bit instead of fragment offset when checking for fragmentation
- upper_bound is now inclusive to avoid overflow
- Use a callback to postpone extracting flow information until necessary
- Skipped ICMP inspection entirely with L4 hashing
- Handle newly added sysctl ignore_routes_with_linkdown

Best Regards,
 Peter Nørlund

Peter Nørlund (3):
  ipv4: Lock-less per-packet multipath
  ipv4: L3 and L4 hash-based multipath routing
  ipv4: ICMP packet inspection for L3 multipath

 include/net/ip_fib.h           |  26 ++++++-
 include/net/route.h            |  12 ++-
 include/uapi/linux/rtnetlink.h |  14 +++-
 net/ipv4/Kconfig               |   1 +
 net/ipv4/fib_frontend.c        |   4 +
 net/ipv4/fib_semantics.c       | 168 ++++++++++++++++++++++++++---------------
 net/ipv4/icmp.c                |  34 ++++++++-
 net/ipv4/route.c               | 112 +++++++++++++++++++++++++--
 8 files changed, 298 insertions(+), 73 deletions(-)
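
For readers unfamiliar with the threshold-based approach from RFC 2992 mentioned in the cover letter, here is a minimal userspace sketch of the idea. It is not the code from this series: struct nexthop, rebalance(), select_path() and HASH_SPACE are hypothetical names, and the round-to-nearest boundary calculation is an assumption. It only illustrates how a 31-bit hash space can be split into weighted regions with inclusive upper bounds, so that the last bound is 2^31 - 1 and still fits in 31 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical next-hop record; field names are illustrative only. */
struct nexthop {
	uint32_t weight;      /* relative weight (assumed small, e.g. 1..256) */
	int alive;            /* 0 if the link is down */
	int64_t upper_bound;  /* inclusive end of this hop's region, -1 = none */
};

/* A 31-bit hash takes values 0 .. 2^31 - 1. */
#define HASH_SPACE ((uint64_t)1 << 31)

/*
 * Recompute the region boundaries. Each live next hop gets a contiguous
 * slice of the hash space proportional to its weight. Dead hops get no
 * region, so their share is spread over the remaining hops.
 */
static void rebalance(struct nexthop *nh, size_t n)
{
	uint64_t total = 0, acc = 0;
	size_t i;

	assert(nh != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (nh[i].alive)
			total += nh[i].weight;

	for (i = 0; i < n; i++) {
		if (!nh[i].alive || total == 0) {
			nh[i].upper_bound = -1;
			continue;
		}
		acc += nh[i].weight;
		/* Round to nearest, then subtract 1 to make the bound inclusive. */
		nh[i].upper_bound =
			(int64_t)((HASH_SPACE * acc + total / 2) / total) - 1;
	}
}

/* Return the index of the first hop whose region contains the hash, or -1. */
static int select_path(const struct nexthop *nh, size_t n, uint32_t hash)
{
	size_t i;

	hash &= 0x7fffffff;	/* keep 31 bits */
	for (i = 0; i < n; i++)
		if ((int64_t)hash <= nh[i].upper_bound)
			return (int)i;
	return -1;
}
```

With the modulo approach, changing the number of paths remaps most flows. Here a change only moves the region boundaries, so flows whose hash does not fall in a moved slice stay on their path.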