[PATCH v2 net-next 0/3] ipv4: Hash-based multipath routing

pch@xxxxxxxxxxxx · Fri, 28 Aug 2015 22:00:47 +0200

When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
from more or less being destination-based into being quasi-random per-packet
scheduling. This increases the risk of out-of-order packets and makes it
impossible to use multipath together with anycast services.

This patch series seeks to extend the multipath system to support both L3 and
L4-based multipath while still supporting per-packet multipath.

The multipath algorithm is set as a per-route attribute (RTA_MP_ALGO) with some
degree of binary compatibility with the old implementation (2.6.12 - 2.6.22),
but without source level compatibility since attributes have different names:

RT_MP_ALG_L3_HASH:
L3 hash-based distribution. This was IP_MP_ALG_NONE, which with the route
cached behaved somewhat like L3-based distribution. This is the new default.

RT_MP_ALG_PER_PACKET:
Per-packet distribution. Was IP_MP_ALG_RR. Uses round-robin.

RT_MP_ALG_DRR, RT_MP_ALG_RANDOM, RT_MP_ALG_WRANDOM:
Unsupported values, but reserved because they existed in 2.6.12 - 2.6.22.

RT_MP_ALG_L4_HASH:
L4 hash-based distribution. This is new.

The traditional modulo approach is replaced by a threshold-based approach,
described in RFC 2992. This reduces disruption in case of link failures or
route changes.

To better support anycast environments where PMTU usually breaks with
multipath, certain ICMP packets are hashed using the IP addresses within the
ICMP payload when using L3 hashing. This ensures that ICMP packets are routed
over the same path as the flow they belong to. It is not enabled with L4
hashing, since we can only consistently rely on L4 information, when PMTU is
used, and PMTU may be used in one direction while not being used in the other.

As a side effect, the multipath spinlock was removed and the code got faster.
I measured ip_mkroute_input (excl. __mkroute_input) on a Xeon X3350 (4 cores,
2.66GHz) with two paths:

Old per-packet: ~393.9 cycles(tsc)
New per-packet:  ~75.2 cycles(tsc)
New L3:         ~107.9 cycles(tsc)
New L4:         ~129.1 cycles(tsc)

The timings are approximately the same with a single core, except for the old
per-packet which gets faster (~199.8 cycles) most likely because there is no
contention on the spinlock.

If this patch is accepted, a follow-up patch to iproute2 will also be
submitted.

Changes in v2:
- Replaced 8-bit xor hash with 31-bit jenkins hash
- Don't scale weights (since 31-bit)
- Avoided unnecesary renaming of variables
- Rely on DF-bit instead of fragment offset when checking for fragmentation
- upper_bound is now inclusive to avoid overflow
- Use a callback to postpone extracting flow information until necessary
- Skipped ICMP inspection entirely with L4 hashing
- Handle newly added sysctl ignore_routes_with_linkdown

Best Regards,
 Peter Nørlund

Peter Nørlund (3):
      ipv4: Lock-less per-packet multipath
      ipv4: L3 and L4 hash-based multipath routing
      ipv4: ICMP packet inspection for L3 multipath

 include/net/ip_fib.h           |  26 ++++++-
 include/net/route.h            |  12 ++-
 include/uapi/linux/rtnetlink.h |  14 +++-
 net/ipv4/Kconfig               |   1 +
 net/ipv4/fib_frontend.c        |   4 +
 net/ipv4/fib_semantics.c       | 168 ++++++++++++++++++++++++++---------------
 net/ipv4/icmp.c                |  34 ++++++++-
 net/ipv4/route.c               | 112 +++++++++++++++++++++++++--
 8 files changed, 298 insertions(+), 73 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html