On Tue, 26 Sep 2006 17:55:48 -0700 Ben Woodard <woodard@xxxxxxxxxx> wrote:

> Here at LLNL we have a rather challenging network environment on our
> clusters. We basically have thousands of gigE links attached to an
> oversubscribed federated network. Most of the time this network is idle,
> but the expected workload is for regular spikes of extremely heavy
> activity lasting a few minutes. All end-points, in a highly coordinated
> manner, typically after exiting an MPI barrier, start pushing as much
> data as possible through the oversubscribed core. The result is a wave
> of TCP back-offs where all the TCP streams back off in lock step. The
> network oscillates from highly congested for brief moments to largely
> idle. Given enough time TCP will settle down into something mostly
> reasonable, but even then it causes us a few problems:

How far apart are the flood points? Are the connections correctly going
back to slow start?

Why not just set the cwnd clamp for that path to be low enough to avoid
excessive greediness? The clamp is per TCP connection, so if you have
application-specific knowledge you could just set the limit to:

	Bandwidth Delay Product / N connections = cwnd limit

Probably add 10% to allow for some settling.

Also, perhaps you are seeing the effect of an older kernel and a buggy
version of BIC?

-- 
Stephen Hemminger <shemminger@xxxxxxxx>
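
The clamp arithmetic above can be sketched as follows. This is only an illustration of the suggested formula; the bandwidth, RTT, MSS, and connection-count values are hypothetical, not measurements from the LLNL clusters:

```python
# Sketch of the suggested per-connection cwnd clamp:
#   cwnd limit = Bandwidth-Delay Product / N connections, plus ~10% headroom.
# All numeric inputs below are hypothetical examples.

def cwnd_clamp_segments(bandwidth_bps, rtt_s, mss_bytes, n_connections,
                        headroom=0.10):
    """Return a per-connection cwnd clamp in MSS-sized segments."""
    bdp_bytes = bandwidth_bps / 8 * rtt_s          # bandwidth-delay product
    per_conn_bytes = bdp_bytes / n_connections     # fair share per stream
    segments = per_conn_bytes / mss_bytes
    return max(2, int(segments * (1 + headroom)))  # never clamp below 2 MSS

# Example: 1 Gbit/s shared path, 10 ms RTT, 1448-byte MSS, 100 streams.
print(cwnd_clamp_segments(1_000_000_000, 0.010, 1448, 100))  # -> 9
```

The resulting value could then be applied per destination with an iproute2 route metric along the lines of `ip route change <prefix> ... cwnd lock <N>` (syntax from iproute2; check your version's documentation).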