On Mon, Nov 21, 2022 at 4:53 PM Ivan Babrou <ivan@xxxxxxxxxxxxxx> wrote:
>
> Hello,
>
> We have observed a negative effect on TCP throughput from the
> following commit:
>
> * 8e8ae645249b mm: memcontrol: hook up vmpressure to socket pressure
>
> It landed back in 2016 in v4.5, so it's not exactly a new issue.
>
> The crux of the issue is that in some cases with swap present, the
> workload can be unfairly throttled in terms of TCP throughput.

I guess defining 'fairness' in such a scenario is nearly impossible.

Have you tried changing /proc/sys/net/ipv4/tcp_rmem (and/or tcp_wmem)?
Defaults are quite conservative. If for your workload you want to
ensure a minimum amount of memory per TCP socket, that might be good
enough.

Of course, if your proxy has to deal with millions of concurrent TCP
sockets, I fear this is not an option.

> I am able to reproduce this issue in a VM locally on v6.1-rc6 with
> 8 GiB of RAM with zram enabled.
>
> The setup is fairly simple:
>
> 1. Run the following go proxy in one cgroup (it has some memory
> ballast to simulate useful memory usage):
>
> * https://gist.github.com/bobrik/2c1a8a19b921fefe22caac21fda1be82
>
> sudo systemd-run --scope -p MemoryLimit=6G go run main.go
>
> 2. Run the following fio config in another cgroup to simulate mmapped
> page cache usage:
>
> [global]
> size=8g
> bs=256k
> iodepth=256
> direct=0
> ioengine=mmap
> group_reporting
> time_based
> runtime=86400
> numjobs=8
> name=randread
> rw=randread
>
> [job1]
> filename=derp
>
> sudo systemd-run --scope fio randread.fio
>
> 3. Run curl to request a large file via the proxy:
>
> curl -o /dev/null http://localhost:4444
>
> 4. Observe low throughput. The numbers here depend on your location,
> but in my VM the throughput drops from 60MB/s with fio stopped to
> 10MB/s with fio running.
>
> With some perf tracing, I can see that this happens because of the
> commit I mentioned:
>
> sudo perf probe --add 'vmpressure:48 memcg->css.cgroup->kn->id scanned
> vmpr_scanned=vmpr->scanned reclaimed vmpr_reclaimed=vmpr->reclaimed'
> sudo perf probe --add 'vmpressure:72 memcg->css.cgroup->kn->id'
>
> I can record the probes above while curl is running:
>
> sudo perf record -a -e probe:vmpressure_L48,probe:vmpressure_L72 -- sleep 5
>
> Line 48 lets me observe the scanned and reclaimed page counters; line
> 72 is the actual throttling.
>
> Here's an example trace showing my go proxy cgroup:
>
> kswapd0 89 [002] 2351.221995: probe:vmpressure_L48: (ffffffed2639dd90)
> id=0xf23 scanned=0x140 vmpr_scanned=0x0 reclaimed=0x0
> vmpr_reclaimed=0x0
> kswapd0 89 [007] 2351.333407: probe:vmpressure_L48: (ffffffed2639dd90)
> id=0xf23 scanned=0x2b3 vmpr_scanned=0x140 reclaimed=0x0
> vmpr_reclaimed=0x0
> kswapd0 89 [007] 2351.333408: probe:vmpressure_L72: (ffffffed2639de2c) id=0xf23
>
> We scanned lots of pages, but weren't able to reclaim anything.
>
> When throttling happens, it's in tcp_prune_queue, where rcv_ssthresh
> (the TCP window clamp) is set to 4 x advmss:
>
> * https://elixir.bootlin.com/linux/v5.15.76/source/net/ipv4/tcp_input.c#L5373
>
> else if (tcp_under_memory_pressure(sk))
>         tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
>
> I can see plenty of memory available, both in my go proxy cgroup and
> in the system in general:
>
> $ free -h
>                total        used        free      shared  buff/cache   available
> Mem:           7.8Gi       4.3Gi       104Mi       0.0Ki       3.3Gi       3.3Gi
> Swap:           11Gi       242Mi        11Gi
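
As an aside, tcp_under_memory_pressure() also returns true under
global TCP memory pressure, independently of the memcg signal, so it
may be worth ruling that side out when reproducing this. A quick,
purely illustrative check (both are standard procfs paths; the grep
pattern is just one convenient way to read the counter):

  # pages currently charged to TCP sockets ("mem" in the TCP: line)
  grep '^TCP:' /proc/net/sockstat

  # global thresholds, also in pages: low / pressure / high
  cat /proc/sys/net/ipv4/tcp_mem

If "mem" stays well below the middle (pressure) threshold, the clamp
observed here is most likely coming from the memcg/vmpressure path
rather than from tcp_mem.
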
> It just so happens that all of the memory is hot and is not eligible
> to be reclaimed. Since swap is enabled, the memory is still eligible
> to be scanned. If swap is disabled, then my go proxy is not eligible
> for scanning anymore (all of its memory is anonymous, so there is
> nowhere to reclaim it to), and the whole issue goes away.
>
> Punishing well-behaved programs like that doesn't seem fair. We saw
> production metals with 200GB of page cache out of 384GB of RAM, where
> a well-behaved proxy with 60GB of RAM + 15GB of swap is throttled
> like that. The fact that it only happens with swap makes it extra
> weird.
>
> I'm not really sure what to do with this. From our end we'll probably
> just pass cgroup.memory=nosocket on the kernel command line to disable
> this behavior altogether, since it's not like we're running out of TCP
> memory (and we can deal with that better if it ever comes to that).
> There should probably be a better general-case solution.

Probably :)

> I don't know how widespread this issue can be. You need a fair amount
> of page cache pressure pushing reclaim into anonymous memory to
> trigger this.
>
> Either way, this seems like a bit of a landmine.
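
For reference, a rough sketch of the cgroup.memory=nosocket workaround
mentioned above; the GRUB file and update command are distro-specific
(Debian/Ubuntu shown here) and only one of several ways to set a
kernel command-line parameter:

  # check whether the running kernel already disables socket accounting
  grep -o 'cgroup.memory=[^ ]*' /proc/cmdline

  # otherwise append the option to the kernel command line, e.g. in
  # /etc/default/grub:
  #   GRUB_CMDLINE_LINUX="... cgroup.memory=nosocket"
  sudo update-grub
  sudo reboot

Note that nosocket disables memcg socket memory accounting entirely,
not just the vmpressure-driven throttling, so per-cgroup accounting of
socket buffer memory goes away along with it.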