On Fri, May 22, 2020 at 6:51 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> Now that dirty inode writeback doesn't cause read-modify-write
> cycles on the inode cluster buffer under memory pressure, the need
> to throttle memory reclaim to the rate at which we can clean dirty
> inodes goes away. That is due to the fact that we no longer thrash
> inode cluster buffers under memory pressure to clean dirty inodes.
>
> This means inode writeback no longer stalls on memory allocation
> or read IO, and hence can be done asynchronously without generating
> memory pressure. As a result, blocking inode writeback in reclaim is
> no longer necessary to prevent reclaim priority windup, as cleaning
> dirty inodes is no longer dependent on having memory reserves
> available for the filesystem to make progress reclaiming inodes.
>
> Hence we can convert inode reclaim to be non-blocking for shrinker
> callouts, both for direct reclaim and kswapd.
>
> On a vanilla kernel, running a 16-way fsmark create workload on a
> 4 node/16p/16GB RAM machine, I can reliably pin 14.75GB of RAM via
> userspace mlock(). The OOM killer gets invoked at 15GB of
> pinned RAM.
>
> With this patch alone, pinning memory triggers premature OOM
> killer invocation, sometimes with as much as 45% of RAM being free.
> It's trivially easy to trigger the OOM killer when reclaim does not
> block.
>
> With pinning inode clusters in RAM adn then adding this patch, I can

typo: adn

> reliably pin 14.5GB of RAM and still have the fsmark workload run to
> completion. The OOM killer gets invoked at 14.75GB of pinned RAM, which
> is only a small amount of memory less than the vanilla kernel. It is
> much more reliable than just with async reclaim alone.
>
> simoops shows that allocation stalls go away when async reclaim is
> used. Vanilla kernel:
>
> Run time: 1924 seconds
> Read latency (p50: 3,305,472) (p95: 3,723,264) (p99: 4,001,792)
> Write latency (p50: 184,064) (p95: 553,984) (p99: 807,936)
> Allocation latency (p50: 2,641,920) (p95: 3,911,680) (p99: 4,464,640)
> work rate = 13.45/sec (avg 13.44/sec) (p50: 13.46) (p95: 13.58) (p99: 13.70)
> alloc stall rate = 3.80/sec (avg: 2.59) (p50: 2.54) (p95: 2.96) (p99: 3.02)
>
> With inode cluster pinning and async reclaim:
>
> Run time: 1924 seconds
> Read latency (p50: 3,305,472) (p95: 3,715,072) (p99: 3,977,216)
> Write latency (p50: 187,648) (p95: 553,984) (p99: 789,504)
> Allocation latency (p50: 2,748,416) (p95: 3,919,872) (p99: 4,448,256)
> work rate = 13.28/sec (avg 13.32/sec) (p50: 13.26) (p95: 13.34) (p99: 13.34)
> alloc stall rate = 0.02/sec (avg: 0.02) (p50: 0.01) (p95: 0.03) (p99: 0.03)
>
> Latencies don't really change much, nor does the work rate. However,
> allocation almost never stalls with these changes, whilst the
> vanilla kernel is sometimes reporting 20 stalls/s over a 60s sample
> period. This difference is due to inode reclaim being largely
> non-blocking now.
>
> IOWs, once we have pinned inode cluster buffers, we can make inode
> reclaim non-blocking without a major risk of premature and/or
> spurious OOM killer invocation, and without any changes to memory
> reclaim infrastructure.
>
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---

I can confirm observing these stalls on production servers,
and that this patch alone fixes them.

Reviewed-by: Amir Goldstein <amir73il@xxxxxxxxx>

Thanks,
Amir.
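
P.S. For anyone reading along without the series applied, the shape of
the shrinker-side change is roughly the sketch below. This is only an
illustration of the idea (the shrinker callout stops asking for
blocking reclaim, so the scan trylocks and skips inodes it cannot
clean immediately instead of waiting on writeback); the function and
flag names follow the existing xfs_icache.c conventions, and the
authoritative change is the patch body trimmed above.

/*
 * Sketch only, not the actual diff: the shrinker entry point no longer
 * passes SYNC_WAIT, so xfs_reclaim_inodes_ag() runs with pure trylock
 * semantics and never blocks on inode cluster buffer I/O.
 */
long
xfs_reclaim_inodes_nr(
	struct xfs_mount	*mp,
	int			nr_to_scan)
{
	/* kick background reclaim and push the AIL to help memory reclaim */
	xfs_reclaim_work_queue(mp);
	xfs_ail_push_all(mp->m_ail);

	/* was SYNC_TRYLOCK | SYNC_WAIT -- the blocking flag is dropped */
	return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK, &nr_to_scan);
}

With the blocking flag gone, both direct reclaim and kswapd return from
the callout quickly rather than stalling behind inode writeback, which
is what removes the allocation stalls in the simoops numbers above.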