On 3/13/20 12:50 PM, Shakeel Butt wrote:
On Fri, Mar 13, 2020 at 12:46 PM Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> wrote:
On 3/13/20 12:33 PM, Shakeel Butt wrote:
On Fri, Mar 13, 2020 at 11:34 AM Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> wrote:
When backporting commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more
skipping pagevecs") to our 4.9 kernel, our test bench noticed around a 10%
drop with a couple of vm-scalability's test cases (lru-file-readonce,
lru-file-readtwice and lru-file-mmap-read). I didn't see that much of a drop
on my VM (32c-64g-2nodes). It might be caused by the test configuration,
which is 32c-256g with NUMA disabled and the tests run in the root memcg,
so the tests actually stress only one inactive and one active LRU. That is
not a very common setup in modern production environments.
That commit made two major changes:
1. Call page_evictable()
2. Use smp_mb() to force the setting of PG_lru to be visible
It looks like those two changes contribute most of the overhead.
page_evictable() is an out-of-line function with a full prologue and
epilogue, and it used to be called only from the page reclaim path. The LRU
add path, however, is very hot, so it is better to make it inline. It in
turn calls page_mapping(), which is not inlined either, but the disassembly
shows page_mapping() does no push/pop operations, and inlining it is not as
straightforward.
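For illustration, here is a minimal sketch of what the inline conversion
could look like, assuming the body of page_evictable() stays as it is in
mm/vmscan.c today and is simply moved into a header; the target header and
the bool return type are assumptions for the sketch, not the final patch:

/*
 * Sketch: move page_evictable() out of mm/vmscan.c into a header
 * (e.g. include/linux/swap.h) as a static inline, so the LRU add
 * hot path avoids the call/prologue/epilogue overhead.
 */
static inline bool page_evictable(struct page *page)
{
	bool ret;

	/* Prevent address_space of inode and swap cache from being freed */
	rcu_read_lock();
	ret = !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
	rcu_read_unlock();
	return ret;
}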
Other than this, the full smp_mb() is not necessary on x86, since
SetPageLRU() is an atomic operation that already enforces the memory
barrier there; the following patch replaces it with smp_mb__after_atomic().
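For context, a sketch of where the barrier sits in __pagevec_lru_add_fn()
in mm/swap.c after that commit, with the replacement applied; the
surrounding code is trimmed, so take this as an illustration rather than
the exact diff:

	SetPageLRU(page);
	/*
	 * As noted above, SetPageLRU() is an atomic bit op and on x86
	 * already acts as a full barrier, so the cheaper
	 * smp_mb__after_atomic() is enough to order the PG_lru store
	 * against the page_evictable() check below, instead of an
	 * unconditional smp_mb().
	 */
	smp_mb__after_atomic();

	if (page_evictable(page)) {
		/* evictable path: add to the active/inactive LRU as before */
	} else {
		/* unevictable path as before */
	}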
With the two fixes applied, the tests regain around 5% on that test bench
and return to normal on my VM. Since the test bench configuration is not
that common, and I also saw around a 6% improvement on the latest upstream,
this seems good enough IMHO.
Below is the test data (lru-file-readtwice throughput) against v5.6-rc4:

    mainline    w/ inline fix
    150MB       154MB
What is the test setup for the above experiment? I would like to get a repro.
Just start up a VM with two nodes, then run case-lru-file-readtwice or
case-lru-file-readonce from vm-scalability in the root memcg or with memcg
disabled, then take the average throughput (the dd result) from the test.
Our test bench uses the script from lkp, but I just ran it manually.
A single-node VM should show the effect more obviously, based on my test.
Thanks, I will try this on a real machine.
A real machine should be better. Our test bench is bare metal with NUMA
disabled; on my test VM the effect is not that obvious.