On Fri 08-08-14 09:26:35, Johannes Weiner wrote:
> On Fri, Aug 08, 2014 at 02:32:58PM +0200, Michal Hocko wrote:
> > On Thu 07-08-14 11:31:41, Johannes Weiner wrote:
[...]
> > > THP latencies are actually the same when comparing high limit nr_pages
> > > reclaim with the current hard limit SWAP_CLUSTER_MAX reclaim,
> >
> > Are you sure about this? I fail to see how they can be same as THP
> > allocations/charges are __GFP_NORETRY so there is only one reclaim
> > round for the hard limit reclaim followed by the charge failure if
> > it is not successful.
>
> I use this test program that faults in anon pages, reports average and
> max for every 512-page chunk (THP size), then reports the aggregate at
> the end:
>
> memory.max:
>
> avg=18729us max=450625us
>
> real    0m14.335s
> user    0m0.157s
> sys     0m6.307s
>
> memory.high:
>
> avg=18676us max=457499us
>
> real    0m14.375s
> user    0m0.046s
> sys     0m4.294s

I was playing with something like that as well: mmap an 800MB anon
mapping in a 256MB memcg (the kvm guest had 1G RAM and 2G swap so that
the global reclaim doesn't trigger, and the host had 2G of free memory),
start faulting in from a THP aligned address and measure each fault.
Then I recorded the mm_vmscan_lru_shrink_inactive and
mm_vmscan_memcg_reclaim_{begin,end} tracepoints to see how the reclaim
went.

I was testing two setups:
1) fault in every 4k page
2) fault in only 2M aligned addresses

The first simulates the case where a successful THP allocation saves the
follow-up 511 fallback charges, so the excessive reclaim might pay off.
The second simulates potential time wasting when memory is used
extremely sparsely and any latencies would be unwelcome.

(new refers to the nr_reclaim target, old to SWAP_CLUSTER_MAX; thponly
faults only 2M aligned addresses, otherwise 4k pages are faulted)

vmstat says:

out.256m.new-thponly.vmstat.after:pswpin 44
out.256m.new-thponly.vmstat.after:pswpout 154681
out.256m.new-thponly.vmstat.after:thp_fault_alloc 399
out.256m.new-thponly.vmstat.after:thp_fault_fallback 0
out.256m.new-thponly.vmstat.after:thp_split 302

out.256m.old-thponly.vmstat.after:pswpin 28
out.256m.old-thponly.vmstat.after:pswpout 31271
out.256m.old-thponly.vmstat.after:thp_fault_alloc 149
out.256m.old-thponly.vmstat.after:thp_fault_fallback 250
out.256m.old-thponly.vmstat.after:thp_split 61

out.256m.new.vmstat.after:pswpin 48
out.256m.new.vmstat.after:pswpout 169530
out.256m.new.vmstat.after:thp_fault_alloc 399
out.256m.new.vmstat.after:thp_fault_fallback 0
out.256m.new.vmstat.after:thp_split 331

out.256m.old.vmstat.after:pswpin 47
out.256m.old.vmstat.after:pswpout 156514
out.256m.old.vmstat.after:thp_fault_alloc 127
out.256m.old.vmstat.after:thp_fault_fallback 272
out.256m.old.vmstat.after:thp_split 127

As expected, new managed to fault in all requests as THP without a
single fallback allocation, while with the old reclaim we hit the limit
and then most of the THP charges failed and fell back to single page
charges. Note the increased swapout activity for new: it is almost 5x
more for thponly and +8% with per-page faults. This looks like fallout
from the over-reclaim at lower priorities. The tracepoints tell us the
priority at which each reclaim round ended:

- trace.new-thponly
  Count Priority
      1 3
      2 5
    159 6
     24 7

- trace.old-thponly
    230 10
      1 11
      1 12
      1 3
     39 9

Again, as expected, the priority drops much lower for new.
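For reference, a minimal sketch of the kind of fault latency measurement
described above could look like the following. This is not the exact
program behind any of the numbers here; the 800MB size, the "thponly"
argument and the plain per-fault latency dump are only illustrative
assumptions.

/*
 * Illustrative sketch only: mmap a large anon region, touch it either
 * every 4k page or only at 2M (THP) aligned addresses, and record how
 * long each fault takes so that average/max/top-N latencies can be
 * computed afterwards.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/mman.h>

#define MAP_SIZE	(800UL << 20)	/* 800MB anon mapping */
#define PAGE_SIZE_4K	4096UL
#define THP_SIZE	(2UL << 20)	/* 2M == 512 pages */

static unsigned long now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000UL + ts.tv_nsec / 1000;
}

int main(int argc, char **argv)
{
	/* "thponly" faults only 2M aligned addresses, otherwise every 4k page */
	unsigned long step = (argc > 1 && !strcmp(argv[1], "thponly")) ?
			     THP_SIZE : PAGE_SIZE_4K;
	unsigned long nr = MAP_SIZE / step;
	unsigned long *lat;
	unsigned long i;
	char *map;

	lat = malloc(nr * sizeof(*lat));
	if (!lat)
		return 1;

	/* over-map by 2M so faulting can start from a THP aligned address */
	map = mmap(NULL, MAP_SIZE + THP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return 1;
	map = (char *)(((unsigned long)map + THP_SIZE - 1) & ~(THP_SIZE - 1));

	for (i = 0; i < nr; i++) {
		unsigned long start = now_us();

		map[i * step] = 1;	/* trigger the fault and memcg charge */
		lat[i] = now_us() - start;
	}

	/* one latency [us] per line; aggregation is done offline */
	for (i = 0; i < nr; i++)
		printf("%lu\n", lat[i]);

	return 0;
}

A loop like this, run inside the memcg under test with and without the
thponly argument, corresponds to the two setups; the dumped per-fault
latencies can then be summed up for the top-N comparisons further down.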
The same priority data for the per-page runs:

- trace.new
    229 0
      3 12

- trace.old
    294 0
      2 1
     25 10
      1 11
      3 12
      8 2
      8 3
     20 4
     33 5
     21 6
     43 7
   1286 8
   1279 9

And here as well: we have to reclaim much more because we do many more
charges, so the load benefits a bit from the higher reclaim target.

The mm_vmscan_memcg_reclaim_end tracepoint also tells us how many pages
were reclaimed during each run; the cumulative numbers are:

- trace.new-thponly: 139029
- trace.old-thponly: 11344
- trace.new: 139687
- trace.old: 139887

time -v says:

out.256m.new-thponly.time: System time (seconds): 1.50
out.256m.new-thponly.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.56
out.256m.old-thponly.time: System time (seconds): 0.45
out.256m.old-thponly.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:03.76
out.256m.new.time: System time (seconds): 1.45
out.256m.new.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.12
out.256m.old.time: System time (seconds): 2.08
out.256m.old.time: Elapsed (wall clock) time (h:mm:ss or m:ss): 0:15.26

I guess this is expected as well. Sparse access doesn't amortize the
costly reclaim for each charged THP. On the other hand it can help a bit
when the whole mmap is populated.

If we compare fault latencies, we get the following:

- the worst latency [ms]:
out.256m.new-thponly     1991
out.256m.old-thponly     1838
out.256m.new             6197
out.256m.old             5538

- top 5 worst latencies (sum in [ms]):
out.256m.new-thponly     5694
out.256m.old-thponly     3168
out.256m.new             9498
out.256m.old             8291

- top 10:
out.256m.new-thponly     7139
out.256m.old-thponly     3193
out.256m.new            11786
out.256m.old             9347

- top 100:
out.256m.new-thponly    13035
out.256m.old-thponly     3434
out.256m.new            14634
out.256m.old            12881

I think this shows that my concern about excessive reclaim and stalls is
real and that it is worse when the memory is used sparsely. It is true
that it might help when the whole THP section is used, so the additional
cost is amortized, but the more sparsely each THP section is used, the
higher the overhead you are adding without userspace actually asking for
it.
--
Michal Hocko
SUSE Labs