Re: Page allocator bottleneck

Tariq Toukan <tariqt@xxxxxxxxxxxx> · Sun, 17 Sep 2017 19:16:15 +0300

On 15/09/2017 10:28 AM, Jesper Dangaard Brouer wrote:
On Thu, 14 Sep 2017 19:49:31 +0300
Tariq Toukan <tariqt@xxxxxxxxxxxx> wrote:

Hi all,

As part of the efforts to support increasing next-generation NIC speeds,
I am investigating SW bottlenecks in the network stack receive flow.

Here I share some numbers I got for a simple experiment, in which I
simulate the page allocation rate needed in 200Gbps NICs.

Thanks for bringing this up again.

Sure. We need to keep up with the increasing NIC speeds.
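
For a sense of scale, here is a rough back-of-the-envelope sketch of the
page allocation rate such a NIC implies. It assumes full-size MTU 1500
frames, standard Ethernet framing overhead on the wire, and order-0 pages;
the function and constant names are illustrative only and are not taken
from the experiment.

```python
# Bytes each frame occupies on the wire beyond the MTU-sized payload:
# 14 (Ethernet header) + 4 (FCS) + 8 (preamble/SFD) + 12 (inter-frame gap).
ETH_WIRE_OVERHEAD = 38


def pages_per_second(link_bps, mtu=1500, packets_per_page=1):
    """Order-0 pages consumed per second at line rate (assumed model)."""
    wire_bits_per_packet = (mtu + ETH_WIRE_OVERHEAD) * 8
    packets = link_bps / wire_bits_per_packet
    return packets / packets_per_page


if __name__ == "__main__":
    for gbps in (100, 200):
        for per_page in (1, 2):
            rate = pages_per_second(gbps * 1e9, packets_per_page=per_page)
            print(f"{gbps} Gbps, {per_page} packet(s)/page: "
                  f"{rate / 1e6:.2f} M pages/s, {1e9 / rate:.1f} ns per page")
```

Under these assumptions, 200 Gbps with one packet per page needs roughly
16 million pages per second, i.e. a budget of about 60 ns per page summed
over all cores.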

I ran the test below over 3 different (modified) mlx5 driver versions,
loaded on server side (RX):
1) RX page cache disabled, 2 packets per page.

Two packets per page basically halves the overhead you see from the
page allocator.

2) RX page cache disabled, one packet per page.

This should stress the page allocator.

3) Huge RX page cache, one packet per page.

A driver level page-cache will look nice, as long as it "works".

I verified that it worked in the experiment.

Drivers usually have no option other than to base their recycle
facility on the page-refcnt (as there is no destructor callback), which
implies that packets/pages need to be returned quickly enough for it
to work.

Yes, that's how our current default (small) RX page-cache is
implemented. Unfortunately, the timing conditions needed for a good
reuse rate are not always met.
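
The timing dependence can be illustrated with a toy model. This is only a
sketch under stated assumptions, not the mlx5 implementation: the cache is
a FIFO ring, only the head entry is checked, and a page may be reused only
when the driver holds the sole reference. All class and function names are
hypothetical, and `hold` stands for how many packets the stack keeps in
flight before releasing the oldest one.

```python
from collections import deque


class Page:
    def __init__(self):
        self.refcount = 1  # the driver's own reference


class RefcountRecycleCache:
    """Toy RX page cache whose reuse is gated on the page refcount."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.ring = deque()
        self.recycled = 0
        self.allocated = 0

    def get_page(self):
        # Reuse is safe only if nobody but the driver references the head.
        if self.ring and self.ring[0].refcount == 1:
            self.recycled += 1
            return self.ring.popleft()
        self.allocated += 1  # fall back to the page allocator
        return Page()

    def put_page(self, page):
        if len(self.ring) < self.capacity:
            self.ring.append(page)  # keep the driver's reference
        else:
            page.refcount -= 1      # cache full: drop the driver's reference


def refcount_recycle_rate(capacity, hold, packets=10_000):
    """Fraction of RX buffers served from the cache."""
    cache = RefcountRecycleCache(capacity)
    in_flight = deque()
    for _ in range(packets):
        page = cache.get_page()
        page.refcount += 1              # reference taken by the stack (SKB)
        in_flight.append(page)
        cache.put_page(page)
        if len(in_flight) > hold:       # the stack releases the oldest page
            in_flight.popleft().refcount -= 1
    return cache.recycled / packets


if __name__ == "__main__":
    print("pages returned quickly:", refcount_recycle_rate(capacity=64, hold=8))
    print("pages returned slowly: ", refcount_recycle_rate(capacity=8, hold=64))
```

In this model the cache recycles almost every page when the stack releases
pages before they reach the head of the ring, and falls back to the page
allocator for most packets when it does not.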

All page allocations are of order 0.

NIC: Connectx-5 100 Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Test:
128 TCP streams (using super_netperf).
Varying the number of RX queues.
HW LRO OFF, GRO ON, MTU 1500.

TCP streams with GRO are actually a good stress test for the page
allocator (or the driver's page-recycle cache), as Eric Dumazet has made
some nice optimizations that (in most situations) cause us to quickly
free/recycle the SKB (coming from the driver) and store the pages in
one SKB. This causes us to hit the SLUB fastpath for the SKBs, but once
the pages need to be freed, this stresses the page allocator more.

Yep, bulking would help here, as you mention below.

Also be aware that with TCP flows, the packets are likely delivered
into a socket that is consumed on another CPU.  Thus, the pages are
allocated on one CPU and freed on another. AFAIK this stresses the
order-0 PCP (Per-Cpu-Pages) cache.

Good point.
Do you know of any tool/kernel counters that help observe and quantify 
this behavior?
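
A toy model of why the cross-CPU pattern hurts, as a sketch only. It
assumes each CPU has a per-CPU page list that refills `batch` pages from
the zone when empty and drains `batch` pages back once it holds `high`
pages, and that every refill or drain takes the shared zone lock. The
values high=186 and batch=31 are assumed typical defaults, and all names
are hypothetical.

```python
class Zone:
    """Shared zone; counts how often its lock would be taken."""

    def __init__(self):
        self.lock_acquisitions = 0


class PerCpuPages:
    """Simplified per-CPU order-0 page list."""

    def __init__(self, zone, high=186, batch=31):
        self.zone = zone
        self.high = high
        self.batch = batch
        self.count = 0

    def alloc(self):
        if self.count == 0:
            self.zone.lock_acquisitions += 1  # refill from the buddy lists
            self.count += self.batch
        self.count -= 1

    def free(self):
        self.count += 1
        if self.count >= self.high:
            self.zone.lock_acquisitions += 1  # drain back to the buddy lists
            self.count -= self.batch


def zone_lock_hits(pages, cross_cpu):
    """Zone lock acquisitions for `pages` alloc/free pairs."""
    zone = Zone()
    rx_cpu = PerCpuPages(zone)
    consumer_cpu = PerCpuPages(zone) if cross_cpu else rx_cpu
    for _ in range(pages):
        rx_cpu.alloc()        # driver refills its RX ring
        consumer_cpu.free()   # socket reader frees the page
    return zone.lock_acquisitions


if __name__ == "__main__":
    print("same CPU: ", zone_lock_hits(31_000, cross_cpu=False))
    print("cross CPU:", zone_lock_hits(31_000, cross_cpu=True))
```

When allocation and free happen on the same CPU, the list is refilled
once and then serves every request locally. In the cross-CPU case the
allocating CPU keeps refilling and the freeing CPU keeps draining, so the
zone lock is taken roughly twice per batch of pages, and with many RX
rings all of those acquisitions contend on the same lock.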

Observe: BW as a function of num RX queues.

Results:

Driver #1:
#rings	BW (Mbps)
1	23,813
2	44,086
3	62,128
4	78,058
6	94,210 (linerate)
8	94,205 (linerate)
12	94,202 (linerate)
16	94,191 (linerate)

Driver #2:
#rings	BW (Mbps)
1	18,835
2	36,716
3	50,521
4	61,746
6	63,637
8	60,299
12	51,048
16	43,337

Driver #3:
#rings	BW (Mbps)
1	19,316
2	44,850
3	69,549
4	87,434
6	94,342 (linerate)
8	94,350 (linerate)
12	94,327 (linerate)
16	94,327 (linerate)

Insights:
Major degradation between #1 and #2, not getting anywhere close to
line rate!
Degradation is fixed between #2 and #3.
This is because the page allocator cannot sustain the higher allocation
rate.
In #2, we also see that the addition of rings (cores) reduces BW (!!),
as a result of increasing congestion over shared resources.
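
The negative scaling in #2 can be quantified directly from the numbers
above. The sketch below only reprocesses the Driver #2 table; "efficiency"
is an ad-hoc metric (measured BW divided by ring count times the
single-ring BW) introduced here for illustration.

```python
# Driver #2 results copied from the table above: (rings, BW in Mbps).
DRIVER2 = [(1, 18835), (2, 36716), (3, 50521), (4, 61746),
           (6, 63637), (8, 60299), (12, 51048), (16, 43337)]


def scaling(results):
    """Return (rings, bw, bw_per_ring, efficiency) for each row."""
    base_rings, base_bw = results[0]
    per_ring_base = base_bw / base_rings
    rows = []
    for rings, bw in results:
        rows.append((rings, bw, bw / rings, bw / (rings * per_ring_base)))
    return rows


if __name__ == "__main__":
    print("rings  BW(Mbps)  Mbps/ring  efficiency")
    for rings, bw, per_ring, eff in scaling(DRIVER2):
        print(f"{rings:5d}  {bw:8d}  {per_ring:9.0f}  {eff:10.2f}")
```

Total BW peaks at 6 rings and then drops, while per-ring efficiency falls
steadily as rings are added.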

Congestion in this case is very clear.
When monitored in perf top:
85.58% [kernel] [k] queued_spin_lock_slowpath

Well, we obviously need to know the caller of the spin_lock.  In this
case it is likely the page allocator lock.  It could also be the TCP
socket locks, but given GRO is enabled, they should be hit much less.

It is the page allocator lock.
I verified this based on Andi's suggestion, see other mail.

It's nice to have the option to dynamically play with the parameter.
But maybe we should also think of changing the default fraction 
guaranteed to the PCP, so that unaware admins of networking servers 
would also benefit.

I think that page allocator issues should be discussed separately:
1) Rate: Increase the allocation rate on a single core.
2) Scalability: Reduce congestion and sync overhead between cores.

Yes, but this is no small task.  It is on my TODO-list (emacs org-mode),
but I have other tasks that have higher priority atm.  I'll be working
on XDP_REDIRECT for the next many months.  I am currently trying to
convince people that we should add an explicit packet-page return/free
callback (which would avoid many of these issues).

This is clearly the current bottleneck in the network stack receive
flow.

I know about some efforts that were made in the past two years.
For example the ones from Jesper et al.:

- Page-pool (not accepted AFAIK).

The page-pool has many purposes:
  1. generic page-cache for drivers,
  2. keep pages DMA-mapped,
  3. make it easier for drivers to change the RX-ring memory model.

From an MM point of view, the page pool is just a destructor callback
that can "steal" the page.

If I can convince XDP_REDIRECT to use an explicit destructor callback,
then I almost get what I need, except for the generic part. The normal
network path will not see the benefit, though, so it would not help your
use-case, I guess.

I see.
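
To make the "destructor callback that steals the page" idea concrete, here
is a minimal sketch. It is not the proposed page-pool API: the hook name
`on_last_put` and both classes are hypothetical, and the model assumes
ownership of the page passes entirely to the stack on receive.

```python
from collections import deque


class PoolPage:
    def __init__(self, on_last_put=None):
        self.refcount = 1
        self.on_last_put = on_last_put  # hypothetical destructor hook

    def put(self):
        self.refcount -= 1
        if self.refcount == 0 and self.on_last_put is not None:
            self.on_last_put(self)      # give the owner a chance to keep it


class CallbackPool:
    """Toy page pool that reclaims pages through a release callback."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.free_pages = []
        self.allocated = 0
        self.recycled = 0

    def get_page(self):
        if self.free_pages:
            self.recycled += 1
            page = self.free_pages.pop()
        else:
            self.allocated += 1         # fall back to the page allocator
            page = PoolPage(on_last_put=self._reclaim)
        page.refcount = 1
        return page

    def _reclaim(self, page):
        if len(self.free_pages) < self.capacity:
            self.free_pages.append(page)  # "steal" the page back
        # otherwise the page would return to the page allocator


def callback_recycle_rate(capacity, hold, packets=10_000):
    """Fraction of RX buffers served from the pool."""
    pool = CallbackPool(capacity)
    in_flight = deque()
    for _ in range(packets):
        in_flight.append(pool.get_page())  # page handed to the stack
        if len(in_flight) > hold:
            in_flight.popleft().put()      # stack drops the last reference
    return pool.recycled / packets


if __name__ == "__main__":
    print(callback_recycle_rate(capacity=128, hold=64))
```

Because the page comes back at the moment its last reference is dropped,
the reuse rate in this model no longer depends on how quickly the stack
releases pages, only on the pool being large enough to absorb them.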

- Page-allocation bulking.

Note that page-allocator bulking would still be needed by the
page-pool and other page-cache facilities. We should implement it
regardless of the page_pool.

I agree.
It fits perfectly with our Striding RQ feature, in which each RX 
descriptor is relatively large and serves multiple received packets, 
requiring the allocation of many order-0 pages.

Without a page pool facility to hide the use of page bulking, you
could use page-bulk-alloc in the driver RX-ring refill, find where TCP
frees the GRO packets, and do page-bulk-free there.

Exactly.

- Optimize order-0 allocations in Per-Cpu-Pages.

There is a need to optimize PCP some more for the single-core XDP
performance target (~14Mpps).  I guess the easiest way around this is
to implement/integrate a page bulk API into PCP.

The TCP-GRO use-case you are hitting is a different bottleneck.
It is a multi-CPU parallel workload that exceeds the PCP cache size
and causes you to hit the page buddy allocator.

Indeed, I verified that.

I wonder if you could "solve"/mitigate the issue if you tune the size
of the PCP cache?
AFAIK it only keeps 128 pages cached per CPU... I know you can see this
via a proc file, but I cannot remember which(?).  And I'm not sure how
you tune this(?)

/proc/sys/vm/percpu_pagelist_fraction
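
A sketch of how the knob relates to the per-CPU limits, and of how the
current limits can be read back. It assumes the behaviour of kernels from
this period as I understand it: with a non-zero fraction, each per-CPU
list's `high` mark is the zone's managed pages divided by the fraction,
and `batch` is high/4 capped at PAGE_SHIFT * 8. It also assumes 4 KiB
pages and that /proc/zoneinfo lists `count`, `high` and `batch` per CPU
under `pagesets`. The helper names are hypothetical.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages


def pcp_limits(managed_pages, fraction):
    """(high, batch) of a per-CPU list for a given non-zero fraction."""
    high = managed_pages // fraction
    batch = max(1, high // 4)
    if high // 4 > PAGE_SHIFT * 8:
        batch = PAGE_SHIFT * 8
    return high, batch


def parse_pagesets(zoneinfo_text):
    """Map ((node, zone), cpu) -> {'count', 'high', 'batch'}."""
    result = {}
    zone = cpu = None
    for line in zoneinfo_text.splitlines():
        m = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)", line)
        if m:
            zone, cpu = (int(m.group(1)), m.group(2)), None
            continue
        m = re.match(r"\s*cpu:\s*(\d+)", line)
        if m and zone is not None:
            cpu = int(m.group(1))
            result[(zone, cpu)] = {}
            continue
        # Per-CPU fields carry a colon; zone watermarks ("high  N") do not.
        m = re.match(r"\s*(count|high|batch):\s*(\d+)", line)
        if m and cpu is not None:
            result[(zone, cpu)][m.group(1)] = int(m.group(2))
    return result


if __name__ == "__main__":
    with open("/proc/zoneinfo") as f:
        for key, fields in sorted(parse_pagesets(f.read()).items()):
            print(key, fields)
```

Comparing `count` against `high` on the RX CPUs and on the CPUs that
consume the sockets gives a rough view of how full each per-CPU list is.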

I am not an mm expert, but wanted to raise the issue again, to combine
the efforts and hear from you guys about status and possible directions.

Regarding recent changes... if you have your kernel compiled with
CONFIG_NUMA then the page-allocator is slower (due to keeping
numa-stats), except that this was recently optimized and merged(?)

Yes, it is compiled with CONFIG_NUMA.

Sounds useful, I should get familiar with these stats.
Do you know how to observe them?

What (exact) kernel git tree did you run these tests on?

I had a few mlx5 driver patches on top of:
96e5ae4e76f1 bpf: fix numa_node validation

Many thanks!

Regards,
Tariq

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .