Re: [PATCH net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node

Leon Romanovsky <leon@xxxxxxxxxx> · Tue, 8 Feb 2022 11:32:23 +0200

On Tue, Feb 08, 2022 at 10:10:55AM +0100, Stefan Raspl wrote:
> On 2/7/22 14:49, Leon Romanovsky wrote:
> > On Mon, Feb 07, 2022 at 05:59:58PM +0800, Tony Lu wrote:
> > > On Mon, Jan 31, 2022 at 09:20:52AM +0200, Leon Romanovsky wrote:
> > > > On Mon, Jan 31, 2022 at 03:03:00AM +0800, Tony Lu wrote:
> > > > > Currently, pages are allocated in the process context, whose NUMA node
> > > > > may differ from the ibdev's, which is not the best policy for performance.
> > > > > 
> > > > > Applications will generally perform best when the processes are
> > > > > accessing memory on the same NUMA node. When numa_balancing is enabled
> > > > > (which most OS distributions do by default), it moves tasks closer to
> > > > > the memory of the sndbuf or rmb and the ibdev; meanwhile, the IRQs of the
> > > > > ibdev are usually bound to the same node. This reduces the latency when
> > > > > accessing remote memory.
> > > > 
> > > > This is very subjective and depends on the specific test. I would expect
> > > > the application to control NUMA memory policies (set_mempolicy(), ...)
> > > > by itself, without the kernel setting the NUMA node.
> > > > 
> > > > The various *_alloc_node() APIs are meant for in-kernel allocations
> > > > where the user can't control the memory policy.
> > > > 
> > > > I don't know SMC-R well enough, but judging from your description, this
> > > > allocation is controlled by the application.
> > > 
> > > The original design of SMC doesn't handle memory allocation across
> > > different NUMA nodes, and the application can't control the NUMA policy
> > > in SMC.
> > > 
> > > It allocates memory on the NUMA node of the process context, which is
> > > determined by the scheduler. If the application process runs on NUMA
> > > node 0, SMC allocates on node 0, and so on; it all depends on the
> > > scheduler. If the RDMA device is attached to node 1 and the process runs
> > > on node 0, memory is allocated on node 0.
> > > 
> > > This patch tries to allocate memory on the same NUMA node as the RDMA
> > > device. Applications can't know the current node of the RDMA device. The
> > > scheduler knows the node of the memory, and can let applications run on
> > > the same node as the memory and the RDMA device.
> > 
> > I don't know; everything explained above is controlled through memory
> > policy, where the application needs to run on the same node as the ibdev.
> 
> The purpose of SMC-R is to provide a drop-in replacement for existing TCP/IP
> applications. The idea is to avoid almost any modification to the
> application, just switch the address family. So while what you say makes a
> lot of sense for applications that intend to use RDMA, in the case of SMC-R
> we can safely assume that most if not all applications running it assume
> they get connectivity through a non-RDMA NIC. Hence we cannot expect the
> applications to think about aspects such as NUMA, and we should do the right
> thing within SMC-R.

And here is the problem: you are doing the right thing only for a very
specific and narrow use case, where the application and the ibdev are on
the same node. That does not hold for multi-core systems, as the application
will be scheduled on the less loaded node (in a very simplistic form).

In the general case, the application will get CPU and memory based on
scheduler heuristics, since you don't use a memory policy to restrict it.
The assumption that allocations need to be close to the ibdev rather than to
the application can lead to worse performance.
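
For illustration only, and not part of the patch under discussion: on many
systems userspace can look up an RDMA device's NUMA node through sysfs before
it chooses a memory policy. A minimal sketch, assuming the device is
PCI-backed and exposes /sys/class/infiniband/<ibdev>/device/numa_node; the
helper names read_numa_node() and ibdev_numa_path() are hypothetical:

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Read a NUMA node id from a sysfs-style attribute file.
 * Returns the node id (>= 0), or -1 if the file is missing, cannot be
 * parsed, or reports "no node" (sysfs itself uses -1 for that case).
 */
static int read_numa_node(const char *path)
{
	FILE *f = fopen(path, "r");
	int node = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &node) != 1)
		node = -1;
	fclose(f);
	return node < 0 ? -1 : node;
}

/*
 * Build the assumed sysfs path of an RDMA device's numa_node attribute.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int ibdev_numa_path(const char *ibdev, char *buf, size_t len)
{
	int n = snprintf(buf, len,
			 "/sys/class/infiniband/%s/device/numa_node", ibdev);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The returned node could then be fed into set_mempolicy() or a CPU affinity
mask by whoever launches the application; a result of -1 means the platform
did not report a node and no placement decision should be derived from it.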

Thanks

> 
> Ciao,
> Stefan