Re: [RFC] Make use of non-dynamic dmabuf in RDMA

Christian König <christian.koenig@xxxxxxx> · Wed, 25 Aug 2021 17:14:06 +0200

On 25.08.21 at 16:47, Jason Gunthorpe wrote:
On Wed, Aug 25, 2021 at 03:51:14PM +0200, Christian König wrote:
On 25.08.21 at 14:38, Jason Gunthorpe wrote:
On Wed, Aug 25, 2021 at 02:27:08PM +0200, Christian König wrote:
On 25.08.21 at 14:18, Jason Gunthorpe wrote:
On Wed, Aug 25, 2021 at 08:17:51AM +0200, Christian König wrote:

The only real option where you could do P2P with buffer pinning are those
compute boards where we know that everything is always accessible to
everybody and we will never need to migrate anything. But even then you want
some mechanism like cgroups to take care of limiting this. Otherwise any
runaway process can bring down your whole system.
Why? It is not the pin that is the problem, it was allocating GPU
dedicated memory in the first place. Pinning it just changes the
sequence to free it. No different than CPU memory.
Pinning makes the memory un-evictable.

In other words as long as we don't pin anything we can support as many
processes as we want until we run out of swap space. Swapping sucks badly
because your applications become pretty much unusable, but you can easily
recover from it by killing some process.

With pinning on the other hand somebody sooner or later receives an -ENOMEM
or -ENOSPC and there is no guarantee that this goes to the right process.
It is not really different - you have the same failure mode once the
system runs out of swap.

This is really the kernel side trying to push a policy to the user
side that the user side doesn't want.
But that is still the right thing to do as far as I can see. See, userspace
also doesn't want proper process isolation, since it takes extra time.
Why? You are pushing a policy of resource allocation/usage which
more properly belongs in userspace.

Kernel development is driven by exposing the hardware functionality in a
safe and manageable manner to userspace, and not by fulfilling userspace
requirements.
I don't agree with this, that is a 1980's view of OS design. So much
these days in the kernel is driven entirely by boutique userspace
requirements and is very much not about the classical abstract role of
an OS.

But it's still true nevertheless. Otherwise you would have libraries
for filesystem accesses and no system security to speak of.

Dedicated systems are a significant use case here and should be
supported, even if the same solution wouldn't be applicable to someone
running a desktop.
And exactly that approach is not acceptable.
We have knobs and settings all over the place to allow Linux to
support a broad set of use cases from Android to servers, to HPC. So
long as they can co-exist and the various optional modes do not
critically undermine the security of the kernel, it is well in line
with how things have been evolving in the last 15 years.

Yeah, that's exactly what I'm talking about by adding cgroups or similar.
You need a knob to control this.

Here you are talking entirely about policy to control memory
allocation, which is already well trodden ground for CPU memory.

There are now endless boutique ways to deal with this, it is a very
narrow view to say that GPU memory is so special and different that
only one way can be the correct/allowed way.

Well I'm not talking about GPU memory in particular here. This is 
mandatory for any memory or, more generally, any resource.

E.g. you are not allowed to pin a large amount of system memory on a
default installation for exactly those reasons as well.

That you can have a knob to disable this behavior for your HPC system is 
perfectly fine, but I think what Dave notes here as well is that this is
most likely not the default behavior we want.

Christian.

Jason