Re: [RFC 00/29] Introduce NVIDIA GPU Virtualization (vGPU) Support

I hope and expect the nova and vgpu_mgr efforts to ultimately converge.

First, for the fw ABI debacle: yes, it is unfortunate that we still don't
have a stable ABI from GSP.  We /are/ working on it, though there isn't
anything to show, yet.  FWIW, I expect the end result will be a much
simpler interface than what is there today, and a stable interface that
NVIDIA can guarantee.

But, for now, we have a timing problem like Jason described:

- We have customers eager for upstream vfio support in the near term,
  and that seems like something NVIDIA can develop/contribute/maintain in
  the near term, as an incremental step forward.

- Nova is still early in its development, relative to nouveau/nvkm.

- From NVIDIA's perspective, we're nervous about the backportability of
  Rust-based components to enterprise kernels in the near term.

- The stable GSP ABI is not going to be ready in the near term.


I agree with what Dave said in one of the forks of this thread, in the context of
NV2080_CTRL_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK_PARAMS:

> The GSP firmware interfaces are not guaranteed stable. Exposing these
> interfaces outside the nvkm core is unacceptable, as otherwise we
> would have to adapt the whole kernel depending on the loaded firmware.
>
> You cannot use any nvidia sdk headers, these all have to be abstracted
> behind things that have no bearing on the API.

Agreed.  Though not infinitely scalable, and not as clean as it would be
in Rust, it seems possible to hide
NV2080_CTRL_VGPU_MGR_INTERNAL_BOOTLOAD_GSP_VGPU_PLUGIN_TASK_PARAMS behind
a C-implemented abstraction layer in nvkm, at least for the short term.
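
A minimal sketch of what such a layer could look like, purely as an
illustration.  Every name below is hypothetical: none of these types or
functions come from nvkm, the GSP SDK headers, or this series, and the
two "firmware versions" are made up.  The idea is only that the client
sees one stable request struct, while the firmware-specific layouts and
their encoders stay private to the core:

```c
/* Hypothetical sketch: every name below is invented for illustration. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stable, firmware-independent request: the only type a client such as
 * vgpu_mgr would see. */
struct vgpu_boot_req {
	uint32_t vgpu_id;
	uint64_t heap_addr;
	uint64_t heap_size;
};

/* Firmware-specific layouts stay private to the core.  These are two
 * made-up versions that order their fields differently. */
struct fw_a_boot_params {
	uint32_t vgpu_id;
	uint64_t heap_addr;
	uint64_t heap_size;
};

struct fw_b_boot_params {
	uint64_t heap_size;
	uint64_t heap_addr;
	uint32_t vgpu_id;
	uint32_t flags;		/* exists only in made-up version B */
};

enum fw_version { FW_A, FW_B };

/* Per-firmware encoder: fills buf, returns bytes written, or 0 if buf
 * is too small. */
typedef size_t (*boot_encode_fn)(const struct vgpu_boot_req *req,
				 void *buf, size_t len);

static size_t encode_fw_a(const struct vgpu_boot_req *req, void *buf,
			  size_t len)
{
	struct fw_a_boot_params p;

	if (len < sizeof(p))
		return 0;
	memset(&p, 0, sizeof(p));
	p.vgpu_id = req->vgpu_id;
	p.heap_addr = req->heap_addr;
	p.heap_size = req->heap_size;
	memcpy(buf, &p, sizeof(p));
	return sizeof(p);
}

static size_t encode_fw_b(const struct vgpu_boot_req *req, void *buf,
			  size_t len)
{
	struct fw_b_boot_params p;

	if (len < sizeof(p))
		return 0;
	memset(&p, 0, sizeof(p));
	p.heap_size = req->heap_size;
	p.heap_addr = req->heap_addr;
	p.vgpu_id = req->vgpu_id;
	memcpy(buf, &p, sizeof(p));
	return sizeof(p);
}

/* Pick the encoder matching the loaded firmware; NULL if unknown. */
static boot_encode_fn boot_encoder_for(enum fw_version v)
{
	switch (v) {
	case FW_A:
		return encode_fw_a;
	case FW_B:
		return encode_fw_b;
	}
	return NULL;
}

/* The single entry point a client would call; it never sees the
 * firmware-specific structs. */
size_t vgpu_boot_encode(enum fw_version v, const struct vgpu_boot_req *req,
			void *buf, size_t len)
{
	boot_encode_fn fn = boot_encoder_for(v);

	return fn ? fn(req, buf, len) : 0;
}
```

In this sketch, supporting a new firmware layout means adding one more
encoder behind vgpu_boot_encode(); the caller's struct does not change.
That is also where the "not infinitely scalable" caveat shows up: each
supported firmware version adds another encoder to maintain.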

Is there a potential compromise where vgpu_mgr starts its life with a
dependency on nvkm, and as things mature we migrate it to instead depend
on nova?


On Thu, Sep 26, 2024 at 11:40:57AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 26, 2024 at 02:54:38PM +0200, Greg KH wrote:
> 
> > That's fine, but again, do NOT make design decisions based on what you
> > can, and can not, feel you can slide by one of these companies to get it
> > into their old kernels.  That's what I take objection to here.
> 
> It is not slide by. It is a recognition that participating in the
> community gives everyone value. If you excessively deny value from one
> side they will have no reason to participate.
> 
> In this case the value is that, with enough light work, the
> kernel-fork community can deploy this code to their users. This has
> been the accepted bargain for a long time now.
> 
> There is a great big question mark over Rust regarding what impact it
> actually has on this dynamic. It is definitely not just backport a few
> hundred upstream patches. There is clearly new upstream development
> work needed still - arch support being a very obvious one.
> 
> > Also always remember please, that the % of overall Linux kernel
> > installs, even counting out Android and embedded, is VERY tiny for these
> > companies.  The huge % overall is doing the "right thing" by using
> > upstream kernels.  And with the laws in place now that % is only going
> > to grow and those older kernels will rightfully fall away into even
> > smaller %.
> 
> Who is "doing the right thing"? That is not what I see, we sell
> server HW to *everyone*. There are a couple sites that are "near"
> upstream, but that is not too common. Everyone is running some kind of
> kernel fork.
> 
> I dislike this generalization you do with % of users. Almost 100% of
> NVIDIA server HW are running forks. I would estimate around 10% is
> above a 6.0 baseline. It is not tiny either, NVIDIA sold like $60B of
> server HW running Linux last year with this kind of demographic. So
> did Intel, AMD, etc.
> 
> I would not describe this as "VERY tiny". Maybe you mean RHEL-alike
> specifically, and yes, they are a diminishing install share. However,
> the hyperscale companies more than make up for that with their
> internal secret proprietary forks :(
> 
> > > Otherwise, let's slow down here. Nova is still years away from being
> > > finished. Nouveau is the in-tree driver for this HW. This series
> > > improves on Nouveau. We are definitely not at the point of refusing
> > > new code because it is not written in Rust, RIGHT?
> > 
> > No, I do object to "we are ignoring the driver being proposed by the
> > developers involved for this hardware by adding to the old one instead"
> > which it seems like is happening here.
> 
> That is too harsh. We've consistently taken a community position that
> OOT stuff doesn't matter, and yes that includes OOT stuff that people
> we trust and respect are working on. Until it is ready for submission,
> and ideally merged, it is an unknown quantity. Good well meaning
> people routinely drop their projects, good projects run into
> unexpected roadblocks, and life happens.
> 
> Nova is not being ignored, there is dialog, and yes some disagreement.
> 
> Again, nobody here is talking about disrupting Nova. We just want to
> keep going as-is until we can all agree together it is ready to make a
> change.
> 
> Jason


