Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range

Daniel Vetter <daniel@xxxxxxxx> · Mon, 6 May 2024 15:04:15 +0200

On Sat, May 04, 2024 at 11:03:03AM +1000, Dave Airlie wrote:
> > Let me know if this understanding is correct.
> >
> > Or what would you like to do in such situation?
> >
> > >
> > > Not sure how it is really a good idea.
> > >
> > > Adaptive locality of memory is still an unsolved problem in Linux,
> > > sadly.
> > >
> > > > > However, the migration stuff should really not be in the driver
> > > > > either. That should be core DRM logic to manage that. It is so
> > > > > convoluted and full of policy that all the drivers should be working
> > > > > in the same way.
> > > >
> > > > Completely agreed. Moving migration infrastructures to DRM is part
> > > > of our plan. We want to first do a proof of concept with the xekmd
> > > > driver, then move helpers and infrastructure to DRM. A driver should be
> > > > as easy as implementing a few callback functions for device specific page
> > > > table programming and device migration, and calling some DRM common
> > > > functions during gpu page fault.
> > >
> > > You'd be better to start out this way so people can look at and
> > > understand the core code on its own merits.
> >
> > The two-step approach was agreed with the DRM maintainers, see bullet 4) here:
> > https://lore.kernel.org/dri-devel/SA1PR11MB6991045CC69EC8E1C576A715925F2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> 
> After this discussion and the other cross-device HMM stuff I think we
> should probably push more for common code up front. I think doing this in
> a driver without considering the bigger picture might not end up
> extractable, and then I fear the developers will just move on to other
> things due to management pressure to land features over correctness.
> 
> I think we have enough people on the list that can review this stuff,
> and even if the common code ends up being a little xe specific,
> iterating it will be easier outside the driver, as we can clearly
> demarcate what is inside and outside.

tl;dr: Yeah, concurring.

I think, like with the gpu vma stuff, we should at least aim for the core
data structures and, more importantly, the locking design and how it
interacts with core mm services to be common code.

I read through amdkfd and I think that one is warning enough that this
area is one of those cases where going with common code aggressively is
much better. It will be buggy in terrible "how do we ever get out of this
design corner again?" ways no matter what, but with common code there will
at least be all of dri-devel and hopefully some mm folks involved in
sorting things out.

In most other areas it's indeed better to explore the design space with a
few drivers before going with common code, at the cost of having some
really terrible driver code in upstream. But here the cost of some really
bad design in drivers is just too high imo.
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch