On Wed, Aug 28, 2024 at 05:35:45PM +0200, Andi Shyti wrote: > Hi Sima, > > On Wed, Aug 28, 2024 at 03:47:21PM +0200, Daniel Vetter wrote: > > On Wed, Aug 28, 2024 at 10:20:15AM +0200, Andi Shyti wrote: > > > Hi Sima, > > > > > > first of all, thanks for looking into this series. > > > > > > On Tue, Aug 27, 2024 at 07:31:21PM +0200, Daniel Vetter wrote: > > > > On Fri, Aug 23, 2024 at 03:08:40PM +0200, Andi Shyti wrote: > > > > > Hi, > > > > > > > > > > This patch series introduces static load balancing for GPUs with > > > > > multiple compute engines. It's a lengthy series, and some > > > > > challenging aspects still need to be resolved. > > > > > > > > Do we have an actual user for this, where just reloading the entire driver > > > > (or well-rebinding, if you only want to change the value for a specific > > > > device) with a new module option isn't enough? > > > > > > Yes, we have users for this and this has been already agreed with > > > architects and maintainers. > > > > So my understanding is that for upstream, this only applies to dg2, > > because the other platforms don't have enough CCS engines to make this a > > real issue. > > > > Do we really have upstream demand for this feature on dg2 only? > > That's my understanding. > > > Also how hard would it be to make these users happy with xe-on-dg2 in > > upstream instead? > > I don't know this, I think the user is already on i915. > > > > Why are you saying that we are reloading/rebinding the driver? > > > > That's the other alternate solution. > > But that's not how XE does it, though. > > The use case is that userspace has an environment variable that > they change ondemand for choosing the CCS mode. They want to > change the value of that variable on the fly and, as we are only > adding or removing a few engines, this is done without reprobing > the whole driver. > > In a previous implementation (from where both I and Niranjana for > XE took inspiration) the CCS mode was passed during compute > execbuf. > > > > I'm only removing the exposure of user engines, which is > > > basically a flag in the engines data structure. > > > > > > > There's some really gnarly locking and lifetime fun in there, and it needs > > > > a corresponding justification. > > > > > > What locking are you referring about? > > > > > > I only added one single mutex that has a comment and a > > > justification. If you think that's not enough, I can of course > > > improve it (please note that the changes have a good amount of > > > comments and I tried to be aso more descriptive as I could). > > > > > > When I change the engines configurations only for the compute > > > engines and only for DG2 platforms, I need to make sure that no > > > other user is affected by the change. Thus I need to make sure > > > that access to some of the strucures are properly serialized. > > > > > > > Which needs to be enormous for this case, > > > > meaning actual customers willing to shout on dri-devel that they really, > > > > absolutely need this, or their machines will go up in flames. > > > > Otherwise this is a nack from me. > > > > > > Would you please tell me why are you nacking the patch? So that I > > > address your comments for v4? > > > > So for one, this is substantially more flexible than the solution merged > > into xe. And the patch set doesn't explain why (the commit messages > > actualy describe the design xe has). > > I think in XE we might have missed a few things and my plan is to > check the XE implementation once I'm done with i915 (I was one of > the XE reviewers). And, many of the things in XE are so different > that the solution can't be taken as it is. Please fix XE first. XE is supposed to lead here with all new platform support and uapi additions, i915-gem is supposed to be completely in feature freeze. And exceptions need really good reasons. > > That does not inspire confidence at all. > > Consider that most of the patches are refactoring, only the last > patch is doing the real job. That's because the first workaround > was already merged a while ago. While XE didn't need the > refactorings I made. > > > Second, I don't think anyone understands the entire engine/ctx locking > > design in i915-gem. And the fix for that was to make as much as absolutely > > possible immutable. Yes the implementation looks correct, but when I > > looked at the much, much simpler xe implementation I'm pretty sure I've > > found an issue there too. Here I can't even tell. > > The locking is fairly simple, when the user wants to set a > specific CCS mode, I take the wakrefe lock and I check no one is > holding it. This way I am sure that I am the only user of the GPU > (otherwise the GPU would be off). > > I added one single lock to be used for the for_each_uabi_engine. > It's not really required but I really want to be sure that I am > not changing the CCS mode while someone else is using the uabi > engines. > > I'm also adding Joonas in Cc with whom I discussed many details > of the implementation. I would really appreaciate to know what > exactly is wrong here and what are the necessary changes needed > to get the series merged. This is i915-gem. Locking is never simple here, and I think that entire journey on the mmap confusion shows how little people understand how much complexity and confusion and i915-gem being special there really is. You're really not making a good case that you understand this stuff here, which really doesn't make me enthusiastic about considering an uapi extension for i915-gem. Cheers, Sima -- Simona Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch