Re: [PATCH v5 13/17] platform/x86/amd/pmf: Add PMF-AMDGPU get interface

Hans de Goede <hdegoede@xxxxxxxxxx> · Fri, 17 Nov 2023 12:31:36 +0100

Hi,

On 11/17/23 12:08, Shyam Sundar S K wrote:
> Adding AMDGPU folks (Alex and Christian) - I had to drop them as the
> changes from amdgpu were moved to PMF driver.
> 
> Hi Hans,
> 
> 
> On 11/17/2023 4:18 PM, Hans de Goede wrote:
>> Hi Shyam,
>>
>> On 11/17/23 09:07, Shyam Sundar S K wrote:
>>> In order to provide GPU inputs to TA for the Smart PC solution to work, we
>>> need to have interface between the PMF driver and the AMDGPU driver.
>>>
>>> Add the initial code path for get interface from AMDGPU.
>>>
>>> Co-developed-by: Mario Limonciello <mario.limonciello@xxxxxxx>
>>> Signed-off-by: Mario Limonciello <mario.limonciello@xxxxxxx>
>>> Signed-off-by: Shyam Sundar S K <Shyam-sundar.S-k@xxxxxxx>

<snip>

>>> diff --git a/drivers/platform/x86/amd/pmf/tee-if.c b/drivers/platform/x86/amd/pmf/tee-if.c
>>> index 81b1bd356e83..82ee2b1c627f 100644
>>> --- a/drivers/platform/x86/amd/pmf/tee-if.c
>>> +++ b/drivers/platform/x86/amd/pmf/tee-if.c
>>> @@ -10,6 +10,7 @@
>>>  
>>>  #include <linux/debugfs.h>
>>>  #include <linux/tee_drv.h>
>>> +#include <linux/thermal.h>
>>>  #include <linux/uuid.h>
>>>  #include "pmf.h"
>>>  
>>> @@ -422,6 +423,60 @@ static void amd_pmf_tee_deinit(struct amd_pmf_dev *dev)
>>>  	tee_client_close_context(dev->tee_ctx);
>>>  }
>>>  
>>> +static int amd_pmf_gpu_get_cur_state(struct thermal_cooling_device *cooling_dev,
>>> +				     unsigned long *state)
>>> +{
>>> +	struct backlight_device *bd;
>>> +
>>> +	if (acpi_video_get_backlight_type() != acpi_backlight_native)
>>> +		return -ENODEV;
>>> +
>>> +	bd = backlight_device_get_by_type(BACKLIGHT_RAW);
>>> +	if (!bd)
>>> +		return -ENODEV;
>>> +
>>> +	*state = backlight_get_brightness(bd);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static int amd_pmf_gpu_get_max_state(struct thermal_cooling_device *cooling_dev,
>>> +				     unsigned long *state)
>>> +{
>>> +	struct backlight_device *bd;
>>> +
>>> +	if (acpi_video_get_backlight_type() != acpi_backlight_native)
>>> +		return -ENODEV;
>>> +
>>> +	bd = backlight_device_get_by_type(BACKLIGHT_RAW);
>>> +	if (!bd)
>>> +		return -ENODEV;
>>> +
>>> +	if (backlight_is_blank(bd))
>>> +		*state = 0;
>>> +	else
>>> +		*state = bd->props.max_brightness;
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static const struct thermal_cooling_device_ops bd_cooling_ops = {
>>> +	.get_max_state = amd_pmf_gpu_get_max_state,
>>> +	.get_cur_state = amd_pmf_gpu_get_cur_state,
>>> +};
>>
>> This is still the wrong thing to do. The new PMF code MUST only
>> be a *consumer* of the thermal_cooling_device API.
>>
>> The thermal-cooling device for the backlight itself MUST be
>> registered by the amdgpu driver.
>>
>> I believe that the correct fix for this is to add a new flag to
>> the backlight_properties struct;
>> And then make backlight_device_register() register
>> a thermal_cooling_device for the backlight when this new flag is set.
>>
>> This way we get a nice generic way to use backlight class devices
>> as thermal cooling devices and you can make the amdgpu driver
>> register the thermal cooling device by simply adding the new flag
>> to the backlight_properties struct used in the amdgpu driver.
> 
> Agreed. I am also of the same opinion.

So the change to the AMDGPU driver here would just be setting
this one new flag in the backlight_properties struct.

Alex, Christian, would that be acceptable to you ?

>>> +static int amd_pmf_get_gpu_handle(struct pci_dev *pdev, void *data)
>>> +{
>>> +	struct amd_pmf_dev *dev = data;
>>> +
>>> +	if (pdev->vendor == PCI_VENDOR_ID_ATI && pdev->devfn == 0) {
>>> +		dev->gfx_data.gpu_dev = pdev;
>>> +		return 1; /* Stop walking */
>>> +	}
>>> +
>>> +	return 0; /* Continue walking */
>>> +}
>>> +
>>>  int amd_pmf_init_smart_pc(struct amd_pmf_dev *dev)
>>>  {
>>>  	int ret;
>>> @@ -433,10 +488,30 @@ int amd_pmf_init_smart_pc(struct amd_pmf_dev *dev)
>>>  	INIT_DELAYED_WORK(&dev->pb_work, amd_pmf_invoke_cmd);
>>>  	amd_pmf_set_dram_addr(dev);
>>>  	amd_pmf_get_bios_buffer(dev);
>>> +
>>>  	dev->prev_data = kzalloc(sizeof(*dev->prev_data), GFP_KERNEL);
>>>  	if (!dev->prev_data)
>>>  		return -ENOMEM;
>>>  
>>> +	pci_walk_bus(dev->root->bus, amd_pmf_get_gpu_handle, dev);
>>> +	if (dev->gfx_data.gpu_dev) {
>>> +		dev->drm_dev = pci_get_drvdata(dev->gfx_data.gpu_dev);
>>> +		if (!dev->drm_dev)
>>> +			return -EINVAL;
>>
>> You cannot just call pci_get_drvdata() on a device for which you
>> are not the driver. You have no idea of the lifetime of this device,
>> the driver may unbind and release the memory pci_get_drvdata()
>> points to, leaving this code with a dangling pointer which will
>> crash the kernel the first time you try to use it.
>>
>> Also since you are not the owner you MUST not assume any specific
>> type for this memory, you cannot be sure this actually is of
>> the type drm_device. Basically you MUST not touch another
>> driver's drvdata *at all*.
>>
>> The proper way to fix this would be to add some API to the DRM
>> subsystem to query the information you are looking for form
>> the DRM subsystem.
>>
>> Poking directly inside other driver's internals is NOT acceptable,
>> NACK for this patch.
>>
> 
> I am inline with your remarks, but I could find a way where the
> thermal driver registration, handling the backlight without having the
> changes in the amdgpu driver very tricky.

As mentioned above I think there should be generic thermal cooling
device support added to drivers/video/backlight/backlight.c, then
the amdgpu code just needs to pass a flag when registering
the backlight to enable this.

> Like you said, I am also of the same opinion that PMF driver should
> call the DRM/GPU subsystem APIs (like it does with other subsystems).
> 
> But Christian had concerns on adding all of these into the GPU driver.
> So I had to roll back these into the PMF driver, and hence you see a
> pci_get_drvdata() call.

Ok, so I can see how this AMD PMF code is all kinda special
and how the DRM folks don't want to have to add APIs just for
that. But IMHO calling pci_get_drvdata() on an unowned
device is completely unacceptable.

At a minimum we need life-cycle management for the drm_device
which the PMF code is using, something like:

struct drm_device *drm_device_find(const void *data,
   int (*match)(struct drm_device *dev, const void *data));

which works similar to class_find_device() and returns
a reference to the drm_device for which match has returned 0
(which also stops it from looping over devices).

Combined with a:

void drm_device_put(struct drm_device *dev);

for when the PMF code is done with the device.

This way the PMF code can safely get a reference to drm_device
and release it when it is done. Rather then just getting
some random pointer which may or not actually be a drm_device
and the lifetime of which is not guaranteed in anyway.

E.g. if the PMF driver loads before amdgpu then
pci_get_drvdata() will just return NULL.

And as mentioned if the amdgpu driver gets unbound after
the PMF code has called  pci_get_drvdata() the PMF driver
now has a dangling pointer.

So IMHO adding: drm_device_find() + drm_device_put()
to the DRM core are minimum which is necessary here.

If the PMF code then ends up doing things like
drm_for_each_connector_iter() on the gotten drm_device
referemce so be it. But we must make sure we have
a properly lifecycle managed reference first and
pci_get_drvdata() does not give us that.

> For the sake of simplicity, I can drop patches 13/17 and 14/17 from
> the series and send them separately later.

Yes I think that dropping the GPU related patches for
now would be best.

Regards,

Hans