On Thu, Sep 12, 2024 at 11:58:37AM GMT, Bommu, Krishnaiah wrote:
-----Original Message-----
From: De Marchi, Lucas <lucas.demarchi@xxxxxxxxx>
Sent: Wednesday, September 11, 2024 9:49 PM
To: Bommu, Krishnaiah <krishnaiah.bommu@xxxxxxxxx>
Cc: Vivi, Rodrigo <rodrigo.vivi@xxxxxxxxx>; intel-xe@xxxxxxxxxxxxxxxxxxxxx; intel-
gfx@xxxxxxxxxxxxxxxxxxxxx; Kamil Konieczny <kamil.konieczny@xxxxxxxxxxxxxxx>;
Ceraolo Spurio, Daniele <daniele.ceraolospurio@xxxxxxxxx>; Upadhyay, Tejas
<tejas.upadhyay@xxxxxxxxx>; Tvrtko Ursulin <tursulin@xxxxxxxxxxx>; Joonas
Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>; Nikula, Jani
<jani.nikula@xxxxxxxxx>; Thomas Hellström
<thomas.hellstrom@xxxxxxxxxxxxxxx>; Teres Alexis, Alan Previn
<alan.previn.teres.alexis@xxxxxxxxx>; Winkler, Tomas
<tomas.winkler@xxxxxxxxx>; Usyskin, Alexander
<alexander.usyskin@xxxxxxxxx>; linux-modules@xxxxxxxxxxxxxxx; Luis
Chamberlain <mcgrof@xxxxxxxxxx>
Subject: Re: [PATCH v2] drm: Ensure Proper Unload/Reload Order of MEI
Modules for i915/Xe Driver
+ linux-modules
+ Luis
On Wed, Sep 11, 2024 at 01:00:47AM GMT, Bommu, Krishnaiah wrote:
>
>
>> -----Original Message-----
>> From: De Marchi, Lucas <lucas.demarchi@xxxxxxxxx>
>> Sent: Tuesday, September 10, 2024 9:13 PM
>> To: Vivi, Rodrigo <rodrigo.vivi@xxxxxxxxx>
>> Cc: Bommu, Krishnaiah <krishnaiah.bommu@xxxxxxxxx>; intel-
>> xe@xxxxxxxxxxxxxxxxxxxxx; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; Kamil
>> Konieczny <kamil.konieczny@xxxxxxxxxxxxxxx>; Ceraolo Spurio, Daniele
>> <daniele.ceraolospurio@xxxxxxxxx>; Upadhyay, Tejas
>> <tejas.upadhyay@xxxxxxxxx>; Tvrtko Ursulin <tursulin@xxxxxxxxxxx>;
>> Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>; Nikula, Jani
>> <jani.nikula@xxxxxxxxx>; Thomas Hellström
>> <thomas.hellstrom@xxxxxxxxxxxxxxx>; Teres Alexis, Alan Previn
>> <alan.previn.teres.alexis@xxxxxxxxx>; Winkler, Tomas
>> <tomas.winkler@xxxxxxxxx>; Usyskin, Alexander
>> <alexander.usyskin@xxxxxxxxx>
>> Subject: Re: [PATCH v2] drm: Ensure Proper Unload/Reload Order of MEI
>> Modules for i915/Xe Driver
>>
>> On Tue, Sep 10, 2024 at 11:03:30AM GMT, Rodrigo Vivi wrote:
>> >On Mon, Sep 09, 2024 at 09:33:17AM +0530, Bommu Krishnaiah wrote:
>> >> This update addresses the unload/reload sequence of MEI modules in
>> >> relation to the i915/Xe graphics driver. On platforms where the
>> >> MEI hardware is integrated with the graphics device (e.g.,
>> >> DG2/BMG), the i915/xe driver depends on the MEI modules.
>> >> Conversely, on newer platforms like MTL and LNL, where the MEI
>> >> hardware is separate, this dependency does not exist.
>> >>
>> >> The changes introduced ensure that MEI modules are unloaded and
>> >> reloaded in the correct order based on platform-specific
>> >> dependencies. This is achieved by adding a MODULE_SOFTDEP
>> >> directive to the i915 and Xe module code.
>>
>>
>> can you explain what causes the modules to be loaded today? Also, is
>> this to fix anything related to *loading* order or just unload?
>>
>> >>
>> >> These changes enhance the robustness of MEI module handling across
>> >> different hardware platforms, ensuring that the i915/Xe driver can
>> >> be cleanly unloaded and reloaded without issues.
>> >>
>> >> v2: updated commit message
>> >>
>> >> Signed-off-by: Bommu Krishnaiah <krishnaiah.bommu@xxxxxxxxx>
>> >> Cc: Kamil Konieczny <kamil.konieczny@xxxxxxxxxxxxxxx>
>> >> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@xxxxxxxxx>
>> >> Cc: Lucas De Marchi <lucas.demarchi@xxxxxxxxx>
>> >> Cc: Tejas Upadhyay <tejas.upadhyay@xxxxxxxxx>
>> >> ---
>> >> drivers/gpu/drm/i915/i915_module.c | 2 ++
>> >> drivers/gpu/drm/xe/xe_module.c | 2 ++
>> >> 2 files changed, 4 insertions(+)
>> >>
>> >> diff --git a/drivers/gpu/drm/i915/i915_module.c b/drivers/gpu/drm/i915/i915_module.c
>> >> index 65acd7bf75d0..2ad079ad35db 100644
>> >> --- a/drivers/gpu/drm/i915/i915_module.c
>> >> +++ b/drivers/gpu/drm/i915/i915_module.c
>> >> @@ -75,6 +75,8 @@ static const struct {
>> >> };
>> >> static int init_progress;
>> >>
>> >> +MODULE_SOFTDEP("pre: mei_gsc_proxy mei_gsc");
>> >> +
>> >> static int __init i915_init(void)
>> >> {
>> >> int err, i;
>> >> diff --git a/drivers/gpu/drm/xe/xe_module.c b/drivers/gpu/drm/xe/xe_module.c
>> >> index bfc3deebdaa2..5633ea1841b7 100644
>> >> --- a/drivers/gpu/drm/xe/xe_module.c
>> >> +++ b/drivers/gpu/drm/xe/xe_module.c
>> >> @@ -127,6 +127,8 @@ static void xe_call_exit_func(unsigned int i)
>> >> init_funcs[i].exit();
>> >> }
>> >>
>> >> +MODULE_SOFTDEP("pre: mei_gsc_proxy mei_gsc");
>> >
>> >I'm honestly not very comfortable with this.
>> >
>> >1. This is not true for every device supported by these modules.
>> >2. This is not true for every (and the most basic) functionality of
>> >   these drivers.
>> >
>> >Shouldn't this be done in the mei side?
>>
>> I don't think it's possible to do from the mei side. Would mei depend
>> on both xe and i915 (and thus cause both to be loaded regardless of
>> the platform?). For a runtime dependency like this that depends on
>> the platform, I think the best way would be a weakdep + either a
>> request_module() or something else that causes the module to load (is
>> that what comp_* is doing today?)
>>
>> >
>> >Couldn't we identify at probe time whether they are needed and, if so,
>> >return -EPROBE_DEFER to retry after the mei drivers have probed?
>>
>> I'm not sure this is fixing anything for probe. I think we already
>> wait on the other component to be ready without blocking the rest of
>> the driver functionality.
>>
>> A weakdep wouldn't cause the module to be loaded where it's not
>> needed, but I need some clarification on whether this is trying to fix
>> anything load-related or just unload.
>
>This change is fixing unload.
>During xe load I see the mei_gsc modules get loaded, but they are not
>unloaded when xe is unloaded.
so, first thing: if things are correct in the kernel, we shouldn't need to
**unload** the module after unbinding the device. Why are we unloading xe
and the other modules for tests?
I see a failure while running gta@xe_module_load@reload-no-display, and
these changes are meant to address it. I previously tried to fix this in
IGT, but following the IGT review suggestion I am now trying to fix the
issue in the kernel.
IGT patch: https://patchwork.freedesktop.org/series/137343/
it seems a mistake in igt to try to remove the mei_gsc module.
As a dgfx, it's even worse - what happens if another card is using the
module? What happens if I have a RPL + BMG and i915 driving the former
while xe drives the latter?
You shouldn't need to remove it. This works for me with BMG (unbinding
all drivers for simplicity since we are removing the module... but if we
don't remove the module, then we can test with the only device we care
about):
# modprobe xe
# unbind
Unbinding /sys/bus/pci/devices/0000:00:02.0 (8086:a782)... ok
Unbinding /sys/bus/pci/devices/0000:03:00.0 (8086:e20b)... ok
# lsmod | grep -e xe -e mei_gsc
xe 3584000 0
drm_gpuvm 45056 1 xe
video 77824 1 xe
i2c_algo_bit 12288 1 xe
drm_ttm_helper 16384 1 xe
gpu_sched 61440 1 xe
drm_suballoc_helper 16384 1 xe
drm_display_helper 270336 1 xe
drm_kunit_helpers 16384 1 xe
drm_buddy 20480 1 xe
ttm 114688 2 drm_ttm_helper,xe
mei_gsc_proxy 16384 0
mei_gsc 12288 0
drm_exec 16384 2 drm_gpuvm,xe
kunit 73728 2 xe,drm_kunit_helpers
drm_kms_helper 241664 4 drm_display_helper,drm_ttm_helper,xe,drm_kunit_helpers
mei_me 65536 3 mei_gsc
mei 167936 7 mei_gsc_proxy,mei_gsc,mei_hdcp,mei_pxp,mei_me
drm 737280 11 gpu_sched,drm_kms_helper,drm_exec,drm_gpuvm,drm_suballoc_helper,drm_display_helper,drm_buddy,drm_ttm_helper,xe,drm_kunit_helpers,ttm
# modprobe -r xe
# modprobe xe probe_display=0
# unbind
Unbinding /sys/bus/pci/devices/0000:00:02.0 (8086:a782)... ok
Unbinding /sys/bus/pci/devices/0000:03:00.0 (8086:e20b)... ok
# modprobe -r xe
# modprobe xe
I didn't check if mei_gsc continues to work after reload, but I guess so
as its refcount is incremented:
mei_gsc 12288 1
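For anyone who wants to script the refcount check above rather than eyeball it, here is a minimal sketch. It assumes the usual lsmod column layout (name, size, "Used by" count, holders) as seen in the listings in this thread; the function name module_refcount is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the "Used by" count (third lsmod column) for exactly one module.
# Reads lsmod-formatted text on stdin so it also works on captured output.
# Exits non-zero when the module is not listed.
module_refcount() {
    local name=$1
    # Exact match on column 1, so "mei_gsc" does not also match "mei_gsc_proxy".
    awk -v m="$name" '$1 == m { print $3; found = 1 } END { exit !found }'
}

# Possible use on a live system (assumes lsmod is installed):
#   lsmod | module_refcount mei_gsc
```

Reading from stdin is a deliberate choice: the same function can be pointed at a saved log from a test run, not only at the running kernel.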
unbind function is this:
function unbind {
    vga="0300"
    display="0380"
    pci_vendor="8086"

    # lspci -n prints: <slot> <class>: <vendor>:<device> [...]
    while read -r pci_slot class devid xxx; do
        sysdev=/sys/bus/pci/devices/0000:$pci_slot
        echo -n "Unbinding $sysdev ($devid)... "
        if [ ! -e "$sysdev/driver" ]; then
            echo "(skip: not bound)"
            continue
        fi
        echo -n auto > "$sysdev/power/control"
        echo -n "0000:$pci_slot" > "$sysdev/driver/unbind"
        echo "ok"
    # Quote the substitution so older bash keeps one device per line.
    done <<<"$(lspci -d ${pci_vendor}::${display} -n; lspci -d ${pci_vendor}::${vga} -n)"
}
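A rebind counterpart to the unbind function above might look like the sketch below, for testing unbind/rebind without removing any module. This is not something posted in the thread: the function name rebind and the SYSFS_ROOT override are assumptions added for illustration. The only kernel interface used is the standard per-driver "bind" file under /sys/bus/pci/drivers/.

```shell
#!/usr/bin/env bash
# Hypothetical helper: attach one PCI device (e.g. 0000:03:00.0) to a driver.
# SYSFS_ROOT defaults to /sys; it is overridable only so the function can be
# exercised against a scratch directory instead of real hardware.
rebind() {
    local pci_addr=$1 driver=$2
    local root=${SYSFS_ROOT:-/sys}
    local drvdir=$root/bus/pci/drivers/$driver

    if [ ! -d "$drvdir" ]; then
        echo "driver $driver is not loaded" >&2
        return 1
    fi
    if [ -e "$root/bus/pci/devices/$pci_addr/driver" ]; then
        echo "$pci_addr is already bound" >&2
        return 0
    fi
    # Writing the PCI address to the driver's bind file triggers a probe.
    echo -n "$pci_addr" > "$drvdir/bind"
}

# Possible use as root, after unbind:
#   rebind 0000:03:00.0 xe
```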
So... for igt: I *think* simply removing the array with modules to
unload first would fix it.
Lucas De Marchi
>root@DUT6127BMGFRD:/home/gta# lsmod | grep xe   ------>>> just after system reboot
>root@DUT6127BMGFRD:/home/gta#
>root@DUT6127BMGFRD:/home/gta# lsmod | grep mei
>mei_hdcp 28672 0
>mei_pxp 16384 0
>mei_me 49152 2
>mei 167936 5 mei_hdcp,mei_pxp,mei_me
>root@DUT6127BMGFRD:/home/gta# lsmod | grep xe
>root@DUT6127BMGFRD:/home/gta#
>root@DUT6127BMGFRD:/home/gta# modprobe xe
>root@DUT6127BMGFRD:/home/gta#
>root@DUT6127BMGFRD:/home/gta# lsmod | grep mei
>mei_gsc_proxy 16384 0
>mei_gsc 12288 1
^ which means there's one user, which
should be xe
>mei_hdcp 28672 0
>mei_pxp 16384 0
>mei_me 49152 3 mei_gsc
>mei 167936 8 mei_gsc_proxy,mei_gsc,mei_hdcp,mei_pxp,mei_me
>root@DUT6127BMGFRD:/home/gta#
>root@DUT6127BMGFRD:/home/gta#
>root@DUT6127BMGFRD:/home/gta#
>root@DUT6127BMGFRD:/home/gta# init 3
>root@DUT6127BMGFRD:/home/gta# echo -n auto > /sys/bus/pci/devices/0000\:03\:00.0/power/control
>root@DUT6127BMGFRD:/home/gta# echo -n "0000:03:00.0" > /sys/bus/pci/drivers/xe/unbind
>root@DUT6127BMGFRD:/home/gta# modprobe -r xe
>root@DUT6127BMGFRD:/home/gta#
>root@DUT6127BMGFRD:/home/gta# lsmod | grep xe
>root@DUT6127BMGFRD:/home/gta# lsmod | grep mei
>mei_gsc_proxy 16384 0
>mei_gsc 12288 0
^ great, so the refcount went to 0,
confirming it was xe. It should go to 0
even before you unload the module,
when unbind.
A couple of points:
1) Why do we care about unloading mei_gsc? Just loading xe
again (or not even unloading it, just unbind/rebind)
should still work if the xe <-> mei_gsc integration is done
correctly.
2) If for some reason we do want to remove the module, then we will
need some work in kernel/module/ to start tracking runtime module
dependencies, i.e. when one module does a module_get(foo->owner), it
would add to a list and output on sysfs together with the holders list.
This way you would be able to track the runtime deps and remove them
if their refcount went to 0 after removing xe.
(2) is doable, but previous attempts were not successful [1]. Is there anything
else needed to make the simpler solution (1) work?
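To illustrate the gap that (2) describes, the sketch below prints what sysfs exposes today for one module: its refcount and its holders list. Holders only cover link-time (symbol) dependencies, so a refcount with no matching holder, like mei_gsc in the listings in this thread, points at a runtime reference that nothing attributes to xe. The function name module_users and the SYSFS_ROOT override are assumptions for illustration. The refcnt file is assumed to exist, which needs a kernel built with module unloading support.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show refcount and holders for one loaded module.
# SYSFS_ROOT defaults to /sys and is overridable for testing on a scratch tree.
module_users() {
    local name=$1
    local moddir=${SYSFS_ROOT:-/sys}/module/$name
    [ -d "$moddir" ] || { echo "$name: not loaded" >&2; return 1; }

    local refcnt holders
    # refcnt may be absent (built-in module, or no unload support).
    refcnt=$(cat "$moddir/refcnt" 2>/dev/null)
    # Each holder is a directory entry named after the module holding us.
    holders=$(ls "$moddir/holders" 2>/dev/null | tr '\n' ' ')
    echo "$name refcnt=${refcnt:-unknown} holders=${holders% }"
}

# Possible use after loading xe:
#   module_users mei_gsc
#   module_users mei_me
```

A refcount above zero with an empty holders list is exactly the case where userspace cannot tell which module to remove first, which is why tracking would have to be added in kernel/module/.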
For the reason behind these changes, please see the review comments on
this patch: https://patchwork.freedesktop.org/series/137343/
Regards,
Krishna.
thanks
Lucas De Marchi
[1] https://lore.kernel.org/linux-modules/cover.1652113087.git.mchehab@xxxxxxxxxx/
>mei_hdcp 28672 0
>mei_pxp 16384 0
>mei_me 49152 3 mei_gsc
>mei 167936 7 mei_gsc_proxy,mei_gsc,mei_hdcp,mei_pxp,mei_me
>root@DUT6127BMGFRD:/home/gta#
>
>Regards,
>Krishna.
>
>>
>> Lucas De Marchi
>>
>> >
>> >Cc: Alexander Usyskin <alexander.usyskin@xxxxxxxxx>
>> >Cc: Tomas Winkler <tomas.winkler@xxxxxxxxx>
>> >Cc: Alan Previn <alan.previn.teres.alexis@xxxxxxxxx>
>> >Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@xxxxxxxxx>
>> >Cc: Lucas De Marchi <lucas.demarchi@xxxxxxxxx>
>> >Cc: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
>> >Cc: Jani Nikula <jani.nikula@xxxxxxxxx>
>> >Cc: Joonas Lahtinen <joonas.lahtinen@xxxxxxxxxxxxxxx>
>> >Cc: Tvrtko Ursulin <tursulin@xxxxxxxxxxx>
>> >
>> >> +
>> >> static int __init xe_init(void)
>> >> {
>> >> int err, i;
>> >> --
>> >> 2.25.1
>> >>