On Tue, 2022-01-11 at 16:27 -0800, Stephen Boyd wrote: > Quoting Yong Wu (2022-01-11 04:22:23) > > Hi Stephen, > > > > Thanks for helping update here. > > > > On Thu, 2022-01-06 at 13:45 -0800, Stephen Boyd wrote: > > > Use an aggregate driver instead of component ops so that we can > > > get > > > proper driver probe ordering of the aggregate device with respect > > > to > > > all > > > the component devices that make up the aggregate device. > > > > > > Cc: Yong Wu <yong.wu@xxxxxxxxxxxx> > > > Cc: Joerg Roedel <joro@xxxxxxxxxx> > > > Cc: Will Deacon <will@xxxxxxxxxx> > > > Cc: Daniel Vetter <daniel.vetter@xxxxxxxx> > > > Cc: "Rafael J. Wysocki" <rafael@xxxxxxxxxx> > > > Cc: Rob Clark <robdclark@xxxxxxxxx> > > > Cc: Russell King <rmk+kernel@xxxxxxxxxxxxxxxx> > > > Cc: Saravana Kannan <saravanak@xxxxxxxxxx> > > > Signed-off-by: Stephen Boyd <swboyd@xxxxxxxxxxxx> > > > > When I test this on mt8195 which have two IOMMU HWs(calling > > component_aggregate_regsiter twice), it will abort like this. Then > > what > > should we do if we have two instances? > > > > Thanks for testing it out. We can't register the struct driver more > than > once but this driver is calling the component_aggregate_register() > function from the driver probe and there are two devices bound to the > mtk-iommu driver so we try to register it more than once. Sigh! > > I see a couple options. One is to do a deep copy of the driver > structure > and change the driver name. Then it's a one to one relationship > between > device and driver. That's not very great because it leaves around > junk > so it should probably be avoided. > > Another option is to reference count the driver registration calls > when > component_aggregate_register() is called multiple times. Then we > would > only register the driver once and keep it pinned until the last > unregister call is made, but still remove devices that are created > for > the match table. > > Can you try the attached patch? It is based on the next version of > this > patch series so the include part of the patch may not apply cleanly. > > ---8<--- > diff --git a/drivers/base/component.c b/drivers/base/component.c > index 64ad7478c67a..97f253a41bdf 100644 > --- a/drivers/base/component.c > +++ b/drivers/base/component.c > @@ -492,15 +492,30 @@ static struct aggregate_device > *__aggregate_find(struct device *parent) > return dev ? to_aggregate_device(dev) : NULL; > } > > +static DEFINE_MUTEX(aggregate_mutex); > + > static int aggregate_driver_register(struct aggregate_driver *adrv) > { > - adrv->driver.bus = &aggregate_bus_type; > - return driver_register(&adrv->driver); > + int ret = 0; > + > + mutex_lock(&aggregate_mutex); > + if (!refcount_inc_not_zero(&adrv->count)) { > + adrv->driver.bus = &aggregate_bus_type; > + ret = driver_register(&adrv->driver); > + if (!ret) > + refcount_inc(&adrv->count); This should be refcount_set(&adrv->count, 1)? Otherwise, it will warning like this: [ 2.654526] ------------[ cut here ]------------ [ 2.655558] refcount_t: addition on 0; use-after-free. [ 2.656219] WARNING: CPU: 7 PID: 74 at ../v5.16- rc1/kernel/mediatek/lib/refcount.c:25 refcount_warn_saturate+0x128/0x148 ... [ 2.672227] Call trace: [ 2.672539] refcount_warn_saturate+0x128/0x148 [ 2.673118] component_aggregate_register+0x388/0x390 [ 2.673763] mtk_iommu_probe+0x638/0x690 [ 2.686467] ------------[ cut here ]------------ [ 2.687049] refcount_t: saturated; leaking memory. [ 2.687666] WARNING: CPU: 5 PID: 74 at ../v5.16- rc1/kernel/mediatek/lib/refcount.c:19 refcount_warn_saturate+0xfc/0x148 [ 2.703805] Call trace: [ 2.704117] refcount_warn_saturate+0xfc/0x148 [ 2.704685] component_aggregate_register+0x1fc/0x390 [ 2.705330] mtk_iommu_probe+0x638/0x690 > + } > + mutex_unlock(&aggregate_mutex); > + > + return ret; > } > > static void aggregate_driver_unregister(struct aggregate_driver > *adrv) > { > - driver_unregister(&adrv->driver); > + if (refcount_dec_and_mutex_lock(&adrv->count, > &aggregate_mutex)) { > + driver_unregister(&adrv->driver); > + mutex_unlock(&aggregate_mutex); > + } > } > > static struct aggregate_device *aggregate_device_add(struct device > *parent, > diff --git a/include/linux/component.h b/include/linux/component.h > index 53d81203c095..b061341938aa 100644 > --- a/include/linux/component.h > +++ b/include/linux/component.h > @@ -4,6 +4,7 @@ > > #include <linux/stddef.h> > #include <linux/device.h> > +#include <linux/refcount.h> > > struct aggregate_device; > > @@ -66,6 +67,7 @@ struct device *aggregate_device_parent(const struct > aggregate_device *adev); > > /** > * struct aggregate_driver - Aggregate driver (made up of other > drivers) > + * @count: driver registration refcount > * @driver: device driver > */ > struct aggregate_driver { > @@ -101,6 +103,7 @@ struct aggregate_driver { > */ > void (*shutdown)(struct aggregate_device *adev); > > + refcount_t count; > struct device_driver driver; > }; After this patch, the aggregate_driver flow looks ok. But our driver still aborts like this: [ 2.721316] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 ... [ 2.731658] pc : mtk_smi_larb_config_port_gen2_general+0xa4/0x138 [ 2.732434] lr : mtk_smi_larb_resume+0x54/0x98 ... [ 2.742457] Call trace: [ 2.742768] mtk_smi_larb_config_port_gen2_general+0xa4/0x138 [ 2.743496] pm_generic_runtime_resume+0x2c/0x48 [ 2.744090] __genpd_runtime_resume+0x30/0xa8 [ 2.744648] genpd_runtime_resume+0x94/0x2c8 [ 2.745191] __rpm_callback+0x44/0x150 [ 2.745669] rpm_callback+0x6c/0x78 [ 2.746114] rpm_resume+0x314/0x558 [ 2.746559] __pm_runtime_resume+0x3c/0x88 [ 2.747080] pm_runtime_get_suppliers+0x7c/0x110 [ 2.747668] __driver_probe_device+0x4c/0xe8 [ 2.748212] driver_probe_device+0x44/0x130 [ 2.748745] __device_attach_driver+0x98/0xd0 [ 2.749300] bus_for_each_drv+0x68/0xd0 [ 2.749787] __device_attach+0xec/0x148 [ 2.750277] device_attach+0x14/0x20 [ 2.750733] bus_rescan_devices_helper+0x50/0x90 [ 2.751319] bus_for_each_dev+0x7c/0xd8 [ 2.751806] bus_rescan_devices+0x20/0x30 [ 2.752315] __component_add+0x7c/0xa0 [ 2.752795] component_add+0x14/0x20 [ 2.753253] mtk_smi_larb_probe+0xe0/0x120 This is because the device runtime_resume is called before the bind operation(In our case this detailed function is mtk_smi_larb_bind). The issue doesn't happen without this patchset. I'm not sure the right sequence. If we should fix in mediatek driver, the patch could be: diff --git a/drivers/memory/mtk-smi.c b/drivers/memory/mtk-smi.c index b883dcc0bbfa..288841555067 100644 --- a/drivers/memory/mtk-smi.c +++ b/drivers/memory/mtk-smi.c @@ -483,8 +483,9 @@ static int __maybe_unused mtk_smi_larb_resume(struct device *dev) if (ret < 0) return ret; - /* Configure the basic setting for this larb */ - larb_gen->config_port(dev); + /* Configure the basic setting for this larb after it binds with iommu */ + if (larb->mmu) + larb_gen->config_port(dev); return 0; } Another nitpick, the title should be: iommu/mediatek: xxxx