On Wed, 16 Mar 2016 15:21:24 +0530 Chandra Sekhar Lingutla <clingutla@xxxxxxxxxxxxxx> wrote: > Hi Ming, > > On 03/16/2016 02:34 PM, Chandra Sekhar Lingutla wrote: > > On 03/16/2016 02:28 PM, Ming Lei wrote: > >> On Wed, Mar 16, 2016 at 4:48 PM, Chandra Sekhar Lingutla > >> <clingutla@xxxxxxxxxxxxxx> wrote: > >>> On 03/16/2016 02:12 PM, Ming Lei wrote: > >>>> > >>>> On Wed, Mar 16, 2016 at 3:02 PM, Chandra Sekhar Lingutla > >>>> <clingutla@xxxxxxxxxxxxxx> wrote: > >>>>> > >>>>> Hi Ming, > >>>>> > >>>>> > >>>>> On 03/16/2016 07:36 AM, Ming Lei wrote: > >>>>>> > >>>>>> > >>>>>> On Tue, Mar 15, 2016 at 11:09 PM, Ming Lei <ming.lei@xxxxxxxxxxxxx> > >>>>>> wrote: > >>>>>>> > >>>>>>> > >>>>>>> Hi, > >>>>>>> > >>>>>>> On Mon, Mar 14, 2016 at 8:00 PM, <clingutla@xxxxxxxxxxxxxx> wrote: > >>>>>>>> > >>>>>>>> > >>>>>>>> Hello, > >>>>>>>> > >>>>>>>> I see possible race in _request_firmware_load function, on which I > >>>>>>>> wanted to > >>>>>>>> take your opinion. > >>>>>>>> > >>>>>>>> When system is going to low power mode, device_cache_fw_images() is > >>>>>>>> called > >>>>>>>> from pm notifier which schedules asynchronous workers to cache devices > >>>>>>>> firmware. > >>>>>>>> If more than 2 async requests falls under user helper mode, then there > >>>>>>>> is a > >>>>>>>> use after free of "firmware" directory kobject. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> If two requests are for getting same firmware image, only one can be > >>>>>>> sent > >>>>>>> out and be completed with user helper, please see > >>>>>>> fw_lookup_and_allocate_buf(), > >>>>>>> so the race shouldn't have been triggered in this case. > >>>>>>> > >>>>>>> So I guess your test must be involved with at least two different > >>>>>>> requests. > >>>>>>> > >>>>>>>> > >>>>>>>> The race will hit, when one asynchronous worker calls device_del() > >>>>>>>> while > >>>>>>>> other asynchronous worker calls get_device_parent() in > >>>>>>>> _request_firmware_load(). > >>>>>>>> After loading firmware, first worker calls device_del(), and > >>>>>>>> "firmware" > >>>>>>>> directory ref count is 1, so cleanup_glue_dir() calls kobject_put() > >>>>>>>> with > >>>>>>>> gdp_mutex lock held. > >>>>>>>> Meanwhile, second async worker calls device_add(), and > >>>>>>>> get_device_parent() > >>>>>>>> is still able to find the firmware directory kobject in ksets after > >>>>>>>> getting > >>>>>>>> gdp_mutex. > >>>>>>>> This kobject gets added as parent to the second device kobject. > >>>>>>>> By the time of accessing the parent kobject, its ref count was 0 and > >>>>>>>> prints > >>>>>>>> below warning stack followed by crash. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> The trick here is just that we don't use 'struct device' in > >>>>>>> device_cache_fw_images(), > >>>>>>> then the firmware device's parent is null during calling device_add() > >>>>>>> from > >>>>>>> _request_firmware_load(), but this should be allowed by driver core. > >>>>>>> > >>>>>>> From firmware class view, both device_add()/device_del() is run with > >>>>>>> pair, > >>>>>>> and driver core still complains during get_device_parent(), so > >>>>>>> something > >>>>>>> inside driver core should be be wrong, IMO. > >>>>>> > >>>>>> > >>>>>> > >>>>>> Thinking about the issue further, I believe it is a false positive which > >>>>>> is > >>>>>> caused by CONFIG_DEBUG_KOBJECT_RELEASE, follows the story: > >>>>>> > >>>>>> 1) device_add() path > >>>>>> > >>>>>> get_device_parent() > >>>>>> ...... > >>>>>> mutex_lock(&gdp_mutex); > >>>>>> > >>>>>> /* find our class-directory at the parent and > >>>>>> reference > >>>>>> it */ > >>>>>> spin_lock(&dev->class->p->glue_dirs.list_lock); > >>>>>> list_for_each_entry(k, &dev->class->p->glue_dirs.list, > >>>>>> entry) > >>>>>> if (k->parent == parent_kobj) { > >>>>>> kobj = kobject_get(k); > >>>>>> break; > >>>>>> } > >>>>>> spin_unlock(&dev->class->p->glue_dirs.list_lock); > >>>>>> if (kobj) { > >>>>>> mutex_unlock(&gdp_mutex); > >>>>>> return kobj; > >>>>>> } > >>>>>> k = class_dir_create_and_add(dev->class, parent_kobj); > >>>>>> mutex_unlock(&gdp_mutex); > >>>>>> ...... > >>>>>> > >>>>>> 2) device_del() path > >>>>>> > >>>>>> cleanup_device_parent() > >>>>>> ->cleanup_glue_dir() > >>>>>> ->mutex_lock(&gdp_mutex); > >>>>>> ->kobject_put(glue_dir); > >>>>>> ->kobject_cleanup() > >>>>>> ->kobject_del() > >>>>>> ->kobj_kset_leave() > >>>>>> ->mutex_unlock(&gdp_mutex); > >>>>>> > >>>>>> So gdp_mutex is held wrt. both looking up/adding the parent kobject > >>>>>> and removing the parent kobject. Actually with this lock, both glur_dir > >>>>>> kobject's referece counting and joining/leaving kset are run atomically. > >>>>>> > >>>>>> The race you found is triggered by CONFIG_DEBUG_KOBJECT_RELEASE, > >>>>>> which causes kobject_cleanup() to be run in workqueue context and the > >>>>>> lock of gdp_mutex can't be hold in that situation anymore. That is the > >>>>>> root > >>>>>> cause for your race. > >>>>>> > >>>>>> That also said DEBUG_KOBJECT_RELEASE might cause trouble in > >>>>>> some situations. > >>>>>> > >>>>>> Thanks, > >>>>>> Ming > >>>>>> > >>>>> I agree that, CONFIG_DEBUG_KOBJECT_RELEASE can cause the race, as it > >>>>> delays > >>>>> call to kobject_cleanup() which deletes kobj entry in glue_dir list. > >>>>> > >>>>> I see the race even after disabling CONFIG_DEBUG_KOBJECT_RELEASE. > >>>>> > >>>>> The race is between kobject_put() and kobject_get() for the "firmware" > >>>>> directory. > >>>>> > >>>>> In device_del() path, I see the race after the call of > >>>>> cleanup_device_parent(). > >>>>> cleanup_device_parent() calls kobject_put(glue_dir) with gdp_mutex lock > >>>>> holding, > >>>>> So ideally, to call kobject_put/kobject_get(glue_dir) we need to acquire > >>>>> gdp_mutex. > >>>>> > >>>>> but kobject_del(kobj) will not hold the gdp_mutex lock while calling > >>>>> kobject_put(kobj->parent). > >>>>> > >>>>> Here is the race: > >>>>> > >>>>> Consider 2 async works calls request_firmware() for 2 different binaries > >>>>> (fw1.bin, fw2.bin) > >>>>> Async work1 got scheduled first, and it falls in user helper mode so it > >>>>> creates "firmware" dir > >>>>> and then creates "fw1.bin" directory to load data from user. > >>>>> After load completion or timeout, it starts cleanup dir's, meantime async > >>>>> work2 got scheduled and > >>>>> it also falls in user helper mode, and story starts here: > >>>>> > >>>>> PATH1 req fw for "fw1.bin" > >>>>> PATH2 > >>>>> req fw for "fw2.bin" > >>>>> _request_firmware_load() > >>>>> device_del(dev1) > >>>>> > >>>>> ->cleanup_device_parent() > >>>>> > >>>>> mutex_lock(&gdp_mutex); > >>>>> > >>>>> _request_firmware_load("fw2.bin") > >>>>> kobject_put(kobj->parent); --> firmware ref ->1 > >>>>> ->device_add(dev2) > >>>>> mutex_unlock(&gdp_mutex); > >>>>> ->get_device_parent() > >>>>> ->kobject_del() > >>>>> mutex_lock(&gdp_mutex) > >>>>> kobj_kset_leave(kobj); -->del "fw1.bin" kobj in kset > >>>>> list_for_each_entry(k, &dev->class->p->glue_dirs.list, entry) > >>>>> kobject_put(kobj->parent); /* No lock :)*/ > >>>>> if (k->parent == parent_kobj) { > >>>>> > >>>>> ->kref_put() --> firmware ref ->0 > >>>>> kobj = kobject_get(k); -> k is "firmware" > >>>>> dir > >>>>> ->kboject_release() > >>>>> break; > >>>>> > >>>>> ->kobject_del() > >>>>> } > >>>>> > >>>>> /*here deletes firmware entry in kset*/ > >>>>> mutex_unlock(&gdp_mutex) > >>>>> > >>>>> ->dev2->kobj.parent = kobj -> kobj is released already > >>>>> > >>>>> > >>>>> > >>>>> PATH 1 is started removing the kobject of "fw1.bin" and its parent > >>>>> "firmware" > >>>>> . > >>>>> cleanup_device_parent(dev1) makes firmware ref count to 1, by calling > >>>>> kobject_put(firmware) with gdp_mutex held. > >>>>> > >>>>> and kobject_del(dev1->kobj) will again calls kobject_put(kobj->parent) > >>>>> without holding gdp_mutex. > >>>>> > >>>>> In PATH2, glue_dir is able to find "firmware" kobj, as firmware dir kobj > >>>>> is > >>>>> deleted from the glue_dir list at kboject_release(). > >>>>> and it calls kobject_get(firmware), by the time PATH1 calls > >>>>> kobject_release(firmware). > >>>>> > >>>>> So PATH2 got reference of freed firmware kobj and added as parent to > >>>>> "fw2.bin" kobj. > >>>>> > >>>>> I think, here is possible fix for this race: > >>>>> > >>>>> ------------------------8<------------------------------------- > >>>>> Signed-off-by: Lingutla Chandrasekhar <clingutla@xxxxxxxxxxxxxx> > >>>>> > >>>>> diff --git a/drivers/base/core.c b/drivers/base/core.c > >>>>> index 5d8b713..47bc082 100644 > >>>>> --- a/drivers/base/core.c > >>>>> +++ b/drivers/base/core.c > >>>>> @@ -1224,7 +1224,13 @@ void device_del(struct device *dev) > >>>>> BUS_NOTIFY_REMOVED_DEVICE, > >>>>> dev); > >>>>> kobject_uevent(&dev->kobj, KOBJ_REMOVE); > >>>>> cleanup_device_parent(dev); > >>>>> - kobject_del(&dev->kobj); > >>>>> + > >>>>> + if (dev->kobj.parent->kset == &dev->class->p->glue_dirs) { > >>>>> + mutex_lock(&gdp_mutex); > >>>>> + kobject_del(&dev->kobj); > >>>>> + mutex_unlock(&gdp_mutex); > >>>>> + } else > >>>>> + kobject_del(&dev->kobj); > >>>>> put_device(parent); > >>>>> } > >>>>> EXPORT_SYMBOL_GPL(device_del); > >>>>> > >>>>> --------------------------->8----------------------------------- > >>>> > >>>> > >>>> OK, one simpler way is to move cleanup_device_parent(dev) > >>>> after kobject_del(&dev->kobj), which looks more clean too. > >>>> > >>> No, We can't move that, bcz kobject_del(&dev->kobj) will clear the parent. > >> > >> Yeah, but I think it is better to do that and just call > >> cleanup_glue_dir() directly > >> because cleanup_device_parent() should have released the glue_dir. > >> > Sorry, i am not able understand correctly, If we reorder kobje_del and cleanup_device_parent(), > kobject_del() clears kobj->parent, and cleanup_device_parent() gets glue_dir with dev->kobj->parent(), > cleanup_glue_dir() returns without calling kobject_put() as glue_dir is NULL, so firmware dir becomes dangling kobj. > Am I missing something here? I mean something like below, and what do you think about it? --- diff --git a/drivers/base/core.c b/drivers/base/core.c index 0a8bdad..3f2e1f8 100644 --- a/drivers/base/core.c +++ b/drivers/base/core.c @@ -836,11 +836,31 @@ static struct kobject *get_device_parent(struct device *dev, return NULL; } +static inline bool live_in_glue_dir(struct kobject *kobj, + struct device *dev) +{ + if (!kobj || !dev->class || + kobj->kset != &dev->class->p->glue_dirs) + return true; + return false; +} + +static inline struct kobject *get_glue_dir(struct device *dev) +{ + if (live_in_glue_dir(&dev->kobj, dev)) + return dev->kobj.parent; + return NULL; +} + +/* + * make sure cleaning up dir as the last step, we need to make + * sure .release handler of kobject is run with holding the + * global lock + */ static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir) { /* see if we live in a "glue" directory */ - if (!glue_dir || !dev->class || - glue_dir->kset != &dev->class->p->glue_dirs) + if (!live_in_glue_dir(glue_dir, dev)) return; mutex_lock(&gdp_mutex); @@ -848,11 +868,6 @@ static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir) mutex_unlock(&gdp_mutex); } -static void cleanup_device_parent(struct device *dev) -{ - cleanup_glue_dir(dev, dev->kobj.parent); -} - static int device_add_class_symlinks(struct device *dev) { struct device_node *of_node = dev_of_node(dev); @@ -1028,6 +1043,7 @@ int device_add(struct device *dev) struct kobject *kobj; struct class_interface *class_intf; int error = -EINVAL; + struct kobject *glue_dir = NULL; dev = get_device(dev); if (!dev) @@ -1072,8 +1088,10 @@ int device_add(struct device *dev) /* first, register with generic layer. */ /* we require the name to be set before, and pass NULL */ error = kobject_add(&dev->kobj, dev->kobj.parent, NULL); - if (error) + if (error) { + glue_dir = get_glue_dir(dev); goto Error; + } /* notify platform of device entry */ if (platform_notify) @@ -1154,9 +1172,10 @@ done: device_remove_file(dev, &dev_attr_uevent); attrError: kobject_uevent(&dev->kobj, KOBJ_REMOVE); + glue_dir = get_glue_dir(dev); kobject_del(&dev->kobj); Error: - cleanup_device_parent(dev); + cleanup_glue_dir(dev, glue_dir); put_device(parent); name_error: kfree(dev->p); @@ -1232,6 +1251,7 @@ EXPORT_SYMBOL_GPL(put_device); void device_del(struct device *dev) { struct device *parent = dev->parent; + struct kobject *glue_dir = NULL; struct class_interface *class_intf; /* Notify clients of device removal. This call must come @@ -1276,8 +1296,9 @@ void device_del(struct device *dev) blocking_notifier_call_chain(&dev->bus->p->bus_notifier, BUS_NOTIFY_REMOVED_DEVICE, dev); kobject_uevent(&dev->kobj, KOBJ_REMOVE); - cleanup_device_parent(dev); + glue_dir = get_glue_dir(dev); kobject_del(&dev->kobj); + cleanup_glue_dir(dev, glue_dir); put_device(parent); } EXPORT_SYMBOL_GPL(device_del); > > >> BTW, failure path of device_add() has been broken already. > >> > > Yah, forgot to mention that. we need to reorder like above fix, can we?. > > > >>> > >>> > >>>>> > >>>>> > >>>>>>> > >>>>>>>> > >>>>>>>> [21147.455038] ------------[ cut here ]------------ > >>>>>>>> [21147.458910] WARNING: at <...>/kernel/include/linux/kref.h:47 > >>>>>>>> kobject_get+0x50/0x68() > >>>>>>>> [21147.481199] Modules linked in: core_ctl(PO) qdrbg_module(O) > >>>>>>>> qcrypto_module(O) > >>>>>>>> [21147.481256] CPU: 2 PID: 23935 Comm: kworker/u16:8 Tainted: P W O > >>>>>>>> 3.10.84-g0957845-00427-g56a05c2 #1 > >>>>>>>> [21147.481284] Workqueue: events_unbound async_run_entry_fn > >>>>>>>> [21147.481296] Call trace: > >>>>>>>> [21147.481315] [<ffffffc000206b60>] dump_backtrace+0x0/0x270 > >>>>>>>> [21147.481331] [<ffffffc000206de0>] show_stack+0x10/0x1c > >>>>>>>> [21147.481353] [<ffffffc000d569ec>] dump_stack+0x1c/0x28 > >>>>>>>> [21147.481370] [<ffffffc00021cc18>] warn_slowpath_common+0x74/0x9c > >>>>>>>> [21147.481383] [<ffffffc00021ce4c>] warn_slowpath_null+0x14/0x20 > >>>>>>>> [21147.481397] [<ffffffc000456b10>] kobject_get+0x4c/0x68 > >>>>>>>> [21147.481416] [<ffffffc000609f64>] get_device_parent+0xa0/0x194 > >>>>>>>> [21147.481430] [<ffffffc00060a3f4>] device_add+0x100/0x600 > >>>>>>>> [21147.481446] [<ffffffc00061c390>] _request_firmware+0x8b4/0xa80 > >>>>>>>> [21147.481459] [<ffffffc00061c58c>] request_firmware+0x30/0x3c > >>>>>>>> [21147.481473] [<ffffffc00061c5ec>] cache_firmware+0x54/0xb0 > >>>>>>>> [21147.481490] [<ffffffc00061c65c>] > >>>>>>>> __async_dev_cache_fw_image+0x14/0x54 > >>>>>>>> [21147.481505] [<ffffffc000243794>] async_run_entry_fn+0x6c/0x12c > >>>>>>>> [21147.481521] [<ffffffc000237e88>] process_one_work+0x264/0x3dc > >>>>>>>> [21147.481535] [<ffffffc0002392c0>] worker_thread+0x1f0/0x340 > >>>>>>>> [21147.481553] [<ffffffc00023e500>] kthread+0xac/0xb8 > >>>>>>>> [21147.481563] ---[ end trace dabc98ea48b8ba59 ]--- > >>>>>>>> > >>>>>>>> [21147.486436] Unable to handle kernel paging request at virtual > >>>>>>>> address > >>>>>>>> 0x6b6b6b6b6b6b6bcb > >>>>>>>> [21147.493816] pgd = ffffffc0a982d000 > >>>>>>>> [21147.498337] [6b6b6b6b6b6b6bcb] *pgd=0000000000000000 > >>>>>>>> [21147.502287] Internal error: Oops: 96000004 [#1] PREEMPT SMP > >>>>>>>> [21147.507826] Modules linked in: core_ctl(PO) qdrbg_module(O) > >>>>>>>> qcrypto_module(O) > >>>>>>>> [21147.514951] CPU: 2 PID: 23935 Comm: kworker/u16:8 Tainted: P W O > >>>>>>>> 3.10.84-g0957845-00427-g56a05c2 #1 > >>>>>>>> [21147.524686] Workqueue: events_unbound async_run_entry_fn > >>>>>>>> [21147.529965] task: ffffffc008138000 ti: ffffffc007880000 task.ti: > >>>>>>>> ffffffc007880000 > >>>>>>>> [21147.537438] PC is at sysfs_create_dir+0x34/0xd4 > >>>>>>>> [21147.541949] LR is at kobject_add_internal+0x120/0x24c > >>>>>>>> [21147.546979] pc : [<ffffffc000361fa8>] lr : [<ffffffc000456f5c>] > >>>>>>>> pstate: > >>>>>>>> 80000145 > >>>>>>>> [21147.996295] Process kworker/u16:8 (pid: 23935, stack limit = > >>>>>>>> 0xffffffc007880058) > >>>>>>>> [21148.003673] Call trace: > >>>>>>>> [21148.006117] [<ffffffc000361fa8>] sysfs_create_dir+0x34/0xd4 > >>>>>>>> [21148.011663] [<ffffffc000456f58>] kobject_add_internal+0x11c/0x24c > >>>>>>>> [21148.017737] [<ffffffc0004573ec>] kobject_add+0xc8/0xe4 > >>>>>>>> [21148.022863] [<ffffffc00060a410>] device_add+0x11c/0x600 > >>>>>>>> [21148.028069] [<ffffffc00061c390>] _request_firmware+0x8b4/0xa80 > >>>>>>>> [21148.033886] [<ffffffc00061c58c>] request_firmware+0x30/0x3c > >>>>>>>> [21148.039439] [<ffffffc00061c5ec>] cache_firmware+0x54/0xb0 > >>>>>>>> [21148.044822] [<ffffffc00061c65c>] > >>>>>>>> __async_dev_cache_fw_image+0x14/0x54 > >>>>>>>> [21148.051249] [<ffffffc000243794>] async_run_entry_fn+0x6c/0x12c > >>>>>>>> [21148.057064] [<ffffffc000237e88>] process_one_work+0x264/0x3dc > >>>>>>>> [21148.062792] [<ffffffc0002392c0>] worker_thread+0x1f0/0x340 > >>>>>>>> [21148.068263] [<ffffffc00023e500>] kthread+0xac/0xb8 > >>>>>>>> [21148.073039] Code: b5000094 14000026 d000ac54 9125a294 (7940c281) > >>>>>>>> [21148.080312] ---[ end trace dabc98ea48b8ba5a ]--- > >>>>>>>> > >>>>>>>> Below is the callstack where the parent is freed from the slub_debug > >>>>>>>> info. > >>>>>>>> 0xFFFFFFC000608D74 \\vmlinux\base/core\class_dir_release+0x0C > >>>>>>>> 0xFFFFFFC0002FE6BC \\vmlinux\slub\free_debug_processing\fail+0x114 > >>>>>>>> 0xFFFFFFC0002FE7A0 \\vmlinux\slub\__slab_free+0x44 > >>>>>>>> 0xFFFFFFC0002FEE60 \\vmlinux\slub\kfree+0x1F8 > >>>>>>>> 0xFFFFFFC000608D70 \\vmlinux\base/core\class_dir_release+0x8 > >>>>>>>> 0xFFFFFFC000456D70 \\vmlinux\kobject\kobject_release+0x134 > >>>>>>>> 0xFFFFFFC000456E10 \\vmlinux\kobject\kobject_put+0x58 > >>>>>>>> 0xFFFFFFC000456C28 \\vmlinux\kobject\kobject_del+0x64 > >>>>>>>> 0xFFFFFFC000609D78 \\vmlinux\base/core\device_del+0x150 > >>>>>>>> 0xFFFFFFC00061C524 > >>>>>>>> \\vmlinux\firmware_class\_request_firmware\out+0x71C > >>>>>>>> 0xFFFFFFC00061C58C \\vmlinux\firmware_class\request_firmware+0x30 > >>>>>>>> 0xFFFFFFC00061C5EC \\vmlinux\firmware_class\cache_firmware+0x54 > >>>>>>>> 0xFFFFFFC00061C65C > >>>>>>>> \\vmlinux\firmware_class\__async_dev_cache_fw_image+0x14 > >>>>>>>> 0xFFFFFFC000243794 \\vmlinux\async\async_run_entry_fn+0x6C > >>>>>>>> 0xFFFFFFC000237E88 \\vmlinux\workqueue\process_one_work+0x264 > >>>>>>>> 0xFFFFFFC0002392C0 \\vmlinux\workqueue\worker_thread\recheck+0x1A0 > >>>>>>>> 0xFFFFFFC00023E500 \\vmlinux\kthread\kthread+0xAC > >>>>>>>> > >>>>>>>> For similar type of issue, I see there is an existing fix: "sysfs: > >>>>>>>> driver > >>>>>>>> core: Fix glue dir race condition by gdp_mutex". > >>>>>>>> > >>>>>>>> > >>>>>>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/drivers/base/core.c?id=e4a60d139060975eb956717e4f63ae348d4d8cc5 > >>>>>>>> > >>>>>>>> Still I am able to reproduce the issue. I have verified this race on > >>>>>>>> kernels: 3.10, 3.18 and 4.4. > >>>>>>>> > >>>>>>>> I followed below procedure to reproduce the issue: > >>>>>>>> 1. Enable "CONFIG_DEBUG_KOBJECT_RELEASE" > >>>>>>>> 2. Use test_firmware, modified the test for async calls > >>>>>>>> 3. Replace WARN_ON with BUG_ON in kref_get(). > >>>>>>>> I ran below script from shell: > >>>>>>>> count=0 > >>>>>>>> while [ 1 ] > >>>>>>>> do > >>>>>>>> echo 3 > > >>>>>>>> /sys/devices/virtual/misc/test_firmware/trigger_ufw > >>>>>>>> if [ $? -ne 0 ]; then > >>>>>>>> print "Exiting.. " > >>>>>>>> exit 1 > >>>>>>>> fi > >>>>>>>> count=$(($count +1)) > >>>>>>>> echo "count $count" > >>>>>>>> done > >>>>>>>> > >>>>>>>> With below patch, I could reproduce the issue in the 2nd iteration. > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> I will run your test script to see if there is new findings. > >>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> -------------------------8<---------------------------------------------- > >>>>>>>> > >>>>>>>> To reproduce the race condition, use firmware test and > >>>>>>>> CONFIG_DEBUG_KOBJECT_RELEASE=y, which delays kobject > >>>>>>>> releases. So that we can catch the bug easily. > >>>>>>>> > >>>>>>>> Enabled below flags in kernel config file: > >>>>>>>> > >>>>>>>> +CONFIG_DEBUG_KOBJECT_RELEASE=y > >>>>>>>> +CONFIG_TEST_FIRMWARE=y > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> Looks I can't reproduce the issue with your approach. > >>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> diff --git a/include/linux/kref.h b/include/linux/kref.h > >>>>>>>> index 484604d..727fb24 100644 > >>>>>>>> --- a/include/linux/kref.h > >>>>>>>> +++ b/include/linux/kref.h > >>>>>>>> @@ -44,7 +44,7 @@ static inline void kref_get(struct kref *kref) > >>>>>>>> * condition when this kref is freeing by some other thread > >>>>>>>> right > >>>>>>>> now. > >>>>>>>> * In this case one should use kref_get_unless_zero() > >>>>>>>> */ > >>>>>>>> - WARN_ON_ONCE(atomic_inc_return(&kref->refcount) < 2); > >>>>>>>> + BUG_ON(atomic_inc_return(&kref->refcount) < 2); > >>>>>>>> } > >>>>>>>> > >>>>>>>> /** > >>>>>>>> diff --git a/lib/test_firmware.c b/lib/test_firmware.c > >>>>>>>> old mode 100644 > >>>>>>>> new mode 100755 > >>>>>>>> index 86374c1..14c9598 > >>>>>>>> --- a/lib/test_firmware.c > >>>>>>>> +++ b/lib/test_firmware.c > >>>>>>>> @@ -18,6 +18,7 @@ > >>>>>>>> #include <linux/miscdevice.h> > >>>>>>>> #include <linux/slab.h> > >>>>>>>> #include <linux/uaccess.h> > >>>>>>>> +#include <linux/async.h> > >>>>>>>> > >>>>>>>> + > >>>>>>>> +static ASYNC_DOMAIN_EXCLUSIVE(ufw_domain); > >>>>>>>> + > >>>>>>>> +static void __async_ufw_work(void *name, > >>>>>>>> + async_cookie_t cookie) > >>>>>>>> +{ > >>>>>>>> + const struct firmware *fw; > >>>>>>>> + const char *fw_name = name; > >>>>>>>> + int ret; > >>>>>>>> + > >>>>>>>> + pr_err("requesting %s\n", fw_name); > >>>>>>>> + > >>>>>>>> + ret = request_firmware(&fw, fw_name, NULL); > >>>>>>>> + pr_err("loaded: %zu\n", fw ? fw->size : 0); > >>>>>>>> + if (!ret) > >>>>>>>> + kfree(fw); > >>>>>>>> + > >>>>>>>> +} > >>>>>>>> + > >>>>>>>> +static char *name[] = { "fw1.bin", "fw2.bin"}; > >>>>>>>> + > >>>>>>>> +static ssize_t trigger_ufw_store(struct device *dev, > >>>>>>>> + struct device_attribute *attr, > >>>>>>>> + const char *buf, size_t count) > >>>>>>>> +{ > >>>>>>>> + int rc, i, c=0; > >>>>>>>> + u32 start; > >>>>>>>> + char *fw_name; > >>>>>>>> + > >>>>>>>> + rc = kstrtou32(buf, 0, &start); > >>>>>>>> + if (rc){ > >>>>>>>> + pr_err("Invalid option\n"); > >>>>>>>> + return rc; > >>>>>>>> + } > >>>>>>>> + pr_err(" no of iterations %d\n", start); > >>>>>>>> + > >>>>>>>> + for (i=0; i< start; i++) > >>>>>>>> + { > >>>>>>>> + if (c >= 2) > >>>>>>>> + c = 0; > >>>>>>>> + fw_name = name[c++]; > >>>>>>>> + async_schedule_domain(__async_ufw_work, (void *)fw_name, > >>>>>>>> &ufw_domain); > >>>>>>>> + } > >>>>>>>> + async_synchronize_full_domain(&ufw_domain); > >>>>>>>> + return count; > >>>>>>>> +} > >>>>>>>> > >>>>>>>> static DEVICE_ATTR_WO(trigger_request); > >>>>>>>> +static DEVICE_ATTR_WO(trigger_ufw); > >>>>>>>> > >>>>>>>> static int __init test_firmware_init(void) > >>>>>>>> { > >>>>>>>> @@ -92,6 +140,12 @@ static int __init test_firmware_init(void) > >>>>>>>> goto dereg; > >>>>>>>> } > >>>>>>>> > >>>>>>>> + rc = device_create_file(test_fw_misc_device.this_device, > >>>>>>>> + &dev_attr_trigger_ufw); > >>>>>>>> + if (rc) { > >>>>>>>> + pr_err("could not create sysfs interface: %d\n", rc); > >>>>>>>> + goto dereg; > >>>>>>>> + } > >>>>>>>> pr_warn("interface ready\n"); > >>>>>>>> > >>>>>>>> return 0; > >>>>>>>> @@ -106,7 +160,9 @@ static void __exit test_firmware_exit(void) > >>>>>>>> { > >>>>>>>> release_firmware(test_firmware); > >>>>>>>> device_remove_file(test_fw_misc_device.this_device, > >>>>>>>> - &dev_attr_trigger_request); > >>>>>>>> + &dev_attr_trigger_request); > >>>>>>>> + device_remove_file(test_fw_misc_device.this_device, > >>>>>>>> + &dev_attr_trigger_ufw); > >>>>>>>> misc_deregister(&test_fw_misc_device); > >>>>>>>> pr_warn("removed interface\n"); > >>>>>>>> } > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> ----------------------------8<-------------------------------------------- > >>>>>>>> > >>>>>>>> Thanks and regards, > >>>>>>>> Chandrasekhar L. > >>>>>>>> -- > >>>>>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a > >>>>>>>> member of > >>>>>>>> Code Aurora Forum, > >>>>>>>> a Linux Foundation Collaborative Project. > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in > > the body of a message to majordomo@xxxxxxxxxxxxxxx > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html