On 08/29, Viresh Kumar wrote:
> On 28 August 2013 22:22, Stephen Boyd <sboyd@xxxxxxxxxxxxxx> wrote:
> >
> > I've applied these patches on top of v3.10
> >
> > f51e1eb63d9c28cec188337ee656a13be6980cfd (cpufreq: Fix cpufreq regression after suspend/resume)
> > aae760ed21cd690fe8a6db9f3a177ad55d7e12ab (cpufreq: Revert commit a66b2e to fix suspend/resume regression)
> > e8d05276f236ee6435e78411f62be9714e0b9377 (cpufreq: Revert commit 2f7021a8 to fix CPU hotplug regression)
> > 2a99859932281ed6c2ecdd988855f8f6838f6743 (cpufreq: Fix cpufreq driver module refcount balance after suspend/resume)
> > 419e172145cf6c51d436a8bf4afcd17511f0ff79 (cpufreq: don't leave stale policy pointer in cdbs->cur_policy)
> > 95731ebb114c5f0c028459388560fc2a72fe5049 (cpufreq: Fix governor start/stop race condition)
> >
> > That second to last one causes a NULL pointer exception after the mutex
> > warning above because the limits case does
> >
> >         if (policy->max < cpu_cdbs->cur_policy->cur)
> >
> > and that dereferences a NULL cur_policy pointer.
>
> I have seen something similar and the error checking patch that
> I mentioned earlier came as solution to that only..

Yes that patch may reduce the chance of the race condition but I
don't believe it removes it entirely. I believe this bug still
exists in linux-next. Consider the scenario where CPU1 is going
down.

__cpufreq_remove_dev()
 ret = __cpufreq_governor(policy, CPUFREQ_GOV_STOP);
  __cpufreq_governor()
   policy->governor->governor(policy, CPUFREQ_GOV_STOP);
    cpufreq_governor_dbs()
     case CPUFREQ_GOV_STOP:
      mutex_destroy(&cpu_cdbs->timer_mutex)
      cpu_cdbs->cur_policy = NULL;
  <PREEMPT>
store()
 __cpufreq_set_policy()
  ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS);
   __cpufreq_governor()
    policy->governor->governor(policy, CPUFREQ_GOV_LIMITS);
     case CPUFREQ_GOV_LIMITS:
      mutex_lock(&cpu_cdbs->timer_mutex); <-- Warning (destroyed mutex)
       if (policy->max < cpu_cdbs->cur_policy->cur) <- cur_policy == NULL

Once we stop the governor I don't see how another thread can't
race in and get all the way down into the GOV_LIMITS case. Even
if we wanted to lock out that thread with some mutex or semaphore
it will have to continue running eventually and so we really need
to wait until all the sysfs files are gone before we stop the
governor (in the case of the last cpu for the policy) or we need
to stop and start the governor while holding the policy semaphore
to prevent a race.

> >
> > Are there any fixes that I'm missing? I see that some things are
> > changing in linux-next but they don't look like fixes, more like
> > optimizations.
>
> Getting patches over 3.10 would be tricky.. You are two kernel
> version back and that's not going to help much.. There are too many
> patches in between linux-next and 3.10..
>
> I really can't tell you which specific ones to include, as I am lost in them :)

That's a problem. 3.10 is the next long term stable kernel and so
we need to backport any fixes to 3.10 for the next two years.
Hopefully these bugs I'm finding in the 3.10 stable kernel's
cpufreq code aren't known issues on 3.11/next.

>
> probably try to get all of them in ? i.e. All patches touching drivers/cpufreq
> and include/linux/cpufreq.h..

I may have to try that. I got another crash below.
This time governor was assigned to NULL in
cpufreq_add_dev_interface() and then userspace came in and wrote
to scaling_min_freq, which tries to use the governor pointer in
__cpufreq_governor() but it's NULL. It looks like a change
silently fixed this problem by wrapping all this code in a rwsem
(6eed940 cpufreq: Use rwsem for protecting critical sections).
Should we backport that change to 3.10.x trees? Alternatively, we
can reorder the creation of the sysfs files with the policy setup
during CPU up (patch below) so that userspace can't possibly be
in the kernel at this time.

I think there is also another race between the cpufreq stats and
hotplug. I'm getting a sysfs warning about creating duplicate
cpufreq/stats files and I think that's because the thread that
created the sysfs file is preempted before it can assign the
cpufreq_stats_table and then cpufreq_add_dev_interface() comes in
and tries to create the table a second time. I need to keep
looking at that race to better understand it.

>
> I have got Arndale (Samsung-exnos) board where offlining CPUs is broken
> @Kukjin: Can you please try to get it fixed?? It leads to crashes..

Maybe you can reproduce this on an x86 machine? This is all
generic code.
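
For anyone who wants to reason about the window without a board,
here is a minimal userspace sketch of the two bring-up orderings.
It is only a model under stated assumptions: every name in it
(model_policy, model_store, bringup_files_first, and so on) is
hypothetical, none of it is kernel code, and the racing writer is
injected as a callback instead of a real thread so the outcome is
deterministic. -EFAULT stands in for the spot where the kernel
would have dereferenced the NULL governor.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; not the real structures. */
struct model_governor {
	int (*limits)(unsigned int min_khz);
};

struct model_policy {
	const struct model_governor *governor; /* NULL until policy setup ran */
	int files_visible;                     /* nonzero once "sysfs" files exist */
	unsigned int min_khz;
};

static int accept_limits(unsigned int min_khz)
{
	(void)min_khz;
	return 0;
}

static const struct model_governor model_gov = { .limits = accept_limits };

/*
 * Models a userspace write to a policy attribute.
 * -ENOENT: the file does not exist yet, so the write never reaches us.
 * -EFAULT: the point where the kernel would have oopsed on a NULL governor.
 */
static int model_store(struct model_policy *p, unsigned int min_khz)
{
	int ret;

	if (!p->files_visible)
		return -ENOENT;
	if (p->governor == NULL)
		return -EFAULT;
	ret = p->governor->limits(min_khz);
	if (ret == 0)
		p->min_khz = min_khz;
	return ret;
}

typedef int (*writer_fn)(struct model_policy *p);

/* Files first, policy second: the writer runs inside the window. */
static int bringup_files_first(struct model_policy *p, writer_fn racing_writer)
{
	int seen;

	p->files_visible = 1;
	seen = racing_writer(p);
	p->governor = &model_gov;
	return seen;
}

/* Policy first, files second: the writer cannot get in early. */
static int bringup_policy_first(struct model_policy *p, writer_fn racing_writer)
{
	int seen;

	p->governor = &model_gov;
	seen = racing_writer(p);
	p->files_visible = 1;
	return seen;
}

/* 300000 kHz is an arbitrary value chosen for the illustration. */
static int write_min_freq(struct model_policy *p)
{
	return model_store(p, 300000);
}

int model_demo(void)
{
	struct model_policy old_order = {0};
	struct model_policy new_order = {0};

	assert(bringup_files_first(&old_order, write_min_freq) == -EFAULT);
	assert(bringup_policy_first(&new_order, write_min_freq) == -ENOENT);
	assert(model_store(&new_order, 300000) == 0);
	return 0;
}
```

Calling model_demo() from a small main() exercises both orderings.
The model says nothing about the teardown race with GOV_STOP; it
only covers the CPU-up ordering.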

Unable to handle kernel NULL pointer dereference at virtual address 00000020
pgd = ea46c000
[00000020] *pgd=00000000
Internal error: Oops: 5 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 1 PID: 3665 Comm: sh Tainted: G W 3.10.0 #37
task: ea5b2300 ti: eaa6e000 task.ti: eaa6e000
PC is at __cpufreq_governor+0x10/0x1a4
LR is at __cpufreq_set_policy+0x278/0x2c0
pc : [<c0677218>] lr : [<c067765c>] psr: 60000013
sp : eaa6fe40 ip : 00000000 fp : 00000000
r10: ea437ddc r9 : c0a6d86c r8 : eaa6ff80
r7 : 00000000 r6 : 00000000 r5 : 00000003 r4 : ea437d80
r3 : 00000000 r2 : 000493e0 r1 : 00000000 r0 : ea437d80
Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c5787d Table: 2a46c06a DAC: 00000015
Process sh (pid: 3665, stack limit = 0xeaa6e238)
Stack: (0xeaa6fe40 to 0xeaa70000)
fe40: eaa6fe74 ea437d80 00000000 c067765c ea437d80 eaa6fe74 000493e0 ea437d80
fe60: 00000007 ea8d5000 c0ffb028 c0677c38 01437300 00000002 00000002 00000000
fe80: 00000001 00000000 00229200 000493e0 00000000 000493e0 00229200 000493e0
fea0: 00000000 00000000 00000000 00000000 ffffffe0 ea437dc0 ea437dc0 c0678ef4
fec0: 000493e0 00229200 00000000 00000000 ebe297c0 ea437de0 ea437de0 c32d3050
fee0: 00000000 c0ffaf84 ebd6ce40 00000002 00000003 00000000 00000000 dead4ead
ff00: ffffffff ffffffff ea437e14 ea437e14 00000007 ea437d80 ea8d5000 c06788fc
ff20: 00000007 eab7ecc0 ebe26e00 00000007 ebe26e18 c02ae634 ea437300 00000007
ff40: b85ae36c eaa6ff80 b85ae36c eaa6e000 00000007 c025d438 ea437300 b85ae36c
ff60: 00000007 00000000 00000000 ea437300 00000000 b85ae36c 00000007 c025d790
ff80: 00000000 00000000 00000007 00000003 00000007 00000001 00000004 c0106304
ffa0: 00000000 c0106180 00000003 00000007 00000001 b85ae36c 00000007 ffffffff
ffc0: 00000003 00000007 00000001 00000004 b85ae36c 00000000 00000000 00000000
ffe0: 00000000 bed834c8 b6f67d75 b6f02208 20000010 00000001 00000000 00000000
[<c0677218>] (__cpufreq_governor+0x10/0x1a4) from [<c067765c>] (__cpufreq_set_policy+0x278/0x2c0)
[<c067765c>] (__cpufreq_set_policy+0x278/0x2c0) from [<c0677c38>] (store_scaling_min_freq+0x80/0x9c)
[<c0677c38>] (store_scaling_min_freq+0x80/0x9c) from [<c06788fc>] (store+0x58/0x90)
[<c06788fc>] (store+0x58/0x90) from [<c02ae634>] (sysfs_write_file+0x100/0x148)
[<c02ae634>] (sysfs_write_file+0x100/0x148) from [<c025d438>] (vfs_write+0xcc/0x174)
[<c025d438>] (vfs_write+0xcc/0x174) from [<c025d790>] (SyS_write+0x38/0x64)
[<c025d790>] (SyS_write+0x38/0x64) from [<c0106180>] (ret_fast_syscall+0x0/0x30)
Code: e92d4070 e1a05001 e5901030 e1a04000 (e5913020)

Patch is based on 3.10 plus all the patches I mentioned above.

---8<----
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index cbfe3c1..ae4b59c 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -755,6 +755,29 @@ static int cpufreq_add_dev_interface(unsigned int cpu,
 	if (ret)
 		return ret;
 
+	write_lock_irqsave(&cpufreq_driver_lock, flags);
+	for_each_cpu(j, policy->cpus) {
+		per_cpu(cpufreq_cpu_data, j) = policy;
+		per_cpu(cpufreq_policy_cpu, j) = policy->cpu;
+	}
+	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
+
+	memcpy(&new_policy, policy, sizeof(struct cpufreq_policy));
+	/* assure that the starting sequence is run in __cpufreq_set_policy */
+	policy->governor = NULL;
+
+	/* set default policy */
+	ret = __cpufreq_set_policy(policy, &new_policy);
+	policy->user_policy.policy = policy->policy;
+	policy->user_policy.governor = policy->governor;
+
+	if (ret) {
+		pr_debug("setting policy failed\n");
+		if (cpufreq_driver->exit)
+			cpufreq_driver->exit(policy);
+		return ret;
+	}
+
 	/* set up files for this cpu device */
 	drv_attr = cpufreq_driver->attr;
 	while ((drv_attr) && (*drv_attr)) {
@@ -779,31 +802,10 @@ static int cpufreq_add_dev_interface(unsigned int cpu,
 			goto err_out_kobj_put;
 	}
 
-	write_lock_irqsave(&cpufreq_driver_lock, flags);
-	for_each_cpu(j, policy->cpus) {
-		per_cpu(cpufreq_cpu_data, j) = policy;
-		per_cpu(cpufreq_policy_cpu, j) = policy->cpu;
-	}
-	write_unlock_irqrestore(&cpufreq_driver_lock, flags);
-
 	ret = cpufreq_add_dev_symlink(cpu, policy);
 	if (ret)
 		goto err_out_kobj_put;
 
-	memcpy(&new_policy, policy, sizeof(struct cpufreq_policy));
-	/* assure that the starting sequence is run in __cpufreq_set_policy */
-	policy->governor = NULL;
-
-	/* set default policy */
-	ret = __cpufreq_set_policy(policy, &new_policy);
-	policy->user_policy.policy = policy->policy;
-	policy->user_policy.governor = policy->governor;
-
-	if (ret) {
-		pr_debug("setting policy failed\n");
-		if (cpufreq_driver->exit)
-			cpufreq_driver->exit(policy);
-	}
 	return ret;
 
 err_out_kobj_put:
-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation