On Wed, Jul 13, 2011 at 10:10 AM, Matthew Wilcox <matthew@xxxxxx> wrote:
> On Tue, Jul 12, 2011 at 10:30:35PM +0200, Jens Axboe wrote:
>> It's probably the grouping, we need to do something about that. Does the
>> below patch make it behave as you expect?
>
> "something", absolutely.  But there is benefit from doing some aggregation
> (we tried disabling it entirely with the "well-known OLTP benchmark" and
> performance went down).
>
> Ideally we'd do something like "if the softirq is taking up more than 10%
> of a core, split the grouping".  Do we have enough stats to do that kind
> of monitoring?

What platform was your "OLTP benchmark" on?  It seems that as the number
of cores per package goes up, this grouping becomes too coarse, since
almost everyone will have SCHED_MC set in the code:

	static inline int blk_cpu_to_group(int cpu)
	{
		int group = NR_CPUS;
	#ifdef CONFIG_SCHED_MC
		const struct cpumask *mask = cpu_coregroup_mask(cpu);
		group = cpumask_first(mask);
	#elif defined(CONFIG_SCHED_SMT)
		group = cpumask_first(topology_thread_cpumask(cpu));
	#else
		return cpu;
	#endif
		if (likely(group < NR_CPUS))
			return group;
		return cpu;
	}

and so we use cpumask_first(cpu_coregroup_mask(cpu)).  And from

	const struct cpumask *cpu_coregroup_mask(int cpu)
	{
		struct cpuinfo_x86 *c = &cpu_data(cpu);
		/*
		 * For perf, we return last level cache shared map.
		 * And for power savings, we return cpu_core_map
		 */
		if ((sched_mc_power_savings || sched_smt_power_savings) &&
		    !(cpu_has(c, X86_FEATURE_AMD_DCM)))
			return cpu_core_mask(cpu);
		else
			return cpu_llc_shared_mask(cpu);
	}

in the "max performance" case, we use cpu_llc_shared_mask().

The problem as we've seen it is that on a dual-socket Westmere (Xeon
56xx) system, we have two sockets with 6 cores (12 threads) each, all
sharing L3 cache, and so we end up with all block softirqs on only 2
out of 24 threads, which is not enough to handle all the IOPS that
fast storage can provide.

It's not clear to me what the right answer or tradeoffs are here.
It might make sense to use only one hyperthread per core for block
softirqs.  As I understand the Westmere cache topology, there's not
really an obvious intermediate step -- all the cores in a package share
the L3, and then each core has its own L2.

Limiting softirqs to 10% of a core seems a bit low, since we seem to be
able to use more than 100% of a core handling block softirqs, and anyway
magic numbers like that seem to always be wrong sometimes.  Perhaps we
could use the queue length on the destination CPU as a proxy for how
busy ksoftirqd is?

 - R.