On Tue, Jan 19, 2016 at 03:45:04PM -0500, Chris Metcalf wrote:
> On 01/19/2016 10:42 AM, Frederic Weisbecker wrote:
> >>+/*
> >>+ * Isolation requires both nohz and isolcpus support from the scheduler.
> >>+ * We provide a boot flag that enables both for now, and which we can
> >>+ * add other functionality to over time if needed.  Note that just
> >>+ * specifying "nohz_full=... isolcpus=..." does not enable task isolation.
> >>+ */
> >>+static int __init task_isolation_setup(char *str)
> >>+{
> >>+	alloc_bootmem_cpumask_var(&task_isolation_map);
> >>+	if (cpulist_parse(str, task_isolation_map) < 0) {
> >>+		pr_warn("task_isolation: Incorrect cpumask '%s'\n", str);
> >>+		return 1;
> >>+	}
> >>+
> >>+	alloc_bootmem_cpumask_var(&cpu_isolated_map);
> >>+	cpumask_copy(cpu_isolated_map, task_isolation_map);
> >>+
> >>+	alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
> >>+	cpumask_copy(tick_nohz_full_mask, task_isolation_map);
> >>+	tick_nohz_full_running = true;
> >How about calling tick_nohz_full_setup() instead? I'd prefer that
> >nohz full implementation details stay in tick-sched.c.
> >
> >Also what happens if nohz_full= is given as well as task_isolation= ?
> >Don't we risk a memory leak, and maybe ending up with
> >(nohz_full & task_isolation) != task_isolation, breaking what is
> >really a requirement?
>
> Yeah, this is a good point.  I'm not sure what the best way is to make
> this happen.  It's already true that we will leak memory if you
> specify "nohz_full=" more than once on the command line, but it's
> awkward to fix (assuming we want the last value to win), so maybe
> we can just ignore this problem - it's a pretty small amount of memory
> after all.  If so, then making tick_nohz_full_setup() and
> isolated_cpu_setup() both non-static and calling them from
> task_isolation_setup() might be the cleanest approach.  What do you
> think?

I think we can indeed reuse tick_nohz_full_setup(), or some of its
internals, and encapsulate that in a function so that isolation.c can
initialize nohz full without fiddling with tick-sched.c's internal
variables.

>
> You asked what happens if nohz_full= is given as well, which is a very
> good question.  Perhaps the right answer is to have an early_initcall
> that suppresses task isolation on any cores that lost their nohz_full
> or isolcpus status due to later boot command line arguments (and
> generates a console warning, obviously).

I'd rather imagine that the final nohz full cpumask is:

	"nohz_full=" | "task_isolation="

That's the easiest way to deal with it, and both nohz and task
isolation can call a common initializer that takes care of the
allocation and adds the cpus to the mask.
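Something like the below is what I have in mind - a completely
untested sketch, and the helper name and its home in tick-sched.c are
just my assumption:

/* Untested sketch for kernel/time/tick-sched.c; name is a placeholder. */
void tick_nohz_full_add_cpus(const struct cpumask *cpumask)
{
	if (cpumask_empty(cpumask))
		return;

	/* First caller allocates and clears the mask. */
	if (!tick_nohz_full_running) {
		alloc_bootmem_cpumask_var(&tick_nohz_full_mask);
		cpumask_clear(tick_nohz_full_mask);
		tick_nohz_full_running = true;
	}

	/*
	 * Accumulate rather than overwrite, so a later "nohz_full="
	 * can't shrink the mask under task isolation.
	 */
	cpumask_or(tick_nohz_full_mask, tick_nohz_full_mask, cpumask);
}

Then tick_nohz_full_setup() and task_isolation_setup() both call it,
and (nohz_full & task_isolation) == task_isolation holds by
construction, whatever order the boot arguments come in.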
> >>+int task_isolation_set(unsigned int flags)
> >>+{
> >>+	if (cpumask_weight(tsk_cpus_allowed(current)) != 1 ||
> >>+	    !task_isolation_possible(smp_processor_id()))
> >>+		return -EINVAL;
> >>+
> >>+	current->task_isolation_flags = flags;
> >>+	return 0;
> >>+}
> >What if we concurrently change the task's affinity? Also it seems
> >that preemption isn't disabled, so we can also migrate concurrently.
> >I'm surprised you haven't seen warnings with smp_processor_id().
> >
> >Also we should protect against the task's affinity changing while
> >task_isolation_flags is set.
>
> I talked about this a bit when you raised it for the v8 patch series:
>
>   http://lkml.kernel.org/r/562FA8FD.8080502@xxxxxxxxxx
>
> I'd be curious to hear your take on the arguments I made there.

Oh ok, I'm going to reply there then :)

>
> You're absolutely right about the preemption warnings, which I only
> fixed a few days ago.  In this case I use raw_smp_processor_id()
> since with a fixed single-core cpu affinity, we're not going
> anywhere, so the warning from smp_processor_id() would be bogus.
> And although technically it is still correct (racing with another
> task resetting the task affinity on this one), it is in any case
> equivalent to having that other task reset the affinity on return
> from the prctl(), which I've already claimed isn't an interesting
> use case to try to handle.  But let me know what you think!

Ok, it's very much tied to the affinity issue.  If we deal with
affinity changes properly, I think we can use the raw_ version.

>
> >>+
> >>+/*
> >>+ * In task isolation mode we try to return to userspace only after
> >>+ * attempting to make sure we won't be interrupted again.  To handle
> >>+ * the periodic scheduler tick, we test to make sure that the tick is
> >>+ * stopped, and if it isn't yet, we request a reschedule so that if
> >>+ * another task needs to run to completion first, it can do so.
> >>+ * Similarly, if any other subsystems require quiescing, we will need
> >>+ * to do that before we return to userspace.
> >>+ */
> >>+bool _task_isolation_ready(void)
> >>+{
> >>+	WARN_ON_ONCE(!irqs_disabled());
> >>+
> >>+	/* If we need to drain the LRU cache, we're not ready. */
> >>+	if (lru_add_drain_needed(smp_processor_id()))
> >>+		return false;
> >>+
> >>+	/* If vmstats need updating, we're not ready. */
> >>+	if (!vmstat_idle())
> >>+		return false;
> >>+
> >>+	/* Request rescheduling unless we are in full dynticks mode. */
> >>+	if (!tick_nohz_tick_stopped()) {
> >>+		set_tsk_need_resched(current);
> >I'm not sure doing this will help get the tick stopped.
>
> Well, I don't know that there is anything else we CAN do, right?  If
> there's another task that can run, great - it may be that that's why
> full dynticks isn't happening yet.  Or, it might be that we're
> waiting for an RCU tick and there's nothing else we can do, in which
> case we basically spend our time going around through the scheduler
> code and back out to the task_isolation_ready() test, but again,
> there's really nothing else more useful we can be doing at this
> point.  Once the RCU tick fires (or whatever it was that was
> preventing full dynticks from engaging), we will pass this test and
> return to user space.

There is nothing at all you can do here, and setting TIF_NEED_RESCHED
won't help either.  If there is another task that can run, the
scheduler takes care of rescheduling by itself :-)
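IOW, the tick test can just report the current state and bail out.
Something like this (again an untested sketch that keeps your
helpers, using raw_smp_processor_id() per the affinity discussion
above):

bool _task_isolation_ready(void)
{
	WARN_ON_ONCE(!irqs_disabled());

	/* If we need to drain the LRU cache, we're not ready. */
	if (lru_add_drain_needed(raw_smp_processor_id()))
		return false;

	/* If vmstats need updating, we're not ready. */
	if (!vmstat_idle())
		return false;

	/*
	 * Report whether the tick is stopped.  No need to set
	 * TIF_NEED_RESCHED here: if another task is runnable, the
	 * scheduler preempts us on its own.
	 */
	return tick_nohz_tick_stopped();
}

Thanks.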