On 05/04/23 14:05, Frederic Weisbecker wrote:
>  static void smp_call_function_many_cond(const struct cpumask *mask,
>  					smp_call_func_t func, void *info,
> @@ -946,10 +948,13 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
>  #endif
>  			cfd_seq_store(pcpu->seq_queue, this_cpu, cpu, CFD_SEQ_QUEUE);
>  			if (llist_add(&csd->node.llist, &per_cpu(call_single_queue, cpu))) {
> -				__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
> -				nr_cpus++;
> -				last_cpu = cpu;
> -
> +				if (!(scf_flags & SCF_NO_USER) ||
> +				    !IS_ENABLED(CONFIG_GENERIC_ENTRY) ||
> +				     ct_state_cpu(cpu) != CONTEXT_USER) {
> +					__cpumask_set_cpu(cpu, cfd->cpumask_ipi);
> +					nr_cpus++;
> +					last_cpu = cpu;
> +				}

I've been hacking on something like this (CSD deferral for NOHZ-full), and
unfortunately this uses the CPU-local cfd_data storage, which means any
further smp_call_function() from the same CPU to the same destination will
spin in csd_lock_wait() until the target CPU comes out of userspace and
flushes the queue. We've just gone to extra effort to *not* disturb that
CPU, so that could take a while :(

I don't have much that is in a shareable state yet (though I'm supposed to
talk some more about it at OSPM in <2 weeks, so I'll have to get there), but
ATM I'm playing with:

o a bitmask (like in [1]) for coalescable work such as do_sync_core() for
  x86 instruction patching
o a CSD-like queue for things that need to pass data around, using
  statically-allocated storage (so with a limit on how much it can be used).
  The alternative is allocating a struct on sending, since there is no bound
  on how much crap can be queued on an undisturbed NOHZ-full CPU...

[1]: https://lore.kernel.org/all/20210929152429.067060646@xxxxxxxxxxxxx/
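To make the two ideas in the list above more concrete, here is a minimal userspace C11 model of them. It is a sketch under stated assumptions, not kernel code and not the implementation being worked on: "CPUs" are just array indices, and every name (defer_work(), flush_deferred_work(), DW_SYNC_CORE, defer_call(), flush_deferred_calls(), DQ_SLOTS, count_call()) is hypothetical. The bitmask part shows why coalescable work needs no per-request storage: repeated requests collapse into one pending bit, which the target drains with a single atomic exchange. The queue part shows the storage trade-off: a fixed number of statically allocated slots per target, so a sender can be refused once they run out. What the sender should do then is left open here (falling back to a real IPI is one assumption).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS  4 /* hypothetical: number of simulated CPUs */
#define DQ_SLOTS 8 /* hypothetical: static per-CPU queue capacity */

/* 1) Coalescable work: one pending bit per kind of work, per CPU. */
#define DW_SYNC_CORE (1u << 0) /* hypothetical bit standing in for do_sync_core() */

static _Atomic unsigned int deferred_work[NR_CPUS];
static int sync_core_runs[NR_CPUS]; /* counts stand-in do_sync_core() runs */

/* Sender: set the bit instead of sending an IPI. Returns true if the bit was
 * newly set, false if that work was already pending (it coalesced). */
static bool defer_work(int cpu, unsigned int bit)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return !(atomic_fetch_or(&deferred_work[cpu], bit) & bit);
}

/* Target, e.g. on its next kernel entry: take and clear all pending bits in
 * one atomic step, then run each requested handler once. */
static void flush_deferred_work(int cpu)
{
	unsigned int pending = atomic_exchange(&deferred_work[cpu], 0u);

	if (pending & DW_SYNC_CORE)
		sync_core_runs[cpu]++; /* stand-in for do_sync_core() */
}

/* 2) Work carrying data: bounded, statically allocated per-CPU queue. */
struct deferred_call {
	void (*func)(void *info);
	void *info;
};

struct deferred_queue {
	_Atomic int lock;         /* tiny spinlock serializing senders and flush */
	unsigned int head, tail;  /* tail - head == number of queued entries */
	struct deferred_call slots[DQ_SLOTS];
};

static struct deferred_queue deferred_queues[NR_CPUS];

static void dq_lock(struct deferred_queue *q)
{
	while (atomic_exchange_explicit(&q->lock, 1, memory_order_acquire))
		; /* spin */
}

static void dq_unlock(struct deferred_queue *q)
{
	atomic_store_explicit(&q->lock, 0, memory_order_release);
}

/* Sender: returns false once the static slots are exhausted. */
static bool defer_call(int cpu, void (*func)(void *), void *info)
{
	struct deferred_queue *q;
	bool queued = false;

	assert(cpu >= 0 && cpu < NR_CPUS);
	q = &deferred_queues[cpu];
	dq_lock(q);
	if (q->tail - q->head < DQ_SLOTS) {
		q->slots[q->tail % DQ_SLOTS] = (struct deferred_call){ func, info };
		q->tail++;
		queued = true;
	}
	dq_unlock(q);
	return queued;
}

/* Target: pop and run queued calls in FIFO order, freeing their slots.
 * Each call runs outside the lock. Returns how many calls were run. */
static int flush_deferred_calls(int cpu)
{
	struct deferred_queue *q = &deferred_queues[cpu];
	int n = 0;

	for (;;) {
		struct deferred_call c;

		dq_lock(q);
		if (q->head == q->tail) {
			dq_unlock(q);
			break;
		}
		c = q->slots[q->head % DQ_SLOTS];
		q->head++;
		dq_unlock(q);
		c.func(c.info);
		n++;
	}
	return n;
}

/* Hypothetical example handler: bumps the int that info points to. */
static void count_call(void *info) { ++*(int *)info; }
```

The relevant difference from the cfd_data scheme quoted above is that a sender in this model never waits for the target to release a per-CPU csd: it either records its request (or finds it already pending), or is told immediately that there is no room.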