On 2021/10/19 12:43 PM, Alexei Starovoitov wrote:
> On Mon, Oct 18, 2021 at 9:31 PM Chengming Zhou
> <zhouchengming@xxxxxxxxxxxxx> wrote:
>>
>> On 2021/10/19 11:45 AM, Alexei Starovoitov wrote:
>>> On Mon, Oct 18, 2021 at 8:14 PM Chengming Zhou
>>> <zhouchengming@xxxxxxxxxxxxx> wrote:
>>>>
>>>> On 2021/10/19 9:57 AM, Alexei Starovoitov wrote:
>>>>> On Sun, Oct 17, 2021 at 10:49 PM Chengming Zhou
>>>>> <zhouchengming@xxxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 2021/10/16 3:58 AM, Alexei Starovoitov wrote:
>>>>>>> On Fri, Oct 15, 2021 at 11:04 AM Chengming Zhou
>>>>>>> <zhouchengming@xxxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> We only use count for the kmalloc hashtab, not for the prealloc
>>>>>>>> hashtab, because __pcpu_freelist_pop() returns NULL when there are
>>>>>>>> no more elements in the pcpu freelist.
>>>>>>>>
>>>>>>>> But the problem is that __pcpu_freelist_pop() will traverse all
>>>>>>>> CPUs and take each CPU's spinlock only to find in the end that
>>>>>>>> there are no more elements.
>>>>>>>>
>>>>>>>> We encountered a bad case on a big system with 96 CPUs where
>>>>>>>> alloc_htab_elem() would last for 1ms. This patch uses count for
>>>>>>>> the prealloc hashtab too, avoiding the traversal and spinlocking
>>>>>>>> of all CPUs in this case.
>>>>>>>>
>>>>>>>> Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
>>>>>>>
>>>>>>> It's not clear from the commit log what you're solving.
>>>>>>> The atomic inc/dec in the critical path of prealloc maps hurts
>>>>>>> performance. That's why it's not used.
>>>>>>>
>>>>>> Thanks for the explanation. What I'm solving is that when the hash
>>>>>> table has no free elements, we don't need to call
>>>>>> __pcpu_freelist_pop() to traverse and spin_lock all CPUs. The ftrace
>>>>>> output of this bad case is below:
>>>>>>
>>>>>> 50)               |  htab_map_update_elem() {
>>>>>> 50)   0.329 us    |    _raw_spin_lock_irqsave();
>>>>>> 50)   0.063 us    |    lookup_elem_raw();
>>>>>> 50)               |    alloc_htab_elem() {
>>>>>> 50)               |      pcpu_freelist_pop() {
>>>>>> 50)   0.209 us    |        _raw_spin_lock();
>>>>>> 50)   0.264 us    |        _raw_spin_lock();
>>>>>
>>>>> This is LRU map. Not hash map.
>>>>> It will grab spin_locks of other cpus
>>>>> only if all previous cpus don't have free elements.
>>>>> Most likely your map is actually full and doesn't have any free elems.
>>>>> Since it's an lru it will force free an elem eventually.
>>>>>
>>>> Maybe I missed something, but the map_update_elem function of the LRU
>>>> map is htab_lru_map_update_elem(), and the htab_map_update_elem()
>>>> above is the map_update_elem function of the hash map.
>>>> Because of the percpu freelist implementation used in the hash map,
>>>> it will spin_lock all other CPUs when there are no free elements.
>>>
>>> Ahh. Right. Then what's the point of optimizing the error case
>>> at the expense of the fast path?
>>>
>> Yes, this patch only optimizes the very bad case where no free elements
>> are left, and it adds an atomic operation to the fast path. Maybe the
>> better workaround is not allowing a full hash map in our case, or using
>> an LRU map?
>
> No idea, since you haven't explained your use case.
>
>> But the problem of spinlock contention also exists even when the map is
>> not full: some CPUs may run out of their freelists while others are
>> seldom used, so they have to grab those CPUs' spinlocks to get a free
>> element.
>
> In theory that would be correct. Do you see it in practice?
> Please describe the use case.
>
>> Should we change the current percpu freelist implementation to a percpu
>> lockless freelist?
>
> Like llist.h? That was tried already, and for typical hash map usage
> it's slower than the percpu free list.
> Many progs still do a lot of hash map update/delete on all cpus at once.
> That is the use case the hashmap is optimized for.
> Please see commit 6c9059817432 ("bpf: pre-allocate hash map elements"),
> which also lists the different alternative algorithms that were
> benchmarked.
>
Ok, I will figure out our use case, try these alternatives and collect
some data first. Thanks for your explanation.
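
For reference, below is a minimal userspace sketch of the traversal
pattern I'm describing. It is only an illustration, not the kernel's
__pcpu_freelist_pop() itself: the NR_CPUS value, the freelist_* names
and the pthread spinlocks are made up for the example. It just shows how
a pop on an empty map ends up taking every per-CPU lock before it can
return NULL.

/*
 * Simplified userspace model of the pop path discussed above, NOT the
 * kernel implementation: each "CPU" owns a spinlock-protected list, and
 * a pop that finds its own list empty walks every other CPU's list,
 * taking each lock in turn. When all lists are empty, the caller pays
 * for NR_CPUS lock acquisitions just to learn that nothing is left.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_CPUS 4                    /* stand-in for the real CPU count */

struct freelist_node {
	struct freelist_node *next;
};

struct freelist_head {
	pthread_spinlock_t lock;     /* models the per-CPU raw spinlock */
	struct freelist_node *first;
};

static struct freelist_head heads[NR_CPUS];

static struct freelist_node *freelist_pop(int this_cpu)
{
	/* Start with the local CPU, then try every other CPU in order. */
	for (int i = 0; i < NR_CPUS; i++) {
		struct freelist_head *h = &heads[(this_cpu + i) % NR_CPUS];
		struct freelist_node *node;

		pthread_spin_lock(&h->lock);
		node = h->first;
		if (node)
			h->first = node->next;
		pthread_spin_unlock(&h->lock);

		if (node)
			return node;
		/* Empty: fall through and take the next CPU's lock too. */
	}
	return NULL;	/* every per-CPU lock was taken before giving up */
}

int main(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		pthread_spin_init(&heads[cpu].lock, PTHREAD_PROCESS_PRIVATE);

	/* All lists start empty, so this pop locks every list and fails. */
	printf("pop -> %p\n", (void *)freelist_pop(0));
	return 0;
}

With 96 lists instead of 4 and the locks contended, that final NULL is
what shows up as the ~1ms alloc_htab_elem() in the ftrace output above.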