On Tue, May 17, 2022 at 11:27 PM Feng zhou <zhoufeng.zf@xxxxxxxxxxxxx> wrote:
>
> From: Feng Zhou <zhoufeng.zf@xxxxxxxxxxxxx>
>
> We encountered bad case on big system with 96 CPUs that
> alloc_htab_elem() would last for 1ms. The reason is that after the
> prealloc hashtab has no free elems, when trying to update, it will still
> grab spin_locks of all cpus. If there are multiple update users, the
> competition is very serious.
>
> So this patch add is_empty in pcpu_freelist_head to check freelist
> having free or not. If having, grab spin_lock, or check next cpu's
> freelist.
>
> Before patch: hash_map performance
> ./map_perf_test 1
> 0:hash_map_perf pre-alloc 975345 events per sec
> 4:hash_map_perf pre-alloc 855367 events per sec
> 12:hash_map_perf pre-alloc 860862 events per sec
> 8:hash_map_perf pre-alloc 849561 events per sec
> 3:hash_map_perf pre-alloc 849074 events per sec
> 6:hash_map_perf pre-alloc 847120 events per sec
> 10:hash_map_perf pre-alloc 845047 events per sec
> 5:hash_map_perf pre-alloc 841266 events per sec
> 14:hash_map_perf pre-alloc 849740 events per sec
> 2:hash_map_perf pre-alloc 839598 events per sec
> 9:hash_map_perf pre-alloc 838695 events per sec
> 11:hash_map_perf pre-alloc 845390 events per sec
> 7:hash_map_perf pre-alloc 834865 events per sec
> 13:hash_map_perf pre-alloc 842619 events per sec
> 1:hash_map_perf pre-alloc 804231 events per sec
> 15:hash_map_perf pre-alloc 795314 events per sec
>
> hash_map the worst: no free
> ./map_perf_test 2048
> 6:worse hash_map_perf pre-alloc 28628 events per sec
> 5:worse hash_map_perf pre-alloc 28553 events per sec
> 11:worse hash_map_perf pre-alloc 28543 events per sec
> 3:worse hash_map_perf pre-alloc 28444 events per sec
> 1:worse hash_map_perf pre-alloc 28418 events per sec
> 7:worse hash_map_perf pre-alloc 28427 events per sec
> 13:worse hash_map_perf pre-alloc 28330 events per sec
> 14:worse hash_map_perf pre-alloc 28263 events per sec
> 9:worse hash_map_perf pre-alloc 28211 events per sec
> 15:worse hash_map_perf pre-alloc 28193 events per sec
> 12:worse hash_map_perf pre-alloc 28190 events per sec
> 10:worse hash_map_perf pre-alloc 28129 events per sec
> 8:worse hash_map_perf pre-alloc 28116 events per sec
> 4:worse hash_map_perf pre-alloc 27906 events per sec
> 2:worse hash_map_perf pre-alloc 27801 events per sec
> 0:worse hash_map_perf pre-alloc 27416 events per sec
> 3:worse hash_map_perf pre-alloc 28188 events per sec
>
> ftrace trace
>
> 0)               |  htab_map_update_elem() {
> 0)   0.198 us    |    migrate_disable();
> 0)               |    _raw_spin_lock_irqsave() {
> 0)   0.157 us    |      preempt_count_add();
> 0)   0.538 us    |    }
> 0)   0.260 us    |    lookup_elem_raw();
> 0)               |    alloc_htab_elem() {
> 0)               |      __pcpu_freelist_pop() {
> 0)               |        _raw_spin_lock() {
> 0)   0.152 us    |          preempt_count_add();
> 0)   0.352 us    |          native_queued_spin_lock_slowpath();
> 0)   1.065 us    |        }
>                  |        ...
> 0)               |        _raw_spin_unlock() {
> 0)   0.254 us    |          preempt_count_sub();
> 0)   0.555 us    |        }
> 0) + 25.188 us   |      }
> 0) + 25.486 us   |    }
> 0)               |    _raw_spin_unlock_irqrestore() {
> 0)   0.155 us    |      preempt_count_sub();
> 0)   0.454 us    |    }
> 0)   0.148 us    |    migrate_enable();
> 0) + 28.439 us   |  }
>
> The test machine is 16C, trying to get spin_lock 17 times, in addition
> to 16c, there is an extralist.

Is this with a small max_entries and a large number of CPUs?
If so, a better fix would probably be to artificially
bump max_entries to 4x num_cpus.
A racy is_empty check still wastes the loop.
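
To make the trade-off concrete, here is a rough userspace sketch of the two ideas discussed above. It is not the kernel code: struct freelist_head, freelist_pop(), the lock counter and bumped_max_entries() are hypothetical stand-ins, locking is only counted rather than performed, and everything runs single-threaded, so the is_empty hint is never actually stale here. In real concurrent code the unlocked read would need READ_ONCE()-style access. Reading the "4x of num_cpus" suggestion as a lower bound on max_entries is also an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS  16             /* the 16C test machine from the quoted report */
#define NR_LISTS (NR_CPUS + 1)  /* per-CPU lists plus one extra list */

struct node { struct node *next; };

/* Simplified stand-in for a per-CPU freelist head (not the kernel struct). */
struct freelist_head {
	struct node *first;
	bool is_empty;        /* hint read without the lock, so it may be stale */
	unsigned long locks;  /* number of simulated spin_lock acquisitions */
};

static struct freelist_head heads[NR_LISTS];

/* Empty every list and clear the lock counters. */
static void freelist_reset(void)
{
	for (int i = 0; i < NR_LISTS; i++) {
		heads[i].first = NULL;
		heads[i].is_empty = true;
		heads[i].locks = 0;
	}
}

/* Push a free element onto one list; this always takes that list's lock. */
static void freelist_push(int idx, struct node *n)
{
	assert(idx >= 0 && idx < NR_LISTS);
	heads[idx].locks++;
	n->next = heads[idx].first;
	heads[idx].first = n;
	heads[idx].is_empty = false;
}

/*
 * Pop a free element, visiting CPUs in ring order from start_cpu and the
 * extra list last. With check_empty set, lists whose hint says "empty"
 * are skipped without taking the lock, but the loop still visits them.
 */
static struct node *freelist_pop(int start_cpu, bool check_empty)
{
	assert(start_cpu >= 0 && start_cpu < NR_CPUS);
	for (int i = 0; i < NR_LISTS; i++) {
		int idx = i < NR_CPUS ? (start_cpu + i) % NR_CPUS : NR_CPUS;
		struct freelist_head *h = &heads[idx];
		struct node *n;

		if (check_empty && h->is_empty)
			continue;
		h->locks++;
		n = h->first;
		if (n) {
			h->first = n->next;
			h->is_empty = (h->first == NULL);
			return n;
		}
	}
	return NULL;
}

/* Sum of simulated lock acquisitions across all lists. */
static unsigned long total_locks(void)
{
	unsigned long sum = 0;

	for (int i = 0; i < NR_LISTS; i++)
		sum += heads[i].locks;
	return sum;
}

/* One reading of the alternative: never preallocate fewer than 4 per CPU. */
static unsigned int bumped_max_entries(unsigned int max_entries,
				       unsigned int num_cpus)
{
	unsigned int floor = 4 * num_cpus;

	return max_entries < floor ? floor : max_entries;
}
```

With every list empty, freelist_pop(0, false) takes 17 simulated locks, matching the "17 times" in the quoted report, while freelist_pop(0, true) takes none. Both variants still walk all 17 heads, which is the wasted loop mentioned in the reply.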