On 2020/7/7 11:23 PM, Pekka Enberg wrote:
> Hi!
>
> (Sorry for the delay, I missed your response.)
>
> On Fri, Jul 3, 2020 at 12:38 PM xunlei <xlpang@xxxxxxxxxxxxxxxxx> wrote:
>>
>> On 2020/7/2 7:59 PM, Pekka Enberg wrote:
>>> On Thu, Jul 2, 2020 at 11:32 AM Xunlei Pang <xlpang@xxxxxxxxxxxxxxxxx> wrote:
>>>> The node list_lock in count_partial() is held for a long time while
>>>> iterating over large partial page lists, which can cause a thundering
>>>> herd effect on list_lock contention, e.g. it causes business
>>>> response-time jitter when "/proc/slabinfo" is accessed in our
>>>> production environments.
>>>
>>> Would you have any numbers to share to quantify this jitter? I have no
>>
>> We have HSF RT (High-speed Service Framework Response-Time) monitors;
>> the RT figures fluctuated randomly, so we deployed a tool that detects
>> "irq off" and "preempt off" and dumps the culprit's calltrace. It
>> captured the list_lock being held for up to 100ms with irqs off,
>> triggered by "ss", which also caused network timeouts.
>
> Thanks for the follow up. This sounds like a good enough motivation
> for this patch, but please include it in the changelog.
>
>>> objections to this approach, but I think the original design
>>> deliberately made reading "/proc/slabinfo" more expensive to avoid
>>> atomic operations in the allocation/deallocation paths. It would be
>>> good to understand what is the gain of this approach before we switch
>>> to it. Maybe even run some slab-related benchmark (not sure if there's
>>> something better than hackbench these days) to see if the overhead of
>>> this approach shows up.
>>
>> I thought about that before, but most atomic operations are serialized
>> by the list_lock. Another possible way is to hold list_lock in
>> __slab_free(), then these two counters can be changed from atomic to
>> long.
>>
>> I also have no idea what the standard SLUB benchmark for regression
>> testing is, any specific suggestion?
>
> I don't know what people use these days. When I did benchmarking in
> the past, hackbench and netperf were known to be slab-allocation
> intensive macro-benchmarks. Christoph also had some SLUB
> micro-benchmarks, but I don't think we ever merged them into the tree.

I tested hackbench on a 24-CPU machine, here are the results of
"hackbench 20 thread 1000":

== original (without any patch)
Time: 53.793
Time: 54.305
Time: 54.073

== with my patch 1~2
Time: 54.036
Time: 53.840
Time: 54.066
Time: 53.449

== with my patch 1~2, plus using a percpu partial free objects counter
Time: 53.303
Time: 52.994
Time: 53.218
Time: 53.268
Time: 53.739
Time: 53.072

The results show no performance regression; it's a bit strange that the
figures even get a little better when using the percpu counter.

Thanks,
Xunlei
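
P.S. For anyone following along, this is roughly what count_partial()
looks like in mm/slub.c (simplified; exact list field names vary a bit
by kernel version). The whole list walk runs with the node's list_lock
held and irqs disabled, which is where the 100ms stall comes from:

static unsigned long count_partial(struct kmem_cache_node *n,
				   int (*get_count)(struct page *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct page *page;

	/* every /proc/slabinfo reader takes the per-node lock... */
	spin_lock_irqsave(&n->list_lock, flags);
	/* ...and walks every page currently on the partial list */
	list_for_each_entry(page, &n->partial, slab_list)
		x += get_count(page);
	spin_unlock_irqrestore(&n->list_lock, flags);
	return x;
}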
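
The "percpu partial free objects counter" variant mentioned above is
along these lines. This is only an illustrative sketch of the idea, not
the actual patch; the struct and function names below are made up:

#include <linux/percpu_counter.h>

/* hypothetical mirror of kmem_cache_node with an extra counter;
 * percpu_counter_init() would be needed at node init time */
struct kmem_cache_node_sketch {
	spinlock_t list_lock;
	unsigned long nr_partial;
	struct list_head partial;
	struct percpu_counter partial_free_objs;	/* hypothetical field */
};

/* update sites: wherever free objects are added to or removed from the
 * per-node partial lists, adjust the counter by the delta */
static inline void partial_free_objs_add(struct kmem_cache_node_sketch *n,
					 long delta)
{
	percpu_counter_add(&n->partial_free_objs, delta);
}

/* slabinfo side: sum the percpu counter (O(nr_cpus)) instead of walking
 * every page on n->partial under list_lock */
static unsigned long partial_free_objs_read(struct kmem_cache_node_sketch *n)
{
	return percpu_counter_sum_positive(&n->partial_free_objs);
}

The point is that the expensive list walk in count_partial() becomes a
counter read, so "cat /proc/slabinfo" (or "ss") no longer holds
list_lock across a long iteration with irqs disabled.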