+Tejun Heo

[ threads start at https://lore.kernel.org/all/20210905124439.GA15026@xsang-OptiPlex-9020/T/#ma938a101f415ad784ac08612c7ef31f260a2b678]

On Mon, Sep 13, 2021 at 9:41 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Sun, Sep 12, 2021 at 6:29 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> >
> > On Sun, Sep 12, 2021 at 07:17:56PM +0800, Hillf Danton wrote:
> > [...]
> > > > +//	if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
> > > > +	if (!(__this_cpu_inc_return(stats_flush_threshold) % 128))
> > > > 		queue_work(system_unbound_wq, &stats_flush_work);
> > > > }
> > >
> > > Hi Feng,
> > >
> > > Would you please check if it helps fix the regression to avoid queuing an
> > > already-queued work by adding and checking an atomic counter?
> >
> > Hi Hillf,
> >
> > I just tested your patch, and it didn't recover the regression; it only
> > reduced it from -14% to around -13%, similar to the patch increasing the
> > batch charge number.
> >
>
> Thanks Hillf for taking a look and Feng for running the test.
>
> This shows that parallel calls to queue_work() are not the issue (there
> is already a test-and-set at the start of queue_work()); rather, the
> actual work done by queue_work() is costly for this code path.
>
> I wrote a simple anon page fault nohuge C program and profiled it, and
> queue_work() shows up as significant:
>
> - 51.00% do_anonymous_page
>    + 16.68% alloc_pages_vma
>      11.61% _raw_spin_lock
>    + 10.26% mem_cgroup_charge
>    - 5.25% lru_cache_add_inactive_or_unevictable
>       - 4.48% __pagevec_lru_add
>          - 3.71% __pagevec_lru_add_fn
>             - 1.74% __mod_lruvec_state
>                - 1.60% __mod_memcg_lruvec_state
>                   - 1.35% queue_work_on
>                      - __queue_work
>                         - 0.93% wake_up_process
>                            - try_to_wake_up
>                               - 0.82% ttwu_queue
>                                    0.61% ttwu_do_activate
>    - 2.97% page_add_new_anon_rmap
>       - 2.68% __mod_lruvec_page_state
>          - 2.48% __mod_memcg_lruvec_state
>             - 1.67% queue_work_on
>                - 1.53% __queue_work
>                   - 1.25% wake_up_process
>                      - try_to_wake_up
>                         - 0.94% ttwu_queue
>                            + 0.70% ttwu_do_activate
>               0.61% cgroup_rstat_updated
>      2.10% rcu_read_unlock_strict
>      1.40% cgroup_throttle_swaprate
>
> However, when I switch the batch size to 128, it goes away.
>

I did one more experiment with the same workload but with system_wq
instead of system_unbound_wq, and there is a clear difference in the
profile.

With system_unbound_wq:

 - 4.63%  0.33%  mmap  [kernel.kallsyms]  [k] queue_work_on
      4.29% queue_work_on
       - __queue_work
          - 3.45% wake_up_process
             - try_to_wake_up
                - 2.46% ttwu_queue
                   - 1.66% ttwu_do_activate
                      - 1.14% activate_task
                         - 0.97% enqueue_task_fair
                              enqueue_entity

With system_wq:

 - 1.36%  0.06%  mmap  [kernel.kallsyms]  [k] queue_work_on
      1.30% queue_work_on
       - __queue_work
          - 1.03% wake_up_process
             - try_to_wake_up
                - 0.97% ttwu_queue
                     0.66% ttwu_do_activate

Tejun, is this expected? i.e. does queuing work on system_wq have a
different performance impact than queuing it on system_unbound_wq?
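
For readers following the thread, the pattern being tuned above is a
threshold-batched asynchronous flush: the hot path only bumps a per-CPU
counter, and only every Nth update pays for queue_work(). Below is a
minimal userspace sketch of that idea, combined with the "skip if a
flush is already pending" check Hillf suggested. It is an illustrative
analogue only, not the kernel code: STATS_FLUSH_THRESHOLD,
stats_updated() and stats_flush_work() are made-up names, the counter
here is a single global atomic rather than per-CPU, and the "work" runs
inline instead of on a workqueue.

/*
 * Userspace sketch of threshold-batched flushing with a
 * "flush already pending" guard.  Build: cc -O2 -pthread flush.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define STATS_FLUSH_THRESHOLD	128	/* analogous to the "% 128" batch above */
#define NTHREADS		4
#define UPDATES_PER_THREAD	100000

static atomic_ulong update_count;	/* stand-in for the per-CPU counter */
static atomic_int flush_pending;	/* 1 while a flush is "queued"/running */
static atomic_ulong flushes_done;

/* Stand-in for the flush work item; the real thing would aggregate stats. */
static void stats_flush_work(void)
{
	atomic_fetch_add(&flushes_done, 1);
	atomic_store(&flush_pending, 0);
}

/* Hot path: called on every stats update. */
static void stats_updated(void)
{
	unsigned long n = atomic_fetch_add(&update_count, 1) + 1;

	/* Most updates pay only for the counter increment. */
	if (n % STATS_FLUSH_THRESHOLD)
		return;

	/* Only one caller triggers the flush; others see it pending and skip. */
	if (atomic_exchange(&flush_pending, 1) == 0)
		stats_flush_work();
}

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < UPDATES_PER_THREAD; i++)
		stats_updated();
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);

	printf("updates=%lu flushes=%lu\n",
	       atomic_load(&update_count), atomic_load(&flushes_done));
	return 0;
}

Raising the threshold (from MEMCG_CHARGE_BATCH to 128, as in the patch
quoted above) makes the fault path pay the queue_work()/wakeup cost
proportionally less often, which is consistent with the regression
disappearing from the profiles once the batch size is 128.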