+Tejun Heo

[ threads start at https://lore.kernel.org/all/20210905124439.GA15026@xsang-OptiPlex-9020/T/#ma938a101f415ad784ac08612c7ef31f260a2b678]

On Mon, Sep 13, 2021 at 9:41 AM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
>
> On Sun, Sep 12, 2021 at 6:29 AM Feng Tang <feng.tang@xxxxxxxxx> wrote:
> >
> > On Sun, Sep 12, 2021 at 07:17:56PM +0800, Hillf Danton wrote:
> > [...]
> > > > +//	if (!(__this_cpu_inc_return(stats_flush_threshold) % MEMCG_CHARGE_BATCH))
> > > > +	if (!(__this_cpu_inc_return(stats_flush_threshold) % 128))
> > > > 		queue_work(system_unbound_wq, &stats_flush_work);
> > > > }
> > >
> > > Hi Feng,
> > >
> > > Would you please check if it helps fix the regression to avoid queuing an
> > > already-queued work by adding and checking an atomic counter?
> >
> > Hi Hillf,
> >
> > I just tested your patch, and it didn't recover the regression; it only
> > reduced it from -14% to around -13%, similar to the patch increasing the
> > batch charge number.
> >
>
> Thanks Hillf for taking a look and Feng for running the test.
>
> This shows that parallel calls to queue_work() are not the issue (there
> is already a test-and-set at the start of queue_work()); rather, the
> actual work done by queue_work() is costly for this code path.
>
> I wrote a simple anon page fault nohuge C program and profiled it, and
> queue_work() shows up as significant:
>
> - 51.00% do_anonymous_page
>    + 16.68% alloc_pages_vma
>      11.61% _raw_spin_lock
>    + 10.26% mem_cgroup_charge
>    - 5.25% lru_cache_add_inactive_or_unevictable
>       - 4.48% __pagevec_lru_add
>          - 3.71% __pagevec_lru_add_fn
>             - 1.74% __mod_lruvec_state
>                - 1.60% __mod_memcg_lruvec_state
>                   - 1.35% queue_work_on
>                      - __queue_work
>                         - 0.93% wake_up_process
>                            - try_to_wake_up
>                               - 0.82% ttwu_queue
>                                    0.61% ttwu_do_activate
>    - 2.97% page_add_new_anon_rmap
>       - 2.68% __mod_lruvec_page_state
>          - 2.48% __mod_memcg_lruvec_state
>             - 1.67% queue_work_on
>                - 1.53% __queue_work
>                   - 1.25% wake_up_process
>                      - try_to_wake_up
>                         - 0.94% ttwu_queue
>                            + 0.70% ttwu_do_activate
>               0.61% cgroup_rstat_updated
>      2.10% rcu_read_unlock_strict
>      1.40% cgroup_throttle_swaprate
>
> However, when I switch the batch size to 128, it goes away.
>

I did one more experiment with the same workload but with system_wq
instead of system_unbound_wq, and there is a clear difference in the
profile.

With system_unbound_wq:

 - 4.63%  0.33%  mmap  [kernel.kallsyms]  [k] queue_work_on
      4.29% queue_work_on
       - __queue_work
          - 3.45% wake_up_process
             - try_to_wake_up
                - 2.46% ttwu_queue
                   - 1.66% ttwu_do_activate
                      - 1.14% activate_task
                         - 0.97% enqueue_task_fair
                              enqueue_entity

With system_wq:

 - 1.36%  0.06%  mmap  [kernel.kallsyms]  [k] queue_work_on
      1.30% queue_work_on
       - __queue_work
          - 1.03% wake_up_process
             - try_to_wake_up
                - 0.97% ttwu_queue
                     0.66% ttwu_do_activate

Tejun, is this expected? i.e. does queuing work on system_wq have a
different performance impact than queuing it on system_unbound_wq?
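
For readers following the thread, the pattern being tuned above is a
threshold-batched asynchronous flush: the hot path only bumps a per-CPU
counter, and only every Nth update pays for queue_work(). Below is a
minimal userspace sketch of that idea, combined with the "skip if a
flush is already pending" check Hillf suggested. It is an illustrative
analogue only, not the kernel code: STATS_FLUSH_THRESHOLD,
stats_updated() and stats_flush_work() are made-up names, the counter
here is a single global atomic rather than per-CPU, and the "work" runs
inline instead of on a workqueue.

/*
 * Userspace sketch of threshold-batched flushing with a
 * "flush already pending" guard.  Build: cc -O2 -pthread flush.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define STATS_FLUSH_THRESHOLD	128	/* analogous to the "% 128" batch above */
#define NTHREADS		4
#define UPDATES_PER_THREAD	100000

static atomic_ulong update_count;	/* stand-in for the per-CPU counter */
static atomic_int flush_pending;	/* 1 while a flush is "queued"/running */
static atomic_ulong flushes_done;

/* Stand-in for the flush work item; the real thing would aggregate stats. */
static void stats_flush_work(void)
{
	atomic_fetch_add(&flushes_done, 1);
	atomic_store(&flush_pending, 0);
}

/* Hot path: called on every stats update. */
static void stats_updated(void)
{
	unsigned long n = atomic_fetch_add(&update_count, 1) + 1;

	/* Most updates pay only for the counter increment. */
	if (n % STATS_FLUSH_THRESHOLD)
		return;

	/* Only one caller triggers the flush; others see it pending and skip. */
	if (atomic_exchange(&flush_pending, 1) == 0)
		stats_flush_work();
}

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < UPDATES_PER_THREAD; i++)
		stats_updated();
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);

	printf("updates=%lu flushes=%lu\n",
	       atomic_load(&update_count), atomic_load(&flushes_done));
	return 0;
}

Raising the threshold (from MEMCG_CHARGE_BATCH to 128, as in the patch
quoted above) makes the fault path pay the queue_work()/wakeup cost
proportionally less often, which is consistent with the regression
disappearing from the profiles once the batch size is 128.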