On Sun, May 22, 2011 at 5:25 PM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
On Fri, 20 May 2011 18:26:40 -0700
Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Sat, 21 May 2011 09:41:50 +0900 Hiroyuki Kamezawa <kamezawa.hiroyuki@xxxxxxxxx> wrote:
>
> > 2011/5/21 Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>:
> > > On Fri, 20 May 2011 12:48:37 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> > >
> > >> workqueue for memory cgroup asynchronous memory shrinker.
> > >>
> > >> This patch implements the workqueue for the async shrinker routine. Each
> > >> memcg has a work item, and only one work item can be scheduled at a time.
> > >>
> > >> If shrinking memory doesn't go well, a delay is added to the work.
> > >>
> > >
> > > When this code explodes (as it surely will), users will see large
> > > amounts of CPU consumption in the work queue thread.  We want to make
> > > this as easy to debug as possible, so we should try to make the
> > > workqueue's names mappable back onto their memcg's.  And anything else
> > > we can think of to help?
When we debug kswapd issues in a memory isolation environment, the first step is to identify which cgroup
the kswapd thread is working on. We need an easy way to make that mapping directly, either by reading an API or
by just looking at "top", so making the kworker names map back to their memcg helps here.
We also need an easy way to track the amount of cputime consumed by the kswapd on a per-memcg basis. We could
probably export that number in the per-memcg memory.stat; Kame has a patch for that from the last post.
> > > I think so. Each memcg should show how much cpu it has used.
> >
> > I had a patch for showing per-memcg reclaim latency stats. It will help.
> > I'll add it again to this set. I dropped it only because there are many patches
> > against memory.stat in flight.
>
> Will that patch help us when users report the memcg equivalent of
> "kswapd uses 99% of CPU"?
>
But maybe that's not an easy interface for this. I have several ideas.
One idea is to rename task->comm, overwriting it from kworker/u:%d to memcg/%d while
the work is scheduled. I think this can be implemented with a very simple interface
and a flag to the workqueue. Then "ps -elf" can show what is going on.
If necessary, I'll add a hard limit on cpu usage for a work item, or limit
the number of threads for the memcg workqueue.
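Roughly, the idea would be something like the following. This is just a sketch,
not part of the posted patch; it assumes set_task_comm()/get_task_comm() and
css_id() can be used from the work function:

/*
 * Sketch only: rename the kworker while it runs a memcg's async shrink
 * work, then restore the generic name afterwards.
 */
static void mem_cgroup_async_shrink_worker(struct work_struct *work)
{
        struct mem_cgroup *mem = container_of(to_delayed_work(work),
                                              struct mem_cgroup, async_work);
        char oldcomm[TASK_COMM_LEN], newcomm[TASK_COMM_LEN];

        get_task_comm(oldcomm, current);
        snprintf(newcomm, sizeof(newcomm), "memcg/%d", css_id(&mem->css));
        set_task_comm(current, newcomm);

        /* ... do the actual shrinking here ... */

        set_task_comm(current, oldcomm);
}

With that, "ps -elf" or top would show memcg/<id> instead of kworker/u:%d while
the shrinker is running.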
Does it make sense to use memcg/css->id as the name, if that is not already the case? Otherwise
it is hard to link the kworker/%d (or memcg/%d later) back to the memcg it is working on.
In the last post of the per-memcg kswapd implementation, I had the threads named "memcg-css_id",
and also added a per-memcg API to export its css_id, so we can easily map each kernel thread to its
owner.
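Roughly what we did there, as a sketch from memory (memcg_kswapd_fn and the
kswapd_task field are placeholder names, not the exact code):

        /* one reclaim thread per memcg, named after its css_id */
        mem->kswapd_task = kthread_run(memcg_kswapd_fn, mem,
                                       "memcg-%d", css_id(&mem->css));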
Considering there are users who run 2000+ memcgs on a system, a thread per memcg
was not an option for me.
That is only about 2000 * 8k = 16M worth of memory, against a machine capacity that is probably
very large already if it is running 2000+ memcgs. We've run systems with 1000+ kswapds without noticing trouble from that. What it
buys us is better visibility (plus the option of a cpu limit per memcg kswapd) and debuggability.
Sorry, I know we have discussed this before on another thread, but I can't keep myself from repeating it here :( I just want to provide a datapoint: we run lots of kswapd threads (> 1000) per host, and that has not caused us any of the issues you were concerned about. :)
--Ying
Another idea was a thread pool instead of a workqueue. Because a thread
pool would be a poor reimplementation of workqueue, I used workqueue.
I'll implement some of the ideas above in the next version.
Hmm.
> > >
> > >> +        limit = res_counter_read_u64(&mem->res, RES_LIMIT);
> > >> +        shrink_to = limit - MEMCG_ASYNC_MARGIN - PAGE_SIZE;
> > >> +        usage = res_counter_read_u64(&mem->res, RES_USAGE);
> > >> +        if (shrink_to <= usage) {
> > >> +                required = usage - shrink_to;
> > >> +                required = (required >> PAGE_SHIFT) + 1;
> > >> +                /*
> > >> +                 * This scans some number of pages and returns whether memory
> > >> +                 * reclaim was slow or not. If slow, we add a delay as
> > >> +                 * congestion_wait() does in vmscan.c
> > >> +                 */
> > >> +                congested = mem_cgroup_shrink_static_scan(mem, (long)required);
> > >> +        }
> > >> +        if (test_bit(ASYNC_NORESCHED, &mem->async_flags)
> > >> +            || mem_cgroup_async_should_stop(mem))
> > >> +                goto finish_scan;
> > >> +        /* If memory reclaim couldn't go well, add delay */
> > >> +        if (congested)
> > >> +                delay = HZ/10;
> > >
> > > Another magic number.
> > >
> > > If Moore's law holds, we need to reduce this number by 1.4 each year.
> > > Is this good?
> > >
> >
> > not good. I just used the same magic number that wait_iff_congested() uses.
> > Other than a timer, I can use the pagein/pageout event counters. If we have
> > dirty_ratio,
> > I may be able to link this to dirty_ratio and wait until it is low enough.
> > Or, wake up again when the limit is hit.
> >
> > Do you have suggestion ?
> >
>
> mm.. It would be pretty easy to generate an estimate of "pages scanned
> per second" from the contents of (and changes in) the scan_control.
Ok, I'll consider that. Thank you for the nice idea.
> Konwing that datum and knowing the number of pages in the memcg, we
> should be able to come up with a delay period which scales
> appropriately with CPU speed and with memory size?
>
> Such a thing could be used to rationalise magic delays in other places,
> hopefully.
>
I'll find a better name ;)
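For the scan-rate idea, maybe something like this. Just a sketch; "scanned"
and "start" would have to be tracked around the shrink pass, and none of the
names below are in the current patch:

/*
 * Derive the requeue delay from the observed scan rate instead of a
 * fixed HZ/10: wait roughly long enough to scan ~1% of the memcg at
 * the observed rate, so the delay scales with CPU speed and memcg size.
 */
static unsigned long async_shrink_delay(unsigned long scanned,
                                        unsigned long start,
                                        unsigned long memcg_pages)
{
        unsigned long elapsed = jiffies - start;        /* ticks spent scanning */
        unsigned long rate = scanned / (elapsed ? elapsed : 1);

        if (!rate)
                rate = 1;
        return min_t(unsigned long, HZ, (memcg_pages / 100) / rate);
}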
> >
> > >> +        queue_delayed_work(memcg_async_shrinker, &mem->async_work, delay);
> > >> +        return;
> > >> +finish_scan:
> > >> +        cgroup_release_and_wakeup_rmdir(&mem->css);
> > >> +        clear_bit(ASYNC_RUNNING, &mem->async_flags);
> > >> +        return;
> > >> +}
> > >> +
> > >> +static void run_mem_cgroup_async_shrinker(struct mem_cgroup *mem)
> > >> +{
> > >> +        if (test_bit(ASYNC_NORESCHED, &mem->async_flags))
> > >> +                return;
> > >
> > > I can't work out what ASYNC_NORESCHED does.  Is its name well-chosen?
> > >
> > how about BLOCK/STOP_ASYNC_RECLAIM ?
>
> I can't say - I don't know what it does! Or maybe I did, and immediately
> forgot ;)
>
Thanks,
-Kame