On Fri, Apr 15, 2011 at 1:14 AM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
On Thu, 14 Apr 2011 23:08:40 -0700
Ying Han wrote:
> > As you know, memcg works against user's memory, memory should be in highmem
> > zone.
> > Memcg-kswapd is not for memory-shortage, but for voluntary page dropping by
> > _user_.
> >
>
> In some sense, yes. But it would also be related to memory shortage on fully
> packed machines.
>
No. _at this point_, this is just for freeing memory voluntarily before hitting
the limit, to gain performance. Anyway, this understanding is not affecting the
patch itself. And it needs to reclaim memory from the zone.
> >
> > If this memcg-kswapd drops pages from lower zones first, ah, ok, it's good
> > for
> > the system because memcg's pages should be on higher zone if we have free
> > memory.
> >
> > So, I think the reason for dma->highmem is different from global kswapd.
> >
>
> Yes. I agree that the logic of the dma->highmem ordering is not exactly the
> same between per-memcg kswapd and per-node kswapd. But the page allocation
> still happens from the other end, and this is still good for the system as
> you pointed out.
>
> >
> >
> >
> >
> > > +        for (i = 0; i < pgdat->nr_zones; i++) {
> > > +                struct zone *zone = pgdat->node_zones + i;
> > > +
> > > +                if (!populated_zone(zone))
> > > +                        continue;
> > > +
> > > +                sc->nr_scanned = 0;
> > > +                shrink_zone(priority, zone, sc);
> > > +                total_scanned += sc->nr_scanned;
> > > +
> > > +                /*
> > > +                 * If we've done a decent amount of scanning and
> > > +                 * the reclaim ratio is low, start doing writepage
> > > +                 * even in laptop mode
> > > +                 */
> > > +                if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > > +                    total_scanned > sc->nr_reclaimed + sc->nr_reclaimed / 2) {
> > > +                        sc->may_writepage = 1;
> > > +                }
> > > +        }
> > > +
> > > +        sc->nr_scanned = total_scanned;
> > > +        return;
> > > +}
> > > +
> > > +
> > > +/*
> > > + * Per cgroup background reclaim.
> > > + * TODO: Take off the order since memcg always do order 0
> > > + */
> > > +static unsigned long balance_mem_cgroup_pgdat(struct mem_cgroup *mem_cont,
> > > +                                               int order)
> > > +{
> > > +        int i, nid;
> > > +        int start_node;
> > > +        int priority;
> > > +        bool wmark_ok;
> > > +        int loop;
> > > +        pg_data_t *pgdat;
> > > +        nodemask_t do_nodes;
> > > +        unsigned long total_scanned;
> > > +        struct scan_control sc = {
> > > +                .gfp_mask = GFP_KERNEL,
> > > +                .may_unmap = 1,
> > > +                .may_swap = 1,
> > > +                .nr_to_reclaim = ULONG_MAX,
> > > +                .swappiness = vm_swappiness,
> > > +                .order = order,
> > > +                .mem_cgroup = mem_cont,
> > > +        };
> > > +
> > > +loop_again:
> > > +        do_nodes = NODE_MASK_NONE;
> > > +        sc.may_writepage = !laptop_mode;
> >
> > I think may_writepage should start from '0' always. We're not sure
> > the system is in memory shortage... we just want to release memory
> > voluntarily. writepage will add huge costs, I guess.
> >
> > For example,
> >         sc.may_writepage = !!loop
> > may be better for memcg.
> >
> > BTW, you set nr_to_reclaim to ULONG_MAX here and don't modify it later.
> >
> > I think you should add some logic to set it to a proper value.
> >
> > For example, before calling shrink_zone(),
> >
> >         sc->nr_to_reclaim = min(SWAP_CLUSTER_MAX,
> >                                 memcg_usage_in_this_zone() / 100); # 1% in this zone.
> >
> > if we love 'fair pressure for each zone'.
> >
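
Putting those two suggestions together, they might look roughly like this when
applied to the hunks quoted above. This is only a sketch: memcg_usage_in_this_zone()
is the reviewer's placeholder rather than an existing helper, and the min_t()
form and the exact placement are assumptions.

        /* In balance_mem_cgroup_pgdat(), where 'sc' is a local struct: */
loop_again:
        do_nodes = NODE_MASK_NONE;
        /*
         * No writeback on the first pass; allow it only once we have
         * looped (assuming 'loop' counts passes through loop_again).
         */
        sc.may_writepage = !!loop;

        /* In the per-node zone loop from the first hunk, where 'sc' is a pointer: */
        for (i = 0; i < pgdat->nr_zones; i++) {
                struct zone *zone = pgdat->node_zones + i;

                if (!populated_zone(zone))
                        continue;

                /*
                 * Bound the work per zone: at most one batch
                 * (SWAP_CLUSTER_MAX), or less when ~1% of this memcg's
                 * usage charged to this zone is smaller than a batch.
                 */
                sc->nr_to_reclaim = min_t(unsigned long, SWAP_CLUSTER_MAX,
                                          memcg_usage_in_this_zone() / 100);

                sc->nr_scanned = 0;
                shrink_zone(priority, zone, sc);
                total_scanned += sc->nr_scanned;
        }
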
>
> Hmm. I don't get it. Leaving nr_to_reclaim as ULONG_MAX in the kswapd
> case is intended to add equal memory pressure to each zone.
> So in shrink_zone(), we won't bail out in the following condition:
>
>        while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
>                                        nr[LRU_INACTIVE_FILE]) {
>
>                if (nr_reclaimed >= nr_to_reclaim && priority < DEF_PRIORITY)
>                        break;
>        }
>

memcg kswapd can visit other zones/nodes because it does not work for a single
zone/pgdat. Yes. So, by setting nr_to_reclaim to a proper value for a zone,
we can visit the next zone/node sooner. memcg's kswapd is not requested to
free memory from a particular node/zone. (But we'll need a hint for bias, later.)

By making nr_to_reclaim ULONG_MAX, to quit this loop we need to keep looping
until all nr[lru] reach 0. When memcg kswapd finds that memcg's usage is
difficult to reduce under the high_wmark, priority goes up dramatically and
we'll see a long loop in this zone if the zone is busy. memcg kswapd can
instead visit the next zone rather than loop more. Then we'll be able to
reduce cpu usage and contention caused by memcg_kswapd.

I think this do-more/skip-and-next logic will be a difficult issue and will
need to be maintained with long-term research. For now, I bet ULONG_MAX is
not the right choice. As the usual try_to_free_pages() does, SWAP_CLUSTER_MAX
will be enough. With that, we can visit the next node.
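
A minimal sketch of that change, reusing the scan_control initialisation from
the second hunk above (only the nr_to_reclaim line differs; whether a further
per-zone adjustment is still wanted is left open):

        struct scan_control sc = {
                .gfp_mask = GFP_KERNEL,
                .may_unmap = 1,
                .may_swap = 1,
                /*
                 * Reclaim in small batches so memcg kswapd can move on to
                 * the next zone/node instead of looping in one busy zone.
                 */
                .nr_to_reclaim = SWAP_CLUSTER_MAX,
                .swappiness = vm_swappiness,
                .order = order,
                .mem_cgroup = mem_cont,
        };
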
Fair enough, and that makes sense. I will make the change in the next post.

--Ying
Thanks,
-Kame