On Mon, 6 Dec 2010 18:25:55 -0800
Ying Han <yinghan@xxxxxxxxxx> wrote:

> On Mon, Nov 29, 2010 at 11:51 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> > On Mon, 29 Nov 2010 22:49:44 -0800
> > Ying Han <yinghan@xxxxxxxxxx> wrote:
> >
> >> The current implementation of memcg only supports direct reclaim; this
> >> patch adds support for background reclaim. Per cgroup background reclaim
> >> is needed because it spreads the memory pressure out over a longer period
> >> of time and smooths out system performance.
> >>
> >> There is a kswapd kernel thread for each memory node. We add a separate
> >> kswapd for each cgroup. The kswapd sleeps in the wait queue headed at the
> >> kswapd_wait field of a kswapd descriptor.
> >>
> >> The kswapd() function is now shared between the global and per cgroup
> >> kswapd threads. It is passed the kswapd descriptor, which contains the
> >> information for either the node or the cgroup. The new function
> >> balance_mem_cgroup_pgdat is then invoked if it is a per cgroup kswapd
> >> thread. balance_mem_cgroup_pgdat performs a priority loop similar to
> >> global reclaim. In each iteration it invokes balance_pgdat_node, a new
> >> function that performs background reclaim per node, for all nodes on the
> >> system. After reclaiming each node, it checks mem_cgroup_watermark_ok()
> >> and breaks the priority loop if it returns true. A per memcg zone will be
> >> marked "unreclaimable" if the scanning rate is much greater than the
> >> reclaiming rate on the per cgroup LRU. The bit is cleared when a page
> >> charged to the cgroup is freed. Kswapd breaks the priority loop if all
> >> the zones are marked "unreclaimable".
> >>
> >> Signed-off-by: Ying Han <yinghan@xxxxxxxxxx>
> >> ---
> >>  include/linux/memcontrol.h |   30 +++++++
> >>  mm/memcontrol.c            |  182 ++++++++++++++++++++++++++++++++++++++-
> >>  mm/page_alloc.c            |    2 +
> >>  mm/vmscan.c                |  205 +++++++++++++++++++++++++++++++++++++++++++-
> >>  4 files changed, 416 insertions(+), 3 deletions(-)
> >>
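(The mm/vmscan.c hunk is not quoted below; going by the changelog, the
priority loop looks roughly like the sketch here. This is only a sketch of
the described behaviour, not the patch code -- in particular
mem_cgroup_all_unreclaimable(), CHARGE_WMARK_HIGH and the
balance_pgdat_node() signature are placeholder guesses, not taken from the
patch:

static void balance_mem_cgroup_pgdat(struct mem_cgroup *mem,
                                     struct scan_control *sc)
{
        int priority, nid;

        for (priority = DEF_PRIORITY; priority >= 0; priority--) {
                /* background-reclaim every node on behalf of this memcg */
                for_each_online_node(nid)
                        balance_pgdat_node(NODE_DATA(nid), priority, sc);

                /* stop once the per-memcg watermarks are satisfied... */
                if (mem_cgroup_watermark_ok(mem, CHARGE_WMARK_HIGH))
                        break;

                /* ...or give up if every per-memcg zone is unreclaimable */
                if (mem_cgroup_all_unreclaimable(mem))
                        break;
        }
}
)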
> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >> index 90fe7fe..dbed45d 100644
> >> --- a/include/linux/memcontrol.h
> >> +++ b/include/linux/memcontrol.h
> >> @@ -127,6 +127,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >>                                                 gfp_t gfp_mask);
> >>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> >>
> >> +void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone);
> >> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid);
> >> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> >> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone);
> >> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> >> +                                 unsigned long nr_scanned);
> >>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> >>  struct mem_cgroup;
> >>
> >> @@ -299,6 +305,25 @@ static inline void mem_cgroup_update_file_mapped(struct page *page,
> >>  {
> >>  }
> >>
> >> +static inline void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem,
> >> +                                               struct zone *zone,
> >> +                                               unsigned long nr_scanned)
> >> +{
> >> +}
> >> +
> >> +static inline void mem_cgroup_clear_unreclaimable(struct page *page,
> >> +                                                  struct zone *zone)
> >> +{
> >> +}
> >> +static inline void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem,
> >> +                                                   struct zone *zone)
> >> +{
> >> +}
> >> +static inline bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem,
> >> +                                               struct zone *zone)
> >> +{
> >> +}
> >> +
> >>  static inline
> >>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >>                                              gfp_t gfp_mask)
> >> @@ -312,6 +337,11 @@ u64 mem_cgroup_get_limit(struct mem_cgroup *mem)
> >>         return 0;
> >>  }
> >>
> >> +static inline bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid,
> >> +                                               int zid)
> >> +{
> >> +       return false;
> >> +}
> >>  #endif /* CONFIG_CGROUP_MEM_CONT */
> >>
> >>  #endif /* _LINUX_MEMCONTROL_H */
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index a0c6ed9..1d39b65 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -48,6 +48,8 @@
> >>  #include <linux/page_cgroup.h>
> >>  #include <linux/cpu.h>
> >>  #include <linux/oom.h>
> >> +#include <linux/kthread.h>
> >> +
> >>  #include "internal.h"
> >>
> >>  #include <asm/uaccess.h>
> >> @@ -118,7 +120,10 @@ struct mem_cgroup_per_zone {
> >>         bool                    on_tree;
> >>         struct mem_cgroup       *mem;           /* Back pointer, we cannot */
> >>                                                 /* use container_of        */
> >> +       unsigned long           pages_scanned;  /* since last reclaim */
> >> +       int                     all_unreclaimable;      /* All pages pinned */
> >>  };
> >> +
> >>  /* Macro for accessing counter */
> >>  #define MEM_CGROUP_ZSTAT(mz, idx)       ((mz)->count[(idx)])
> >>
> >> @@ -372,6 +377,7 @@ static void mem_cgroup_put(struct mem_cgroup *mem);
> >>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> >>  static void drain_all_stock_async(void);
> >>  static unsigned long get_min_free_kbytes(struct mem_cgroup *mem);
> >> +static inline void wake_memcg_kswapd(struct mem_cgroup *mem);
> >>
> >>  static struct mem_cgroup_per_zone *
> >>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> >> @@ -1086,6 +1092,106 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page)
> >>         return &mz->reclaim_stat;
> >>  }
> >>
> >> +unsigned long mem_cgroup_zone_reclaimable_pages(
> >> +                                       struct mem_cgroup_per_zone *mz)
> >> +{
> >> +       int nr;
> >> +       nr = MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_FILE) +
> >> +               MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_FILE);
> >> +
> >> +       if (nr_swap_pages > 0)
> >> +               nr += MEM_CGROUP_ZSTAT(mz, LRU_ACTIVE_ANON) +
> >> +                       MEM_CGROUP_ZSTAT(mz, LRU_INACTIVE_ANON);
> >> +
> >> +       return nr;
> >> +}
> >> +
> >> +void mem_cgroup_mz_pages_scanned(struct mem_cgroup *mem, struct zone* zone,
> >> +                                 unsigned long nr_scanned)
> >> +{
> >> +       struct mem_cgroup_per_zone *mz = NULL;
> >> +       int nid = zone_to_nid(zone);
> >> +       int zid = zone_idx(zone);
> >> +
> >> +       if (!mem)
> >> +               return;
> >> +
> >> +       mz = mem_cgroup_zoneinfo(mem, nid, zid);
> >> +       if (mz)
> >> +               mz->pages_scanned += nr_scanned;
> >> +}
> >> +
> >> +bool mem_cgroup_zone_reclaimable(struct mem_cgroup *mem, int nid, int zid)
> >> +{
> >> +       struct mem_cgroup_per_zone *mz = NULL;
> >> +
> >> +       if (!mem)
> >> +               return 0;
> >> +
> >> +       mz = mem_cgroup_zoneinfo(mem, nid, zid);
> >> +       if (mz)
> >> +               return mz->pages_scanned <
> >> +                               mem_cgroup_zone_reclaimable_pages(mz) * 6;
> >> +       return 0;
> >> +}
> >> +
> >> +bool mem_cgroup_mz_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> >> +{
> >> +       struct mem_cgroup_per_zone *mz = NULL;
> >> +       int nid = zone_to_nid(zone);
> >> +       int zid = zone_idx(zone);
> >> +
> >> +       if (!mem)
> >> +               return 0;
> >> +
> >> +       mz = mem_cgroup_zoneinfo(mem, nid, zid);
> >> +       if (mz)
> >> +               return mz->all_unreclaimable;
> >> +
> >> +       return 0;
> >> +}
> >> +
> >> +void mem_cgroup_mz_set_unreclaimable(struct mem_cgroup *mem, struct zone *zone)
> >> +{
> >> +       struct mem_cgroup_per_zone *mz = NULL;
> >> +       int nid = zone_to_nid(zone);
> >> +       int zid = zone_idx(zone);
> >> +
> >> +       if (!mem)
> >> +               return;
> >> +
> >> +       mz = mem_cgroup_zoneinfo(mem, nid, zid);
> >> +       if (mz)
> >> +               mz->all_unreclaimable = 1;
> >> +}
> >> +
> >> +void mem_cgroup_clear_unreclaimable(struct page *page, struct zone *zone)
> >> +{
> >> +       struct mem_cgroup_per_zone *mz = NULL;
> >> +       struct mem_cgroup *mem = NULL;
> >> +       int nid = zone_to_nid(zone);
> >> +       int zid = zone_idx(zone);
> >> +       struct page_cgroup *pc = lookup_page_cgroup(page);
> >> +
> >> +       if (unlikely(!pc))
> >> +               return;
> >> +
> >> +       rcu_read_lock();
> >> +       mem = pc->mem_cgroup;
> >
> > This is incorrect. you have to do css_tryget(&mem->css) before rcu_read_unlock.
> >
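For illustration, the shape of that fix would be something like the sketch
below -- pin the memcg with css_tryget() while still under rcu_read_lock(),
and drop the reference with css_put() once the per-zone flags are cleared.
Just a sketch, not the actual v2 code:

        rcu_read_lock();
        mem = pc->mem_cgroup;
        if (!mem || !css_tryget(&mem->css)) {
                rcu_read_unlock();
                return;
        }
        rcu_read_unlock();

        mz = mem_cgroup_zoneinfo(mem, nid, zid);
        if (mz) {
                mz->pages_scanned = 0;
                mz->all_unreclaimable = 0;
        }
        css_put(&mem->css);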
> >> +       rcu_read_unlock();
> >> +
> >> +       if (!mem)
> >> +               return;
> >> +
> >> +       mz = mem_cgroup_zoneinfo(mem, nid, zid);
> >> +       if (mz) {
> >> +               mz->pages_scanned = 0;
> >> +               mz->all_unreclaimable = 0;
> >> +       }
> >> +
> >> +       return;
> >> +}
> >> +
> >>  unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> >>                                         struct list_head *dst,
> >>                                         unsigned long *scanned, int order,
> >> @@ -1887,6 +1993,20 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> >>         struct res_counter *fail_res;
> >>         unsigned long flags = 0;
> >>         int ret;
> >> +       unsigned long min_free_kbytes = 0;
> >> +
> >> +       min_free_kbytes = get_min_free_kbytes(mem);
> >> +       if (min_free_kbytes) {
> >> +               ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_LOW,
> >> +                                        &fail_res);
> >> +               if (likely(!ret)) {
> >> +                       return CHARGE_OK;
> >> +               } else {
> >> +                       mem_over_limit = mem_cgroup_from_res_counter(fail_res,
> >> +                                                                    res);
> >> +                       wake_memcg_kswapd(mem_over_limit);
> >> +               }
> >> +       }
> >
> > I think this check can be moved out to a periodic check, like the threshold notifiers.
>
> Yes. This will be changed in V2.
>
> >
> >
> >>
> >>         ret = res_counter_charge(&mem->res, csize, CHARGE_WMARK_MIN, &fail_res);
> >>
> >> @@ -3037,6 +3157,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> >>                         else
> >>                                 memcg->memsw_is_minimum = false;
> >>                 }
> >> +               setup_per_memcg_wmarks(memcg);
> >>                 mutex_unlock(&set_limit_mutex);
> >>
> >>                 if (!ret)
> >> @@ -3046,7 +3167,7 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> >>                                                 MEM_CGROUP_RECLAIM_SHRINK);
> >>                 curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
> >>                 /* Usage is reduced ? */
> >> -               if (curusage >= oldusage)
> >> +               if (curusage >= oldusage)
> >>                         retry_count--;
> >>                 else
> >>                         oldusage = curusage;
> >
> > What's changed here ?
> >
> >> @@ -3096,6 +3217,7 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
> >>                         else
> >>                                 memcg->memsw_is_minimum = false;
> >>                 }
> >> +               setup_per_memcg_wmarks(memcg);
> >>                 mutex_unlock(&set_limit_mutex);
> >>
> >>                 if (!ret)
> >> @@ -4352,6 +4474,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> >>  static void __mem_cgroup_free(struct mem_cgroup *mem)
> >>  {
> >>         int node;
> >> +       struct kswapd *kswapd_p;
> >> +       wait_queue_head_t *wait;
> >>
> >>         mem_cgroup_remove_from_trees(mem);
> >>         free_css_id(&mem_cgroup_subsys, &mem->css);
> >> @@ -4360,6 +4484,15 @@ static void __mem_cgroup_free(struct mem_cgroup *mem)
> >>                 free_mem_cgroup_per_zone_info(mem, node);
> >>
> >>         free_percpu(mem->stat);
> >> +
> >> +       wait = mem->kswapd_wait;
> >> +       kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> >> +       if (kswapd_p) {
> >> +               if (kswapd_p->kswapd_task)
> >> +                       kthread_stop(kswapd_p->kswapd_task);
> >> +               kfree(kswapd_p);
> >> +       }
> >> +
> >>         if (sizeof(struct mem_cgroup) < PAGE_SIZE)
> >>                 kfree(mem);
> >>         else
> >> @@ -4421,6 +4554,39 @@ int mem_cgroup_watermark_ok(struct mem_cgroup *mem,
> >>         return ret;
> >>  }
> >>
> >> +static inline
> >> +void wake_memcg_kswapd(struct mem_cgroup *mem)
> >> +{
> >> +       wait_queue_head_t *wait;
> >> +       struct kswapd *kswapd_p;
> >> +       struct task_struct *thr;
> >> +       static char memcg_name[PATH_MAX];
> >> +
> >> +       if (!mem)
> >> +               return;
> >> +
> >> +       wait = mem->kswapd_wait;
> >> +       kswapd_p = container_of(wait, struct kswapd, kswapd_wait);
> >> +       if (!kswapd_p->kswapd_task) {
> >> +               if (mem->css.cgroup)
> >> +                       cgroup_path(mem->css.cgroup, memcg_name, PATH_MAX);
> >> +               else
> >> +                       sprintf(memcg_name, "no_name");
> >> +
> >> +               thr = kthread_run(kswapd, kswapd_p, "kswapd%s", memcg_name);
> >
> > I don't think reusing the name "kswapd" is good, and this name cannot be
> > as long as PATH_MAX... IIUC, this name goes into the comm[] field, which
> > is 16 bytes long.
> >
> > So, how about naming this as
> >
> >  "memcg%d", mem->css.id ?
> >
> > Exporting css.id will be okay if necessary.
> >
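With that naming, the kthread_run() call above would become something like
the line below (a sketch only; css_id() is the existing accessor for the
css id, and a short "memcg%d" name always fits in the 16-byte comm[]):

                /* short name, always fits in TASK_COMM_LEN (16) */
                thr = kthread_run(kswapd, kswapd_p, "memcg%d",
                                  css_id(&mem->css));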
> I am not sure that will work, since mem->css hasn't been initialized yet
> during mem_cgroup_create(). That is one of the reasons I put the kswapd
> creation at the point where the watermark triggers instead of at cgroup
> creation, since all that information is ready by then.
>
> However, I agree that doing it at cgroup creation is better from a
> performance perspective, since we won't add the overhead to the page
> allocation path (even though it is only for the first watermark trigger).
> Any suggestion?
>

Hmm, my recommendation is to start the thread when the limit is set.

Thanks,
-Kame
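P.S. A rough sketch of what "start the thread when the limit is set" could
look like, for illustration only -- the helper name memcg_kswapd_run() and
the kswapd_mem back-pointer field are assumptions, the other names come
from the patch above. It would be called from the limit-setting path (e.g.
mem_cgroup_resize_limit()), where mem->css is fully set up, instead of
lazily on the first watermark breach:

static int memcg_kswapd_run(struct mem_cgroup *mem)
{
        struct kswapd *kswapd_p = kzalloc(sizeof(*kswapd_p), GFP_KERNEL);

        if (!kswapd_p)
                return -ENOMEM;

        init_waitqueue_head(&kswapd_p->kswapd_wait);
        mem->kswapd_wait = &kswapd_p->kswapd_wait;
        kswapd_p->kswapd_mem = mem;     /* assumed back pointer to the memcg */

        /* css_id() is valid here because the cgroup already exists */
        kswapd_p->kswapd_task = kthread_run(kswapd, kswapd_p, "memcg%d",
                                            css_id(&mem->css));
        if (IS_ERR(kswapd_p->kswapd_task)) {
                int err = PTR_ERR(kswapd_p->kswapd_task);

                kswapd_p->kswapd_task = NULL;
                kfree(kswapd_p);
                return err;
        }
        return 0;
}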