On Fri, 28 Jan 2011 12:28:32 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:

>
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>
> When using khugepaged with a small memory cgroup, we see khugepaged
> cause a soft lockup, or a running process under the memcg will hang.
>
> It's because khugepaged tries to scan all pmds of a process
> which is under a busy/small memory cgroup and tries to allocate
> HUGEPAGE size resource.
>
> This work is done under mmap_sem and can cause memory reclaim
> repeatedly. This will easily raise cpu usage of khugepaged, and latency
> of the scanned process will go up. Moreover, it seems successfully
> working TransHuge pages may be split by this memory reclaim
> caused by khugepaged.
>
> This patch adds a hint for khugepaged whether a process is
> under a memory cgroup which has sufficient memory. If memcg
> seems busy, a process is skipped.
>
> How to test:
> # mount -t cgroup cgroup /cgroup/memory -o memory
> # mkdir /cgroup/memory/A
> # echo 200M (or some small) > /cgroup/memory/A/memory.limit_in_bytes
> # echo 0 > /cgroup/memory/A/tasks
> # make -j 8 kernel
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> ---
>  include/linux/memcontrol.h |    7 +++++
>  mm/huge_memory.c           |   10 +++++++-
>  mm/memcontrol.c            |   53 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 69 insertions(+), 1 deletion(-)
>
> Index: mmotm-0125/mm/memcontrol.c
> ===================================================================
> --- mmotm-0125.orig/mm/memcontrol.c
> +++ mmotm-0125/mm/memcontrol.c
> @@ -255,6 +255,9 @@ struct mem_cgroup {
>  	/* For oom notifier event fd */
>  	struct list_head oom_notify;
>
> +	/* For transparent hugepage daemon */
> +	unsigned long long recent_failcnt;
> +
>  	/*
>  	 * Should we move charges of a task when a task is moved into this
>  	 * mem_cgroup ? And what type of charges should we move ?
> @@ -2214,6 +2217,56 @@ void mem_cgroup_split_huge_fixup(struct
>  	tail_pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
>  	move_unlock_page_cgroup(head_pc, &flags);
>  }
> +
> +bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
> +{
> +	struct mem_cgroup *mem;
> +	bool ret = true;
> +	u64 recent_charge_fail;
> +
> +	if (mem_cgroup_disabled())
> +		return true;
> +
> +	mem = try_get_mem_cgroup_from_mm(mm);
> +
> +	if (!mem)
> +		return true;
> +
> +	if (mem_cgroup_is_root(mem))
> +		goto out;
> +
> +	/*
> +	 * At collapsing, khugepaged charges HPAGE_SIZE. When it unmaps
> +	 * used ptes, the charge will be decreased.
> +	 *
> +	 * This requirement of 'extra charge' at collapsing seems redundant,
> +	 * but it's a safe way for now. For example, at replacing a chunk of
> +	 * pages to be a hugepage, khugepaged skips pte_none() entries,
> +	 * which are not charged. But as the charge would have to be done under
> +	 * spinlocks such as pte_lock, we need precharge. Check status before
> +	 * doing heavy jobs and give khugepaged a chance to retire early.
> +	 */
> +	if (mem_cgroup_check_margin(mem) >= HPAGE_SIZE)

I'm sorry if I misunderstand, but shouldn't it be "<"?

Thanks,
Daisuke Nishimura.

> +		ret = false;
> +
> +	/*
> +	 * This is an easy check. If someone other than khugepaged does
> +	 * hit limit, khugepaged should avoid more pressure.
> +	 */
> +	recent_charge_fail = res_counter_read_u64(&mem->res, RES_FAILCNT);
> +	if (ret
> +	    && mem->recent_failcnt
> +	    && recent_charge_fail > mem->recent_failcnt) {
> +		ret = false;
> +	}
> +	/* because this thread will fail charge by itself +1.*/
> +	if (recent_charge_fail)
> +		mem->recent_failcnt = recent_charge_fail + 1;
> +out:
> +	css_put(&mem->css);
> +	return ret;
> +}
> +
>  #endif
>
>  /**
> Index: mmotm-0125/mm/huge_memory.c
> ===================================================================
> --- mmotm-0125.orig/mm/huge_memory.c
> +++ mmotm-0125/mm/huge_memory.c
> @@ -2011,8 +2011,10 @@ static unsigned int khugepaged_scan_mm_s
>  	down_read(&mm->mmap_sem);
>  	if (unlikely(khugepaged_test_exit(mm)))
>  		vma = NULL;
> -	else
> +	else if (mem_cgroup_worth_try_hugepage_scan(mm))
>  		vma = find_vma(mm, khugepaged_scan.address);
> +	else
> +		vma = NULL;
>
>  	progress++;
>  	for (; vma; vma = vma->vm_next) {
> @@ -2024,6 +2026,12 @@ static unsigned int khugepaged_scan_mm_s
>  			break;
>  		}
>
> +		if (unlikely(!mem_cgroup_worth_try_hugepage_scan(mm))) {
> +			progress++;
> +			vma = NULL; /* try next mm */
> +			break;
> +		}
> +
>  		if ((!(vma->vm_flags & VM_HUGEPAGE) &&
>  		     !khugepaged_always()) ||
>  		    (vma->vm_flags & VM_NOHUGEPAGE)) {
> Index: mmotm-0125/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-0125.orig/include/linux/memcontrol.h
> +++ mmotm-0125/include/linux/memcontrol.h
> @@ -148,6 +148,7 @@ u64 mem_cgroup_get_limit(struct mem_cgro
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
> +bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm);
>  #endif
>
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> @@ -342,6 +343,12 @@ u64 mem_cgroup_get_limit(struct mem_cgro
>  static inline void mem_cgroup_split_huge_fixup(struct page *head,
>  						struct page *tail)
>  {
> +
> +}
> +
> +static inline bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
> +{
> +	return true;
>  }
>
>  #endif /* CONFIG_CGROUP_MEM_CONT */
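
To make the question about the comparison concrete, here is a minimal userspace sketch of the check, not the kernel code itself. It assumes that mem_cgroup_check_margin() returns the number of bytes still chargeable before the limit is hit, and that a huge page is 2 MiB. The names memcg_model, margin_bytes, worth_try_hugepage_scan and HPAGE_SIZE_MODEL are hypothetical stand-ins introduced only for this illustration. Under those assumptions, a scan should be skipped when the margin is smaller than one huge page, which is the "<" reading.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2 MiB huge pages (the x86-64 value); hypothetical constant. */
#define HPAGE_SIZE_MODEL (2ULL << 20)

/* Hypothetical stand-in for the few memcg fields the check looks at. */
struct memcg_model {
	uint64_t limit;          /* memory.limit_in_bytes */
	uint64_t usage;          /* currently charged bytes */
	uint64_t failcnt;        /* number of failed charges so far */
	uint64_t recent_failcnt; /* failcnt remembered from the last scan */
};

/* Assumed meaning of the margin: bytes left before the limit is reached. */
static uint64_t margin_bytes(const struct memcg_model *m)
{
	return m->limit > m->usage ? m->limit - m->usage : 0;
}

/* Returns true when a khugepaged-style scan looks worth trying. */
static bool worth_try_hugepage_scan(struct memcg_model *m)
{
	bool ret = true;

	assert(m != NULL);

	/* Not enough room to precharge one huge page: skip ("<", not ">="). */
	if (margin_bytes(m) < HPAGE_SIZE_MODEL)
		ret = false;

	/* Someone else hit the limit since the last look: avoid more pressure. */
	if (ret && m->recent_failcnt && m->failcnt > m->recent_failcnt)
		ret = false;

	/* Remember failcnt + 1, allowing for one failure caused by the scanner. */
	if (m->failcnt)
		m->recent_failcnt = m->failcnt + 1;

	return ret;
}
```

With a 200M limit and 199M in use, the margin is 1 MiB and the sketch reports that the scan is not worth trying; with 100M in use and no new charge failures, it reports that it is. With ">=" in the first test, those two results would be swapped.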