KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> writes: > Fixed one here. > == > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> > > Now, at calculating dirty limit, vm_dirty_param() is called. > This function returns dirty-limit related parameters considering > memory cgroup settings. > > Now, assume that vm_dirty_bytes=100M (global dirty limit) and > memory cgroup has 1G of pages and 40 dirty_ratio, dirtyable memory is > 500MB. > > In this case, global_dirty_limits will consider dirty_limt as > 500 *0.4 = 200MB. This is bad...memory cgroup is not back door. > > This patch limits the return value of vm_dirty_param() considring > global settings. > > Changelog: > - fixed an argument "mem" int to u64 > - fixed to use global available memory to cap memcg's value. > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> > --- > include/linux/memcontrol.h | 5 +++-- > mm/memcontrol.c | 30 +++++++++++++++++++++++++++++- > mm/page-writeback.c | 3 ++- > 3 files changed, 34 insertions(+), 4 deletions(-) > > Index: dirty_limit_new/mm/memcontrol.c > =================================================================== > --- dirty_limit_new.orig/mm/memcontrol.c > +++ dirty_limit_new/mm/memcontrol.c > @@ -1171,9 +1171,11 @@ static void __mem_cgroup_dirty_param(str > * can be moved after our access and writeback tends to take long time. At > * least, "memcg" will not be freed while holding rcu_read_lock(). > */ > -void vm_dirty_param(struct vm_dirty_param *param) > +void vm_dirty_param(struct vm_dirty_param *param, > + u64 mem, u64 global) > { > struct mem_cgroup *memcg; > + u64 limit, bglimit; > > if (mem_cgroup_disabled()) { > global_vm_dirty_param(param); > @@ -1183,6 +1185,32 @@ void vm_dirty_param(struct vm_dirty_para > rcu_read_lock(); > memcg = mem_cgroup_from_task(current); > __mem_cgroup_dirty_param(param, memcg); > + /* > + * A limitation under memory cgroup is under global vm, too. > + */ > + if (vm_dirty_ratio) > + limit = global * vm_dirty_ratio / 100; > + else > + limit = vm_dirty_bytes; > + if (param->dirty_ratio) { > + param->dirty_bytes = mem * param->dirty_ratio / 100; > + param->dirty_ratio = 0; > + } > + if (param->dirty_bytes > limit) > + param->dirty_bytes = limit; > + > + if (dirty_background_ratio) > + bglimit = global * dirty_background_ratio / 100; > + else > + bglimit = dirty_background_bytes; > + > + if (param->dirty_background_ratio) { > + param->dirty_background_bytes = > + mem * param->dirty_background_ratio / 100; > + param->dirty_background_ratio = 0; > + } > + if (param->dirty_background_bytes > bglimit) > + param->dirty_background_bytes = bglimit; > rcu_read_unlock(); > } > > Index: dirty_limit_new/include/linux/memcontrol.h > =================================================================== > --- dirty_limit_new.orig/include/linux/memcontrol.h > +++ dirty_limit_new/include/linux/memcontrol.h > @@ -171,7 +171,7 @@ static inline void mem_cgroup_dec_page_s > } > > bool mem_cgroup_has_dirty_limit(void); > -void vm_dirty_param(struct vm_dirty_param *param); > +void vm_dirty_param(struct vm_dirty_param *param, u64 mem, u64 global); > s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item); > > unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order, > @@ -360,7 +360,8 @@ static inline bool mem_cgroup_has_dirty_ > return false; > } > > -static inline void vm_dirty_param(struct vm_dirty_param *param) > +static inline void vm_dirty_param(struct vm_dirty_param *param, > + u64 mem, u64 global) > { > global_vm_dirty_param(param); > } > Index: dirty_limit_new/mm/page-writeback.c > =================================================================== > --- dirty_limit_new.orig/mm/page-writeback.c > +++ dirty_limit_new/mm/page-writeback.c > @@ -466,7 +466,8 @@ void global_dirty_limits(unsigned long * > struct task_struct *tsk; > struct vm_dirty_param dirty_param; > > - vm_dirty_param(&dirty_param); > + vm_dirty_param(&dirty_param, > + available_memory, global_dirtyable_memory()); > > if (dirty_param.dirty_bytes) > dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE); I think there is a problem with the patch above. In the patch vm_dirty_param() sets param->dirty_[background_]bytes to the smallest limits considering the memcg and global limits. Assuming the current task is in a memcg, then the memcg dirty (not system-wide) usage is always compared to the selected limits (which may be per-memcg or system). The problem is that if: a) per-memcg dirty limit is smaller than system then vm_dirty_param() will select per-memcg dirty limit, and b) per-memcg dirty usage is well below memcg dirty limit, and b) system usage is at system limit Then the above patch will not trigger writeback. Example with two memcg: sys B C limit usage sys 10 10 B 7 6 C 5 4 If B wants to grow, the system will exceed system limit of 10 and should be throttled. However, the smaller limit (7) will be selected and applied to memcg usage (6), which indicates no need to throttle, so the system could get as bad as: limit usage sys 10 12 B 7 7 C 5 5 In this case the system usage exceeds the system limit because each the per-memcg checks see no per-memcg problems. To solve this I propose we create a new structure to aggregate both dirty limit and usage data: struct dirty_limits { unsigned long dirty_thresh; unsigned long background_thresh; unsigned long nr_reclaimable; unsigned long nr_writeback; }; global_dirty_limits() would then query both the global and memcg limits and dirty usage of one that is closest to its limit. This change makes global_dirty_limits() look like: void global_dirty_limits(struct dirty_limits *limits) { unsigned long background; unsigned long dirty; unsigned long nr_reclaimable; unsigned long nr_writeback; unsigned long available_memory = determine_dirtyable_memory(); struct task_struct *tsk; if (vm_dirty_bytes) dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE); else dirty = (vm_dirty_ratio * available_memory) / 100; if (dirty_background_bytes) background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE); else background = (dirty_background_ratio * available_memory) / 100; nr_reclaimable = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); nr_writeback = global_page_state(NR_WRITEBACK); if (mem_cgroup_dirty_limits(available_memory, limits) && dirty_available(limits->dirty_thresh, limits->nr_reclaimable, limits->nr_writeback) < dirty_available(dirty, nr_reclaimable, nr_writeback)) { dirty = min(dirty, limits->dirty_thresh); background = min(background, limits->background_thresh); } else { limits->nr_reclaimable = nr_reclaimable; limits->nr_writeback = nr_writeback; } if (background >= dirty) background = dirty / 2; tsk = current; if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { background += background / 4; dirty += dirty / 4; } limits->background_thresh = background; limits->dirty_thresh = dirty; } Because this approach considered both memcg and system limits, the problem described above is avoided. I have this change integrated into the memcg dirty limit series (-v3 was the last post; v4 is almost ready with this change). I will post -v4 with this approach is there is no strong objection. -- Greg -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>