On Tue, Apr 19, 2011 at 3:09 AM, Ying Han <yinghan@xxxxxxxxxx> wrote:
> On Sun, Apr 17, 2011 at 5:57 PM, Minchan Kim <minchan.kim@xxxxxxxxx> wrote:
>> Hi Ying,
>>
>> I have some comments and nitpicks about coding style.
>
> Hi Minchan, thank you for your comments and reviews.
>
>> On Sat, Apr 16, 2011 at 8:23 AM, Ying Han <yinghan@xxxxxxxxxx> wrote:
>> > There is a kswapd kernel thread for each numa node. We will add a
>> > different kswapd for each memcg. The kswapd is sleeping in the wait
>> > queue headed at
>>
>> Why?
>>
>> Many kernel developers will easily raise an eyebrow at adding kernel
>> threads, so you should justify why we need a new kernel thread and why
>> we can't handle it with a workqueue.
>>
>> Maybe you explained it already and I missed it. If so, sorry.
>> But at the least, a patch description that includes the _why_ is much
>> more mergeable for maintainers and helpful to reviewers of the code.
>
> Here are the replies I posted on an earlier version regarding workqueues:
> "
> I did some study on workqueues after posting V2. There was a comment
> suggesting a workqueue instead of a per-memcg kswapd thread, since that
> would cut the number of kernel threads created on a host with lots of
> cgroups. Each kernel thread allocates about 8K of stack, so 8M in total
> with a thousand cgroups.
>
> The current workqueue model merged in the 2.6.36 kernel is called
> "concurrency managed workqueue" (cmwq), which is intended to provide
> flexible concurrency without wasting resources. I studied it a bit and
> here is what I found:
>
> 1. The workqueue is complicated and we need to be very careful with work
> items in the workqueue. We have seen one stuck work item keep the rest of
> the work items from proceeding. For example, in dirty page writeback, one
> heavy-writer cgroup could starve the other cgroups flushing dirty pages
> to the same disk. In the kswapd case, I can imagine a similar scenario.
>
> 2. How to prioritize the work items is another problem.
> The order of adding work items to the queue determines the order in which
> cgroups get reclaimed. We don't have that restriction currently; instead
> we rely on the cpu scheduler to put kswapd on the right cpu core to run.
> We might introduce priorities for reclaim later, and how are we going to
> deal with that?
>
> 3. Based on what I have observed, not many callers have migrated to cmwq
> yet, and I don't have much data on how good it is.
>
> Back to the current model: on a machine with a thousand cgroups, the
> kswapd threads will take 8M in total (8K of stack for each thread). We
> are running systems with fakenuma, where each numa node has a kswapd, and
> so far we haven't noticed issues caused by "lots of" kswapd threads.
> Also, there shouldn't be any performance overhead for a kernel thread
> that is not running.
>
> Given the complexity of workqueues and the benefit they provide, I would
> like to stick with the current model first. After we get the basic stuff
> in and the other targeted reclaim improvements, we can come back to this.
> What do you think?

Thanks for the good summary. I should study cmwq, too, but I don't mean we
have to use only workqueues. The problem with a flood of memcg-kswapds is
the lock and scheduling overhead and the pid consumption, as well as the
memory consumption, and it is not good at keeping each kswapd busy. I
don't think we have to keep as many kswapds as there are memcgs. We could
keep a min/max pool of memcg-kswapd threads like the old pdflush: when
memcg watermark pressure is high and the memcg-kswapds are busy, a new
memcg-kswapd pops up to handle it; when memcg watermark pressure is low
and a memcg-kswapd is idle, the old memcg-kswapd exits.

> "
> KAMEZAWA's reply:
> "
> Okay, fair enough. kthread_run() will win.
>
> Then, I have another request. I'd like to put the kswapd-for-memcg into
> some cpu cgroup to limit its cpu usage.
>
> - Could you show the thread ID somewhere? And
>   confirm we can put it into some cpu cgroup?
>   (Creating an auto cpu cgroup for the memcg kswapd is a choice, I think.)
> " >> >> > kswapd_wait field of a kswapd descriptor. The kswapd descriptor stores >> > information of node or memcg and it allows the global and per-memcg >> > background >> > reclaim to share common reclaim algorithms. >> > >> > This patch adds the kswapd descriptor and moves the per-node kswapd to >> > use the >> > new structure. >> > >> > changelog v5..v4: >> > 1. add comment on kswapds_spinlock >> > 2. remove the kswapds_spinlock. we don't need it here since the kswapd >> > and pgdat >> > have 1:1 mapping. >> > >> > changelog v3..v2: >> > 1. move the struct mem_cgroup *kswapd_mem in kswapd sruct to later >> > patch. >> > 2. rename thr in kswapd_run to something else. >> > >> > changelog v2..v1: >> > 1. dynamic allocate kswapd descriptor and initialize the wait_queue_head >> > of pgdat >> > at kswapd_run. >> > 2. add helper macro is_node_kswapd to distinguish per-node/per-cgroup >> > kswapd >> > descriptor. >> > >> > Signed-off-by: Ying Han <yinghan@xxxxxxxxxx> >> > --- >> > Âinclude/linux/mmzone.h |  Â3 +- >> > Âinclude/linux/swap.h  |  Â7 ++++ >> > Âmm/page_alloc.c    Â|  Â1 - >> > Âmm/vmscan.c      Â|  89 >> > +++++++++++++++++++++++++++++++++++------------ >> > Â4 files changed, 74 insertions(+), 26 deletions(-) >> > >> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> > index 628f07b..6cba7d2 100644 >> > --- a/include/linux/mmzone.h >> > +++ b/include/linux/mmzone.h >> > @@ -640,8 +640,7 @@ typedef struct pglist_data { >> >    Âunsigned long node_spanned_pages; /* total size of physical page >> >                       range, including holes */ >> >    Âint node_id; >> > -    wait_queue_head_t kswapd_wait; >> > -    struct task_struct *kswapd; >> > +    wait_queue_head_t *kswapd_wait; >> >> Personally, I prefer kswapd not kswapd_wait. >> It's more readable and straightforward. > > hmm. I would like to keep as it is for this version, and improve it after > the basic stuff are in. Hope that works for you? No problem. 
>>
>> >     int kswapd_max_order;
>> >     enum zone_type classzone_idx;
>> > } pg_data_t;
>> > diff --git a/include/linux/swap.h b/include/linux/swap.h
>> > index ed6ebe6..f43d406 100644
>> > --- a/include/linux/swap.h
>> > +++ b/include/linux/swap.h
>> > @@ -26,6 +26,13 @@ static inline int current_is_kswapd(void)
>> >     return current->flags & PF_KSWAPD;
>> >  }
>> >
>> > +struct kswapd {
>> > +   struct task_struct *kswapd_task;
>> > +   wait_queue_head_t kswapd_wait;
>> > +   pg_data_t *kswapd_pgdat;
>> > +};
>> > +
>> > +int kswapd(void *p);
>> >  /*
>> >   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>> >   * be swapped to.  The swap type and the offset into that swap type are
>> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > index 6e1b52a..6340865 100644
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -4205,7 +4205,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>> >
>> >     pgdat_resize_init(pgdat);
>> >     pgdat->nr_zones = 0;
>> > -   init_waitqueue_head(&pgdat->kswapd_wait);
>> >     pgdat->kswapd_max_order = 0;
>> >     pgdat_page_cgroup_init(pgdat);
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 060e4c1..61fb96e 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -2242,12 +2242,13 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
>> >  }
>> >
>> >  /* is kswapd sleeping prematurely?
>> >  */
>> > -static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> > -                                int classzone_idx)
>> > +static int sleeping_prematurely(struct kswapd *kswapd, int order,
>> > +                               long remaining, int classzone_idx)
>> >  {
>> >     int i;
>> >     unsigned long balanced = 0;
>> >     bool all_zones_ok = true;
>> > +   pg_data_t *pgdat = kswapd->kswapd_pgdat;
>> >
>> >     /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>> >     if (remaining)
>> > @@ -2570,28 +2571,31 @@ out:
>> >     return order;
>> >  }
>> >
>> > -static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
>> > +static void kswapd_try_to_sleep(struct kswapd *kswapd_p, int order,
>> > +                               int classzone_idx)
>> >  {
>> >     long remaining = 0;
>> >     DEFINE_WAIT(wait);
>> > +   pg_data_t *pgdat = kswapd_p->kswapd_pgdat;
>> > +   wait_queue_head_t *wait_h = &kswapd_p->kswapd_wait;
>>
>> kswapd_p? Does the p mean pointer?
>
> yes,
>
>> wait_h? Does the h mean header?
>
> yes,
>
>> Hmm. Of course, it's trivial and we can understand it easily in such
>> context, but we haven't used such names before, so it reads rather
>> awkwardly to me.
>>
>> How about kswapd instead of kswapd_p, and kswapd_wait instead of wait_h?
>
> That sounds ok to me. However, I would like to make that change as a
> separate patch after the basic stuff is in. Is that ok?

Sure. It's not critical to merging the series. :)

--
Kind regards,
Minchan Kim