Re: [PATCH V7 4/9] Add memcg kswapd thread pool

Hi Kame,

2011/4/22 KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
On Thu, 21 Apr 2011 21:49:04 -0700
Ying Han <yinghan@xxxxxxxxxx> wrote:

> On Thu, Apr 21, 2011 at 9:36 PM, KAMEZAWA Hiroyuki <
> kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>
> > On Thu, 21 Apr 2011 21:24:15 -0700
> > Ying Han <yinghan@xxxxxxxxxx> wrote:
> >
> > > This patch creates a thread pool for memcg-kswapd. All memcgs which need
> > > background reclaim are linked to a list, and memcg-kswapd picks up a memcg
> > > from the list and runs reclaim.
> > >
> > > The concern with using per-memcg-kswapd threads is the system overhead,
> > > including memory and CPU time.
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> > > Signed-off-by: Ying Han <yinghan@xxxxxxxxxx>
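
A minimal userspace C model of the pattern the changelog describes -- a shared
list of memcgs that need background reclaim, drained by a small pool of worker
threads. This is only a sketch: the struct and function names here are invented
for illustration and are not the patch's actual kernel symbols.

/* Build with: gcc -pthread pool.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct memcg {
	int id;
	struct memcg *next;	/* link in the wakeup list */
};

static struct memcg *wakeup_list;	/* memcgs waiting for reclaim */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  have_work = PTHREAD_COND_INITIALIZER;

/* Queue a memcg for background reclaim and kick the pool. */
static void queue_reclaim(struct memcg *m)
{
	pthread_mutex_lock(&list_lock);
	m->next = wakeup_list;
	wakeup_list = m;
	pthread_cond_signal(&have_work);
	pthread_mutex_unlock(&list_lock);
}

/* A pool thread: picks *any* queued memcg; workers are shared, not per-memcg. */
static void *memcg_kswapd(void *arg)
{
	(void)arg;
	for (;;) {
		struct memcg *m;

		pthread_mutex_lock(&list_lock);
		while (!wakeup_list)
			pthread_cond_wait(&have_work, &list_lock);
		m = wakeup_list;
		wakeup_list = m->next;
		pthread_mutex_unlock(&list_lock);

		printf("reclaiming memcg %d\n", m->id);	/* stand-in for reclaim */
	}
	return NULL;
}

int main(void)
{
	pthread_t pool[2];
	struct memcg cgs[4];

	for (int i = 0; i < 2; i++)
		pthread_create(&pool[i], NULL, memcg_kswapd, NULL);
	for (int i = 0; i < 4; i++) {
		cgs[i].id = i;
		queue_reclaim(&cgs[i]);
	}
	sleep(1);	/* let the pool drain the list, then exit */
	return 0;
}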
> >
> > Thank you for merging. This seems ok to me.
> >
> > Further development may make this better or replace the thread pool (with
> > something else), but I think this is good enough.
> >
>
> Thank you for reviewing and acking. At the same time, I still have some doubts
> about the thread-pool model, which I posted in the cover letter :)
>
> The per-memcg-per-kswapd model
> Pros:
> 1. memory overhead per thread: the memory consumption would be 8k * 1000 = 8M
> with 1k cgroups.
> 2. we see lots of threads in 'ps -elf'
>
> Cons:
> 1. the implementation is simple and straightforward.
> 2. we can easily isolate the background reclaim overhead between cgroups.
> 3. better latency from memory pressure to actually starting reclaim
>
> The thread-pool model
> Pros:
> 1. there is no isolation between memcg background reclaims, since the memcg
> threads are shared.
> 2. it is hard for visibility and debuggability. I have experienced cases where
> some kswapd threads run wild and we need a straightforward way to identify
> which cgroup is causing the reclaim.
> 3. potential starvation for some memcgs, if one work item gets stuck and the
> rest of the work can't proceed.
>
> Cons:
> 1. saves some memory resources.
>
> Cons:
> 1. save some memory resource.
>
> In general, the per-memcg-per-kswapd implementation looks sane to me at this
> point, especially since the shared memcg thread model will make debugging
> issues very hard later.
>
> Comments?
>
Pros <-> Cons ?

My idea is adding tracepoints for memcg-kswapd and seeing what it's doing now.
(We have very few tracepoints in memcg...)
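
For illustration, such a tracepoint could be defined with the kernel's
TRACE_EVENT() machinery (such definitions normally live in
include/trace/events/vmscan.h); the event name and fields below are invented
for this sketch and are not part of the posted series:

/*
 * Illustrative only: a tracepoint of the kind suggested above.  The
 * event name and fields are made up, not taken from the patch.
 */
TRACE_EVENT(mm_memcg_kswapd_reclaim,

	TP_PROTO(unsigned short css_id, unsigned long nr_reclaimed),

	TP_ARGS(css_id, nr_reclaimed),

	TP_STRUCT__entry(
		__field(unsigned short, css_id)
		__field(unsigned long, nr_reclaimed)
	),

	TP_fast_assign(
		__entry->css_id = css_id;
		__entry->nr_reclaimed = nr_reclaimed;
	),

	TP_printk("css_id=%u nr_reclaimed=%lu",
		  __entry->css_id, __entry->nr_reclaimed)
);

With something like this, reading /sys/kernel/debug/tracing/trace would show
which cgroup each pool thread is reclaiming and how much, which speaks to the
visibility concern above.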

I don't think it's sane to create a kthread per memcg because we know there are
users who create hundreds or thousands of memcgs.

I think we need to think about the exact usage of 'thousands of cgroups' in
this case. Although not in much detail, Ying did say in a previous email that
they create thousands of cgroups on each box in Google's cluster and that most
of them _sleep_ most of the time. So I guess what they actually do is create a
large number of cgroups, each with different limits on various resources; then
at job-dispatch time they can choose a suitable group on some box and submit
the job into it, without touching the other thousands of sleeping groups. That
is to say, although Google has a huge number of groups on each box, only a few
jobs run on it at once, so it's unlikely that many groups are busy at the same
time.

If the above is correct, then I think Ying can call kthread_stop() the moment
we find there are no tasks left in a group, to kill that memcg's thread (as
the group is expected to sleep for a long time once all the jobs leave). In
this way we can keep the number of memcg threads small and not lose
debuggability.

What do you think?

Regards,
Zhu Yanhai
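
A rough sketch of that suggestion under the per-memcg-thread model, using the
real kthread_run()/kthread_stop() APIs; the mem_cgroup fields and helper names
here are hypothetical, not from the patch:

/* Start a dedicated kswapd when the first task enters the group. */
static int memcg_start_kswapd(struct mem_cgroup *memcg)
{
	/* memcg->kswapd and memcg->id are hypothetical fields */
	memcg->kswapd = kthread_run(memcg_kswapd_fn, memcg,
				    "memcg_kswapd/%d", memcg->id);
	if (IS_ERR(memcg->kswapd)) {
		int err = PTR_ERR(memcg->kswapd);

		memcg->kswapd = NULL;
		return err;
	}
	return 0;
}

/* Stop it when the last task leaves, so idle groups cost no thread. */
static void memcg_stop_kswapd(struct mem_cgroup *memcg)
{
	if (memcg->kswapd) {
		/*
		 * kthread_stop() wakes the thread and waits for it to exit;
		 * memcg_kswapd_fn() must poll kthread_should_stop() in its
		 * loop for this to work.
		 */
		kthread_stop(memcg->kswapd);
		memcg->kswapd = NULL;
	}
}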

And, I think that creating more threads doing the same job than the number of
CPUs will cause much more difficult starvation and priority-inversion issues.
Keeping the scheduling knobs/chances of jobs in memcg is important. I don't
want to give hints to the scheduler because of a memcg-internal issue.

And, even if memcg-kswapd doesn't exist, memcg works (well?).
memcg-kswapd just helps make things better; it doesn't do any critical job.
So, it's okay to have this as a best-effort service.
Of course, better scheduling ideas for picking up a memcg are welcome. It's
round-robin for now.
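
For concreteness, a round-robin pick over the wakeup list might look like the
following, using the kernel's <linux/list.h> helpers (the list, lock, and
member names are illustrative, not the patch's):

/*
 * Pop the first queued memcg and rotate it to the tail, so every
 * queued memcg gets a turn before any memcg is serviced twice.
 */
static struct mem_cgroup *memcg_pick_next(void)
{
	struct mem_cgroup *memcg = NULL;

	spin_lock(&memcg_reclaim_lock);
	if (!list_empty(&memcg_reclaim_list)) {
		memcg = list_first_entry(&memcg_reclaim_list,
					 struct mem_cgroup, reclaim_node);
		/* rotate to the tail: simple round-robin fairness */
		list_move_tail(&memcg->reclaim_node, &memcg_reclaim_list);
	}
	spin_unlock(&memcg_reclaim_lock);
	return memcg;
}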

Thanks,
-Kame


