Re: [PATCH 1/4] Add kswapd descriptor.


 



On Tue, Dec 7, 2010 at 4:39 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> On Tue, 7 Dec 2010 09:28:01 -0800
> Ying Han <yinghan@xxxxxxxxxx> wrote:
>
>> On Tue, Dec 7, 2010 at 4:33 AM, Mel Gorman <mel@xxxxxxxxx> wrote:
>
>> > Potentially there will
>> > also be a very large number of new IO sources. I confess I haven't read the
>> > thread yet so maybe this has already been thought of but it might make sense
>> > to have a 1:N relationship between kswapd and memcgroups and cycle between
>> > containers. The difficulty will be a latency between when kswapd wakes up
>> > and when a particular container is scanned. The closer the ratio is to 1:1,
>> > the less the latency will be, but the higher the contention on the LRU lock
>> > and IO will be.
>>
>> No, we haven't talked about that mapping anywhere in the thread. Having
>> many kswapd threads at the same time isn't a problem as long as there is
>> no locking contention (e.g., 1k kswapd threads on a 1k fake NUMA node
>> system). So breaking up the zone->lru_lock should work.
>>
>
> I'm the one who made zone->lru_lock shared, and a per-memcg lock would
> make the maintenance of memcg very bad; it would add many races.
> Otherwise we need to make memcg's LRU not synchronized with the zone's
> LRU, IOW, we need a completely independent LRU.
>
> I'd like to limit the number of kswapd-for-memcg threads if zone->lru_lock
> contention is problematic. memcg _can_ work without background reclaim.

>
> How about adding a per-node kswapd-for-memcg that reclaims pages at a memcg's
> request? Something like:
>
>        memcg_wake_kswapd(struct mem_cgroup *mem)
>        {
>                do {
>                        nid = select_victim_node(mem);
>                        /* ask kswapd to reclaim memcg's memory */
>                        ret = memcg_kswapd_queue_work(nid, mem); /* may return -EBUSY if very busy */
>                } while (ret == -EBUSY); /* e.g. retry with another victim node while busy */
>        }
>
> This will keep lock contention to a minimum. Anyway, using too much CPU for this
> unnecessary_but_good_for_performance_function is bad. Throttling is required.
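
To make sure I read the proposal right, here is a minimal userspace mock of
that wake/queue/throttle loop. Every name in it (select_victim_node,
memcg_kswapd_queue_work, MAX_QUEUE) is either taken from or invented for the
sketch above; none of it is real kernel code.

/*
 * Userspace mock of the wake/queue/throttle idea above.  All names are
 * illustrative only.
 */
#include <stdio.h>
#include <errno.h>

#define NR_NODES	4
#define MAX_QUEUE	8	/* throttle: work items allowed in flight per node */

struct mem_cgroup { int id; int last_node; };

static int queued[NR_NODES];	/* work items currently queued per node */

/* round-robin; a real version would only pick nodes the memcg has pages on */
static int select_victim_node(struct mem_cgroup *mem)
{
	mem->last_node = (mem->last_node + 1) % NR_NODES;
	return mem->last_node;
}

static int memcg_kswapd_queue_work(int nid, struct mem_cgroup *mem)
{
	if (queued[nid] >= MAX_QUEUE)
		return -EBUSY;	/* the kswapd for this node is saturated */
	queued[nid]++;
	printf("memcg %d: queued reclaim work on node %d\n", mem->id, nid);
	return 0;
}

static void memcg_wake_kswapd(struct mem_cgroup *mem)
{
	int tries = NR_NODES;	/* give up after one pass over the nodes */
	int nid, ret;

	do {
		nid = select_victim_node(mem);
		ret = memcg_kswapd_queue_work(nid, mem);
	} while (ret == -EBUSY && --tries);
}

int main(void)
{
	struct mem_cgroup mem = { .id = 1, .last_node = -1 };
	int i;

	for (i = 0; i < 40; i++)
		memcg_wake_kswapd(&mem);
	return 0;
}

The throttling point is the MAX_QUEUE check: once a node's worker is
saturated, the charge path simply gives up instead of burning CPU.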

I don't see a problem with one-kswapd-per-cgroup here, since there is
no performance cost when they are not running.

I haven't measured the lock contention and CPU time with each kswapd
running. Theoretically it would be a problem if thousands of cgroups are
configured on the host and all of them are under memory pressure.

We can either optimize the locking or make each kswapd smarter (hold
the lock for less time). My current plan is to have one-kswapd-per-cgroup
in the V2 patch w/ select_victim_node, and the optimization for this
will come in a following patchset.
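
Roughly, the per-cgroup background thread would be shaped like the sketch
below. It is again a userspace mock: watermark_ok, shrink_memcg_node and the
pthread plumbing are stand-ins, and the real version would be a kthread woken
from the charge path and using the existing reclaim code.

/*
 * Sketch of a per-cgroup background reclaim loop.  Everything here is
 * for illustration only.
 */
#include <pthread.h>
#include <stdbool.h>

#define NR_NODES	4

struct mem_cgroup {
	pthread_mutex_t	lock;
	pthread_cond_t	need_reclaim;	/* signalled by the charge path on pressure */
	long		usage;
	long		high_wmark;
	int		last_node;
};

static bool watermark_ok(struct mem_cgroup *mem)
{
	return mem->usage <= mem->high_wmark;
}

/* stand-in for "scan this memcg's LRU lists on node nid" */
static long shrink_memcg_node(struct mem_cgroup *mem, int nid, long nr_to_reclaim)
{
	(void)nid;
	mem->usage -= nr_to_reclaim;	/* pretend reclaim always succeeds */
	return nr_to_reclaim;
}

void *memcg_kswapd(void *arg)
{
	struct mem_cgroup *mem = arg;

	pthread_mutex_lock(&mem->lock);
	for (;;) {
		/* sleep until the charge path signals memory pressure */
		while (watermark_ok(mem))
			pthread_cond_wait(&mem->need_reclaim, &mem->lock);

		/* cycle victim nodes until we are back under the high watermark */
		while (!watermark_ok(mem)) {
			int nid = mem->last_node = (mem->last_node + 1) % NR_NODES;

			shrink_memcg_node(mem, nid, mem->usage - mem->high_wmark);
		}
	}
	return NULL;
}

The point of this shape is that an idle cgroup's thread just sits in the
wait and costs nothing, which is why one thread per cgroup is cheap until
many cgroups are actually reclaiming at the same time.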

--Ying




>
> Thanks,
> -Kame
>
>


