Kame:

Thank you for spending time on implementing the patch. It is definitely a good idea to have the two alternatives on the table, since people have asked about them. Before going further down this track, I thought about the two approaches and also discussed them with Greg and Hugh (cc-ed). I would like to clarify some of the pros and cons of both. In general, I think the workqueue is not the right answer for this purpose.

The thread-pool (workqueue) model

Cons:

1. There is no isolation between memcgs during background reclaim, since the worker threads are shared. That isolation covers all the resources that per-memcg background reclaim needs, like CPU time. What the shared worker model lacks is per-memcg CPU scheduling: the ability to isolate and account resource consumption per memcg, including how much CPU time the per-memcg kswapd work consumes and where it runs.

2. It hurts visibility and debuggability. We have repeatedly hit cases where some kswapd was running crazy and we needed a straightforward way to identify which cgroup was causing the reclaim. Yes, we could add more per-memcg stats to provide that visibility, but that adds overhead. Why introduce the overhead when a per-memcg kswapd thread offers it naturally?

3. Potential priority inversion between memcgs. Say we have two memcgs A and B on a single-core machine; A has a big chunk of work, B has a small chunk, and B's work is queued behind A's. In the workqueue model, we won't process B until we finish A's work, since we only have one worker on the single-core host. In the per-memcg kswapd model, B gets a chance to run whenever A calls cond_resched(). We might avoid this if we don't constrain the number of workers, but then in the worst case we have as many workers as memcgs, which is the same model as per-memcg kswapd.

4. The kswapd threads are created and destroyed dynamically. Are we really talking about allocating 8k of stack for kswapd while we are under memory pressure? In the per-memcg model, all that memory is preallocated.

5. The workqueue is scary and might introduce issues sooner or later. Also, why do we think background reclaim fits the workqueue model? More specifically, how does it share the same logic as the other parts of the system that use workqueues?

Pros:

1. Saves SOME memory.

The per-memcg-per-kswapd model

Cons:

1. Memory overhead per thread: the consumption would be 8k * 1000 = 8M with 1k cgroups. This is NOT a problem, or at least we haven't seen it in our production. We have machines running 2k kernel threads, and we haven't noticed resource-consumption or performance problems from that. On those systems, we might have ~100 cgroups running at a time.

2. We see lots of threads in 'ps -elf'. Well, is that really a problem that justifies changing the threading model?

Overall, the per-memcg-per-kswapd model is simple enough to provide better isolation (predictability and debuggability). The number of threads we might potentially have on the system is not a real problem: we already have systems running that many threads (even more) and we haven't seen problems from it. I can also imagine it making life easier for other memcg extensions later. For now, I would like to stick with the simple model, and at the same time I am willing to look into changes and fixes if we see problems later.

Comments?
Thanks

--Ying

On Mon, Apr 25, 2011 at 3:14 AM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> On Mon, 25 Apr 2011 18:25:29 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>
>> 2) == hard limit 500M / hi_watermark = 400M ==
>> [root@rhel6-test hilow]# time cp ./tmpfile xxx
>>
>> real 0m6.421s
>> user 0m0.059s
>> sys 0m2.707s
>>
>
> When doing this, we see usage changes as
> (sec)  (bytes)
>  0:    401408      <== cp starts
>  1:    98603008
>  2:    262705152
>  3:    433491968   <== wmark reclaim triggered.
>  4:    486502400
>  5:    507748352
>  6:    524189696   <== cp ends (and hits limit)
>  7:    501231616
>  8:    499511296
>  9:    477118464
> 10:    417980416   <== usage goes below watermark.
> 11:    417980416
> .....
>
> If we have dirty_ratio, this result will be somewhat different
> (and the flusher thread will work sooner...).
>
> Thanks,
> -Kame