Kame:

Thank you for spending time on implementing the patch. It is definitely a good idea to have the two alternatives on the table, since people have asked about them. Before going further down this track, I thought about the two approaches and also discussed them with Greg and Hugh (cc-ed). I would like to clarify some of the pros and cons of both. In general, I think the workqueue is not the right answer for this purpose.

The thread-pool (workqueue) model

Cons:

1. There is no isolation between memcgs during background reclaim, since the worker threads are shared. That isolation covers all the resources that per-memcg background reclaim needs, like CPU time. What the shared worker model lacks is per-memcg CPU scheduling: the ability to isolate and account resource consumption per memcg, including how much CPU time the per-memcg kswapd work consumes and where it runs.

2. It hurts visibility and debuggability. We have repeatedly hit cases where some kswapd was running crazy and we needed a straightforward way to identify which cgroup was causing the reclaim. Yes, we could add more per-memcg stats to provide that visibility, but that adds overhead. Why introduce the overhead when a per-memcg kswapd thread offers it naturally?

3. Potential priority inversion between memcgs. Say we have two memcgs A and B on a single-core machine; A has a big chunk of work, B has a small chunk, and B's work is queued behind A's. In the workqueue model, we won't process B until we finish A's work, since we only have one worker on the single-core host. In the per-memcg kswapd model, B gets a chance to run whenever A calls cond_resched(). We might avoid this if we don't constrain the number of workers, but then in the worst case we have as many workers as memcgs, which is the same model as per-memcg kswapd.

4. The kswapd threads are created and destroyed dynamically. Are we really talking about allocating 8k of stack for kswapd while we are under memory pressure? In the per-memcg model, all that memory is preallocated.

5. The workqueue is scary and might introduce issues sooner or later. Also, why do we think background reclaim fits the workqueue model? More specifically, how does it share the same logic as the other parts of the system that use workqueues?

Pros:

1. Saves SOME memory.

The per-memcg-per-kswapd model

Cons:

1. Memory overhead per thread: the consumption would be 8k * 1000 = 8M with 1k cgroups. This is NOT a problem, or at least we haven't seen it in our production. We have machines running 2k kernel threads, and we haven't noticed resource-consumption or performance problems from that. On those systems, we might have ~100 cgroups running at a time.

2. We see lots of threads in 'ps -elf'. Well, is that really a problem that justifies changing the threading model?

Overall, the per-memcg-per-kswapd model is simple enough to provide better isolation (predictability and debuggability). The number of threads we might potentially have on the system is not a real problem: we already have systems running that many threads (even more) and we haven't seen problems from it. I can also imagine it making life easier for other memcg extensions later. For now, I would like to stick with the simple model, and at the same time I am willing to look into changes and fixes if we see problems later.

Comments?
Thanks

--Ying

On Mon, Apr 25, 2011 at 3:14 AM, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> On Mon, 25 Apr 2011 18:25:29 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
>
>> 2) == hard limit 500M / hi_watermark = 400M ==
>> [root@rhel6-test hilow]# time cp ./tmpfile xxx
>>
>> real 0m6.421s
>> user 0m0.059s
>> sys 0m2.707s
>>
>
> When doing this, we see usage changes as
> (sec)  (bytes)
>  0:    401408      <== cp starts
>  1:    98603008
>  2:    262705152
>  3:    433491968   <== wmark reclaim triggered.
>  4:    486502400
>  5:    507748352
>  6:    524189696   <== cp ends (and hits limit)
>  7:    501231616
>  8:    499511296
>  9:    477118464
> 10:    417980416   <== usage goes below watermark.
> 11:    417980416
> .....
>
> If we have dirty_ratio, this result will be somewhat different
> (and the flusher thread will work sooner...).
>
> Thanks,
> -Kame