On Mon, Mar 14, 2011 at 1:23 PM, Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
> On Mon, Mar 14, 2011 at 11:29:17AM -0700, Greg Thelen wrote:
>
> [..]
>> > We could just crawl the memcg's page LRU and bring things under
>> > control that way, couldn't we?  That would fix it.  What were the
>> > reasons for not doing this?
>>
>> My rationale for pursuing bdi writeback was I/O locality.  I have
>> heard that per-page I/O has bad locality.  Per-inode bdi-style
>> writeback should have better locality.
>>
>> My hunch is that the best solution is a hybrid which uses a) bdi
>> writeback with a target memcg filter and b) the memcg lru as a
>> fallback to identify the bdi that needs writeback.  I think the
>> part a) memcg filtering is likely something like:
>> http://marc.info/?l=linux-kernel&m=129910424431837
>>
>> The part b) bdi selection should not be too hard assuming that
>> page-to-mapping locking is doable.
>
> Greg,
>
> IIUC, option b) seems to be going through the pages of a particular
> memcg, mapping each page to its inode, and starting writeback on that
> inode?

Yes.

> If yes, this might be reasonably good.  In the case when cgroups are
> not sharing inodes, it automatically maps one inode to one cgroup,
> and once a cgroup is over its limit, it starts writeback of its own
> inodes.
>
> In case an inode is shared, we get the case of one cgroup writing
> back the pages of another cgroup.  Well, I guess that could also be
> handled by the flusher thread, where a batch of pages is compared
> with the cgroup passed in the writeback structure.  I guess that
> might hurt us more than benefit us.

Agreed.  For now, just writing back the entire inode is probably fine.

> IIUC how option b) works, then we don't even need option a), where an
> N-level deep cache is maintained?

Originally I was thinking that bdi-wide writeback with a memcg filter
was a good idea.  But this may be unnecessarily complex.  Now I am
agreeing with you that option (a) may not be needed.  A memcg could
queue per-inode writeback using the memcg lru to locate inodes
(lru -> page -> inode) with something like this in
[mem_cgroup_]balance_dirty_pages():

	while (memcg_usage() >= memcg_fg_limit) {
		/* scan lru for a dirty page, then grab mapping & inode */
		inode = memcg_dirty_inode(cg);
		sync_inode(inode, &wbc);
	}

	if (memcg_usage() >= memcg_bg_limit) {
		/* queue per-memcg bg flush work item */
	}

Does this look sensible?
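
For what it's worth, here is a rough sketch of how the
lru -> page -> inode walk inside memcg_dirty_inode() could look.
Note that memcg_dirty_inode() and the for_each_memcg_lru_page()
iterator are hypothetical names, not existing kernel interfaces, and
the locking/refcounting is only a best-effort approximation:

	/*
	 * Sketch only: walk the memcg's LRU looking for a dirty file
	 * page, then pin and return the inode that owns it so the
	 * caller can sync_inode() it.
	 */
	static struct inode *memcg_dirty_inode(struct mem_cgroup *memcg)
	{
		struct page *page;
		struct address_space *mapping;

		/* assumed iterator over the memcg's file LRU lists */
		for_each_memcg_lru_page(page, memcg) {
			if (!PageDirty(page))
				continue;
			mapping = page_mapping(page);
			if (!mapping)
				continue;
			/*
			 * igrab() pins the inode; it may return NULL
			 * if the inode is already being freed.
			 */
			return igrab(mapping->host);
		}
		return NULL;
	}

The page could be truncated between the PageDirty() check and the
igrab(), so the real version would presumably need to recheck under
the mapping's lock, but this is the general shape I had in mind.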