On Sat, Apr 2, 2011 at 12:33 AM, Greg Thelen <gthelen@xxxxxxxxxx> wrote:
> On Fri, Apr 1, 2011 at 2:49 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>> On Fri, Apr 01, 2011 at 01:18:38PM -0400, Vivek Goyal wrote:
>>> On Fri, Apr 01, 2011 at 09:27:56AM +1100, Dave Chinner wrote:
>>> > On Thu, Mar 31, 2011 at 07:50:23AM -0700, Greg Thelen wrote:
>>> > > There is no context (memcg or otherwise) given to the bdi
>>> > > flusher.  After the bdi flusher checks the system-wide
>>> > > background limits, it uses the over_bg_limit list to find (and
>>> > > rotate) an over-limit memcg.  Using that memcg, the per-memcg
>>> > > per-bdi dirty inode list is walked to find inode pages to write
>>> > > back.  Once the memcg's dirty memory usage drops below the
>>> > > memcg threshold, the memcg is removed from the global
>>> > > over_bg_limit list.
>>> >
>>> > If you want controlled hand-off of writeback, you need to pass the
>>> > memcg that triggered the throttling directly to the bdi.  You
>>> > already know what both the bdi and the memcg that need writeback
>>> > are.  Yes, this needs concurrency at the BDI flush level to
>>> > handle, but see my previous email in this thread for that....
>>>
>>> Even with the memcg being passed around, I don't think we get rid
>>> of the global list lock.
>>
>> You need to - we're getting rid of global lists and locks from
>> writeback for scalability reasons, so any new functionality needs to
>> avoid global locks for the same reason.
>>
>>> The reason is that inodes are not exclusive to memory cgroups.
>>> Multiple memory cgroups might be writing to the same inode.  So the
>>> inode still remains in the global list, and memory cgroups will
>>> have a kind of pointer to it.
>>
>> So two dirty inode lists that have to be kept in sync?  That doesn't
>> sound particularly appealing.  Nor does it scale to an inode being
>> dirty in multiple cgroups.
>>
>> Besides, if you've got multiple memory cgroups dirtying the same
>> inode, then you cannot expect isolation between groups.  I'd
>> consider this a broken configuration in that case - how often does
>> this actually happen, and what is the use case for supporting it?
>>
>> Besides, the implication is that we'd have to break up contiguous
>> IOs in the writeback path simply because two sequential pages are
>> associated with different groups.  That's really nasty, and exactly
>> the opposite of all the write combining we try to do throughout the
>> writeback path.  Supporting this is also a mess, as we'd have to
>> touch quite a lot of filesystem code (i.e. .writepage(s)
>> implementations) to do this.
>>
>>> So to start writeback on an inode you would still have to take the
>>> global lock, IIUC.
>>
>> Why not simply bdi -> list of dirty cgroups -> list of dirty inodes
>> in cgroup, and go from there?  I mean, really all that cgroup-aware
>> writeback needs is just a new container for managing dirty inodes in
>> the writeback path and a method for selecting that container for
>> writeback, right?
>
> I feel compelled to optimize for multiple cgroups concurrently

Correction: I do NOT feel compelled to optimize for sharing...

> dirtying an inode.  I see sharing as legitimate if a file is handed
> off between jobs (cgroups), but I do not see concurrent writing as a
> common use case.  If anyone else feels this is a requirement, please
> speak up.  However, I would like the system to tolerate sharing,
> though it does not have to do so in an optimal fashion.
>
> Here are two approaches that do not optimize for sharing.
> Each approach does, however, try to tolerate sharing without falling
> over.
>
> Approach 1 (inspired by Dave's comments):
>
> bdi -> 1:N -> bdi_memcg -> 1:N -> bdi_memcg_dirty_inode
>
> * when setting I_DIRTY in a memcg, insert the inode into
>   bdi_memcg_dirty_inodes rather than b_dirty.
>
> * when clearing I_DIRTY, remove the inode from bdi_memcg_dirty_inodes.
>
> * balance_dirty_pages() -> mem_cgroup_balance_dirty_pages(memcg, bdi)
>   if over the bg limit, then queue memcg writeback to the bdi flusher.
>   if over the fg limit, then queue a memcg-waiting description to the
>   bdi flusher (IO-less throttle).
>
> * bdi_flusher(bdi):
>   using (bdi, memcg), write “some” of the bdi_memcg_dirty_inodes
>   list.  “Some” is for fairness.
>
>   if the bdi flusher is unable to bring the memcg's dirty usage below
>   the bg limit once the bdi_memcg_dirty_inodes list is empty, then it
>   needs to do “something” to make forward progress.  This can be
>   caused by either (a) the memcg dirtying multiple bdis, or (b) a
>   freeloading memcg dirtying inodes previously dirtied by another
>   memcg, so that the first dirtying memcg is the one that will write
>   them back.
>
> Case A) If a memcg dirties multiple bdis and then hits its bg limit,
> queue bg writeback for the bdi being written to.  This may not write
> back other useful bdis.  The system-wide background limit has a
> similar issue.  We could link the bdi_memcgs together and wake up
> peer bdis.  For now, defer the problem.
>
> Case B) Dirtying another cgroup’s dirty inode.  While this is not a
> common use case, it could happen.  Options to avoid lockup:
>
> + When an inode becomes dirty-shared, move the inode from the per-bdi
>   per-memcg bdi_memcg_dirty_inodes list to an otherwise unused
>   bdi-wide b_unknown_memcg_dirty (promiscuous inode) list.
>   b_unknown_memcg_dirty is written whenever memcg writeback is
>   invoked against the bdi.  When an inode is cleaned and later
>   redirtied it is added to the normal bdi_memcg_dirty_inodes list.
>
> + Considered: when a file page goes dirty, do not account the dirty
>   page to the memcg where the page was charged; instead recharge the
>   page to the memcg that the inode was billed to (via an inode
>   i_memcg field).  The inode would then need a memcg reference, which
>   would make memcg cleanup tricky.
>
> + Scan the memcg lru for dirty file pages -> associated inodes -> bdi
>   -> writeback(bdi, inode)
>
> + What if memcg dirty limits are simply ignored in case B?
>   Ineffective memcg background writeback would be queued as usage
>   grows.  Once the memcg foreground limit is hit, the task would
>   throttle, waiting for ineffective background writeback that never
>   catches up.  This could wait indefinitely.  One could argue that
>   the hung cgroup deserves this for writing to another cgroup’s
>   inode.  However, the other cgroup could be the troublemaker that
>   sneaks in to dirty the file and assumes dirty ownership before the
>   innocent (now hung) cgroup starts writing.  I am not worried about
>   making this optimal, just about making forward progress.  Fall back
>   to scanning the memcg lru looking for the inodes of dirty pages.
>   This may be expensive, but should only happen with dirty inodes
>   shared between memcgs.
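
To make the shape of approach 1 concrete, here is a rough sketch of the
containers I have in mind.  None of these names (struct bdi_memcg,
bm_dirty_inodes, the helpers) exist today; they are only illustrative
and not meant to apply to any particular tree:

/* one instance per (bdi, memcg) pair that has dirty inodes on this bdi */
struct bdi_memcg {
	struct list_head	bm_node;		/* on the bdi's list of bdi_memcgs */
	struct list_head	bm_dirty_inodes;	/* per-bdi per-memcg dirty inodes */
	struct mem_cgroup	*bm_memcg;
	bool			bm_over_bg_limit;	/* needs background writeback */
};

/*
 * backing_dev_info would grow the 1:N list of the above, e.g.:
 *	struct list_head	memcg_list;
 */

/* __mark_inode_dirty(): I_DIRTY set while the task is in a memcg */
static void bdi_memcg_inode_dirtied(struct bdi_memcg *bm, struct inode *inode)
{
	/* file on the (bdi, memcg) list instead of the bdi-wide b_dirty */
	list_move_tail(&inode->i_wb_list, &bm->bm_dirty_inodes);
}

/* writeback has cleared I_DIRTY */
static void bdi_memcg_inode_cleaned(struct inode *inode)
{
	list_del_init(&inode->i_wb_list);
}

mem_cgroup_balance_dirty_pages(memcg, bdi) would then only have to mark
the matching bdi_memcg over-limit and kick the bdi flusher, and the
flusher walks bm_dirty_inodes until the memcg drops below its
background threshold (or the list goes empty, which is where cases A
and B above come in).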
>
> Approach 2: do something even simpler:
>
> http://www.gossamer-threads.com/lists/linux/kernel/1347359#1347359
>
> * __set_page_dirty()
>
>   either set i_memcg=memcg or i_memcg=~0
>   no memcg reference needed; i_memcg is never dereferenced
>
> * mem_cgroup_balance_dirty_pages(memcg, bdi)
>
>   if over the bg limit, then queue the memcg to the bdi for
>   background writeback
>   if over the fg limit, then queue a memcg-waiting description to the
>   bdi flusher (IO-less throttle)
>
> * bdi_flusher(bdi)
>
>   if doing memcg writeback, scan b_dirty filtering with
>   is_memcg_inode(inode, memcg), which checks the i_memcg field and
>   returns true if i_memcg is either ~0 or memcg
>
>   if unable to get the memcg below its dirty memory limit:
>
>   + If the memcg dirties multiple bdis and then hits its bg limit,
>     queue bg writeback for the bdi being written to.  This may not
>     write back other useful bdis.  The system-wide background limit
>     has a similar issue.
>
> - con: If the degree of sharing exceeds the compile-time maximum
>   supported sharing degree (likely 1), then ANY writeback (per-memcg
>   or system-wide) will write back the over-shared inode.  This is
>   undesirable because it punishes innocent cgroups that are not
>   abusively sharing.
>
> - con: the entire b_dirty list has to be scanned, which may involve
>   skipping many inodes not in the over-limit cgroup.  A memcg
>   constantly hitting its limit would monopolize a bdi flusher.
>
> Both approaches are complicated by the (rare) possibility that an
> inode has been claimed (from a dirtying-memcg perspective) by memcg
> M1, but M2 later writes more dirty pages.  When M2 exceeds its dirty
> limit it would be nice to find the inode, even if this requires some
> extra work.
>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> david@xxxxxxxxxxxxx
>>
>
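P.S. For completeness, a similarly rough sketch of the approach 2
tagging described above.  i_memcg and is_memcg_inode() are placeholders
rather than existing fields/helpers, the id 0 is assumed to mean "not
yet claimed", and ~0 marks a shared inode:

#define I_MEMCG_SHARED	(~0U)	/* inode dirtied by more than one memcg */

/* __set_page_dirty(): remember which memcg is dirtying this inode */
static void inode_set_dirtying_memcg(struct inode *inode,
				     unsigned int memcg_id)
{
	if (inode->i_memcg == 0)
		inode->i_memcg = memcg_id;		/* first owner */
	else if (inode->i_memcg != memcg_id)
		inode->i_memcg = I_MEMCG_SHARED;	/* second memcg joined */
	/* no css reference is taken; i_memcg is never dereferenced */
}

/* bdi flusher: filter b_dirty while writing back on behalf of a memcg */
static bool is_memcg_inode(struct inode *inode, unsigned int memcg_id)
{
	return inode->i_memcg == memcg_id ||
	       inode->i_memcg == I_MEMCG_SHARED;
}

Because I_MEMCG_SHARED is a catch-all, any writeback pass (per-memcg or
system-wide) will pick up a shared inode, which is exactly the
"punishes innocent cgroups" con listed above.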