On Sun, Jan 22, 2012 at 09:27:14AM -0600, James Bottomley wrote:
> Since a lot of these are mm related; added linux-mm to cc list
>
> On Sun, 2012-01-22 at 15:50 +0200, Boaz Harrosh wrote:
> > [Targeted writeback (IO-less page-reclaim)]
> > Sometimes we need to write out a certain page or group of pages. It
> > would be nice to prioritize/start the writeback of these pages through
> > the regular writeback mechanism, instead of doing direct IO like today.
> >
> > This is actually related to the above, where we could have a
> > "write_now" time constant that raises the priority of that inode so it
> > is written first. We then also need the page info for the pages we want
> > written as part of that inode's IO. Today we usually start at the
> > lowest-indexed page of the inode, right? In targeted writeback we
> > should make sure the writeout covers the longest contiguous (aligned)
> > dirty region containing the targeted page.
> >
> > With this in place we can also move to an IO-less page reclaim that is
> > done entirely by the BDI writeback thread. (Need I say more?)
>
> All of the above are complex. The only reason for adding complexity in
> our writeback path should be because we can demonstrate that it's
> actually needed. In order to demonstrate this, you'd need performance
> measurements ... is there a plan to get these before the summit?

The situations that required writeback for reclaim to make progress have
shrunk a lot with this merge window, because the dirty limits now respect
page reserves and are enforced per zone.

What's left to evaluate are certain NUMA configurations where the dirty
pages are concentrated on a few nodes. Currently, we kick the flushers
from direct reclaim completely undirected, just "clean some pages,
please".
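For illustration, the range-selection heuristic quoted above ("the longest
contiguous (aligned) dirty region containing the targeted page") could look
something like the following. This is purely a userspace sketch, not kernel
code; the dirty bitmap, the 'align' parameter, and the function name are all
invented for this example:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: given a per-inode dirty-page bitmap, expand
 * outward from the targeted page to the longest run of contiguous
 * dirty pages, then trim the edges so the result stays aligned to
 * 'align' pages (think stripe or allocation-unit boundaries), as
 * long as the targeted page itself remains inside the range. */
static void targeted_range(const bool *dirty, size_t npages,
                           size_t target, size_t align,
                           size_t *start, size_t *end)
{
	size_t lo = target, hi = target;

	/* Walk left and right while pages stay dirty. */
	while (lo > 0 && dirty[lo - 1])
		lo--;
	while (hi + 1 < npages && dirty[hi + 1])
		hi++;

	/* Round the start up and the end (exclusive) down to the
	 * alignment; only keep the trimmed range if it still covers
	 * the targeted page. */
	size_t alo = (lo + align - 1) / align * align;
	size_t ahi = (hi + 1) / align * align;
	if (alo <= target && ahi > target) {
		lo = alo;
		hi = ahi - 1;
	}

	*start = lo;
	*end = hi;	/* inclusive */
}
```

The real decision would of course have to weigh seekiness and integrity
deadlines, but the point is that the writeout unit becomes "dirty range
around the target", not "inode from index 0".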
That works for systems up to a certain size, depending on the size of the
node relative to the system as a whole (the likelihood that the pages
cleaned are actually from the target node) and on how fast the backing
storage is (the cost of cleaning the 'wrong' pages).

So while the original problem still stands, its urgency may have been
reduced quite a bit, or the problem itself pushed into a corner where
workarounds (e.g. spreading dirty data more evenly) are more economical
than making writeback node-aware and dealing with all the implications:
we would still have to guarantee dirty cache expiration times for
integrity, and it could fail spectacularly when there is little or no
relationship between disk placement and memory placement (imagine
round-robin allocation of disk-contiguous dirty cache over a few nodes).

I agree with James: find scenarios where workarounds are not feasible but
that are important enough to justify the complexity. Otherwise, talking
about how to fix them is moot.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html