On Wed, Jun 23, 2010 at 11:06:04AM +0800, Dave Chinner wrote:
> On Wed, Jun 23, 2010 at 09:34:26AM +0800, Wu Fengguang wrote:
> > On Wed, Jun 23, 2010 at 06:45:51AM +0800, Dave Chinner wrote:
> > > On Tue, Jun 22, 2010 at 04:38:56PM +0200, Jan Kara wrote:
> > > > On Tue 22-06-10 10:31:24, Christoph Hellwig wrote:
> > > > > On Tue, Jun 22, 2010 at 09:52:34PM +0800, Wu Fengguang wrote:
> > > > > > 2) most writeback will be submitted by one per-bdi-flusher, so no worry
> > > > > > of cache bouncing (this also means the per CPU counter error is
> > > > > > normally bounded by the batch size)
> > > > >
> > > > > What counter are we talking about exactly? Once balance_dirty_pages
> > > > The new per-bdi counter I'd like to introduce.
> > > >
> > > > > stops submitting I/O the per-bdi flusher thread will in fact be
> > > > > the only thing submitting writeback, unless you count direct invocations
> > > > > of writeback_single_inode.
> > > > Yes, I agree that the per-bdi flusher thread should be the only thread
> > > > submitting lots of IO (there is direct reclaim or kswapd if we change
> > > > direct reclaim but those should be negligible). So does this mean that
> > > > also I/O completions will be local to the CPU running per-bdi flusher
> > > > thread? Because the counter is incremented from the I/O completion
> > > > callback.
> > >
> > > By default we set QUEUE_FLAG_SAME_COMP, which means we hand
> > > completions back to the submitter CPU during blk_complete_request().
> > > Completion processing is then handled by a softirq on the CPU
> > > selected for completion processing.
> >
> > Good to know about that, thanks!
> >
> > > This was done, IIRC, because it provided some OLTP benchmark 1-2%
> > > better results. It can, however, be turned off via
> > > /sys/block/<foo>/queue/rq_affinity, and there's no guarantee that
> > > the completion processing doesn't get handed off to some other CPU
> > > (e.g. via a workqueue) so we cannot rely on this completion
> > > behaviour to avoid cacheline bouncing.
> >
> > If rq_affinity does not work reliably somewhere in the IO completion
> > path, why not try to fix it?
>
> Because completion on the submitter CPU is not ideal for high
> bandwidth buffered IO.

Yes, there may be heavy post-processing for read data. For writes,
however, isn't it mainly the pre-processing that costs CPU? So perfect
rq_affinity should always benefit write IO?

> > Otherwise all the page/mapping/zone
> > cachelines covered by test_set_page_writeback()/test_clear_page_writeback()
> > (and more other functions) will also be bounced.
>
> Yes, but when the flusher thread is approaching being CPU bound for
> high throughput IO, bouncing cachelines to another CPU during
> completion costs far less in terms of throughput compared to
> reducing the amount of time available to issue IO on that CPU.

Yes, reasonable for reads.

> > Another option is to put atomic accounting into test_set_page_writeback()
> > ie. the IO submission path. This actually matches the current
> > balance_dirty_pages() behavior. It may then block on get_request().
> > The down side is, get_request() blocks until queue depth goes down
> > from nr_congestion_on to nr_congestion_off, which is not as smooth as
> > the IO completion path. As a result balance_dirty_pages() may get
> > delayed much more than necessary when there is only 1 waiter, and
> > wake up multiple waiters in bursts.
>
> Being reliant on the block layer queuing behaviour for VM congestion
> control is exactly the problem we are trying to avoid...

Yes, this is not a good option. The paragraph reads more like a
statement of a potential benefit of the proposed patch :)

Thanks,
Fengguang
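
The remark quoted at the top of the thread, that the per CPU counter error is normally bounded by the batch size when a single flusher does most of the updates, can be illustrated with a toy model. This is only a sketch under stated assumptions, not kernel code: the class `BatchedCounter` and the helper `max_error` are hypothetical names, and the folding rule (move the local delta into the shared count once its magnitude reaches the batch size) is an assumption about how a batched per-CPU counter typically behaves.

```python
class BatchedCounter:
    """Toy model of a per-CPU counter with batched folding (hypothetical, not kernel code)."""

    def __init__(self, ncpus, batch):
        self.batch = batch
        self.global_count = 0        # shared count, cheap to read but approximate
        self.local = [0] * ncpus     # per-CPU deltas not yet folded into global_count

    def add(self, cpu, amount=1):
        # Accumulate locally; touch the shared count only when the
        # local delta reaches the batch size.
        delta = self.local[cpu] + amount
        if abs(delta) >= self.batch:
            self.global_count += delta
            self.local[cpu] = 0
        else:
            self.local[cpu] = delta

    def read_approx(self):
        # Fast read: ignores deltas still held per CPU.
        return self.global_count

    def read_exact(self):
        # Slow read: folds in every per-CPU delta.
        return self.global_count + sum(self.local)


def max_error(ncpus, batch, events):
    """Replay a sequence of CPU ids (one increment each) and return the
    largest gap seen between the exact and the approximate value."""
    counter = BatchedCounter(ncpus, batch)
    worst = 0
    for cpu in events:
        counter.add(cpu)
        worst = max(worst, abs(counter.read_exact() - counter.read_approx()))
    return worst


if __name__ == "__main__":
    # All completions on one CPU (the single-flusher case) versus
    # completions spread round-robin over four CPUs.
    print(max_error(4, 32, [0] * 1000))
    print(max_error(4, 32, [i % 4 for i in range(1000)]))
```

In this model, when every update lands on one CPU the error stays below the batch size, whereas updates spread over N CPUs can accumulate up to N times that. This is why it matters in the discussion above whether completions really stay on the submitter CPU.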