Re: [PATCH RFC] mm: Implement balance_dirty_pages() through waiting for flusher thread

On Wed, Jun 23, 2010 at 02:03:19PM +0800, Dave Chinner wrote:
> On Wed, Jun 23, 2010 at 11:22:13AM +0800, Wu Fengguang wrote:
> > On Wed, Jun 23, 2010 at 11:06:04AM +0800, Dave Chinner wrote:
> > > On Wed, Jun 23, 2010 at 09:34:26AM +0800, Wu Fengguang wrote:
> > > > On Wed, Jun 23, 2010 at 06:45:51AM +0800, Dave Chinner wrote:
> > > > > On Tue, Jun 22, 2010 at 04:38:56PM +0200, Jan Kara wrote:
> > > > > > On Tue 22-06-10 10:31:24, Christoph Hellwig wrote:
> > > > > > > On Tue, Jun 22, 2010 at 09:52:34PM +0800, Wu Fengguang wrote:
> > > > > > > > 2) most writeback will be submitted by one per-bdi-flusher, so no worry
> > > > > > > >    about cacheline bouncing (this also means the per-CPU counter error is
> > > > > > > >    normally bounded by the batch size)
> > > > > > > 
> > > > > > > What counter are we talking about exactly?  Once balance_dirty_pages
> > > > > >   The new per-bdi counter I'd like to introduce.
> > > > > > 
> > > > > > > stops submitting I/O the per-bdi flusher thread will in fact be
> > > > > > > the only thing submitting writeback, unless you count direct invocations
> > > > > > > of writeback_single_inode.
> > > > > >   Yes, I agree that the per-bdi flusher thread should be the only thread
> > > > > > submitting lots of IO (there is also direct reclaim, or kswapd if we change
> > > > > > direct reclaim, but those should be negligible). So does this mean that
> > > > > > I/O completions will also be local to the CPU running the per-bdi flusher
> > > > > > thread? Because the counter is incremented from the I/O completion
> > > > > > callback.
> > > > > 
> > > > > By default we set QUEUE_FLAG_SAME_COMP, which means we hand
> > > > > completions back to the submitter CPU during blk_complete_request().
> > > > > Completion processing is then handled by a softirq on the CPU
> > > > > selected for completion processing.
> > > > 
> > > > Good to know about that, thanks!
> > > > 
> > > > > This was done, IIRC, because it gave 1-2% better results on some
> > > > > OLTP benchmark. It can, however, be turned off via
> > > > > /sys/block/<foo>/queue/rq_affinity, and there's no guarantee that
> > > > > the completion processing doesn't get handed off to some other CPU
> > > > > (e.g. via a workqueue), so we cannot rely on this completion
> > > > > behaviour to avoid cacheline bouncing.
> > > > 
> > > > If rq_affinity does not work reliably somewhere in the IO completion
> > > > path, why not try to fix it?
> > > 
> > > Because completion on the submitter CPU is not ideal for high
> > > bandwidth buffered IO.
> > 
> > Yes, there may be heavy post-processing for read data, but for writes
> > isn't it mainly the pre-processing that costs CPU?
> 
> Could be either - delayed allocation requires significant pre-processing
> for allocation. Avoiding this by using preallocation just
> moves the processing load to IO completion, which needs to issue
> transactions to mark the region written.

Good point, thanks.
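
Stepping back to the counter itself for a moment, here is a minimal
sketch of the kind of per-bdi counter Jan mentions (illustrative only,
not from an actual patch: it assumes a new BDI_WRITTEN item in the
existing bdi_stat[] percpu counters, and the function name is made up):

	/*
	 * Bump a per-bdi percpu counter of pages whose writeback has
	 * completed.  Updates stay CPU-local until a per-CPU delta
	 * reaches BDI_STAT_BATCH, which is what bounds the counter
	 * error by the batch size, as noted earlier in the thread.
	 */
	static inline void bdi_account_written(struct backing_dev_info *bdi,
					       long nr_pages)
	{
		__percpu_counter_add(&bdi->bdi_stat[BDI_WRITTEN], nr_pages,
				     BDI_STAT_BATCH);
	}

balance_dirty_pages() could then sleep until percpu_counter_read() (or
percpu_counter_sum(), when precision matters) shows enough progress.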

> > So perfect rq_affinity
> > should always benefit write IO?
> 
> No, because the flusher thread gets to be CPU bound just writing
> pages, allocating blocks and submitting IO. It might take 5-10GB/s
> to get there (say a million dirty pages a second being processed by
> a single CPU), but that's the sort of storage subsystem XFS is
> capable of driving. IO completion time for such a workload is
> significant, too, so putting that on the same CPU as the flusher
> thread will slow things down by far more than we gain from avoiding
> cacheline bouncing.

So super-fast storage is going to demand multiple flushers per bdi.
And once we run multiple flushers for one bdi, it will again be
beneficial to schedule IO completion on the flusher CPU :)
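
As a side note, the rq_affinity steering Dave describes above boils
down to something like the following (a simplified paraphrase of the
block layer's softirq completion path, cf. blk-softirq.c, not the
literal code):

	/* In the request completion path: */
	cpu = smp_processor_id();
	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1)
		ccpu = req->cpu;	/* hand back to the submitting CPU */
	else
		ccpu = cpu;		/* complete on the interrupted CPU */
	/* BLOCK_SOFTIRQ is then raised on ccpu, via an IPI if remote. */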

However, in this case the per-bdi atomic counter (and some other shared
state) will be bounced among the flusher CPUs...
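
To make the bouncing concern concrete (an illustration, not measured
data): with several flusher CPUs, a plain atomic ping-pongs its
cacheline on every update, while a percpu counter touches shared state
only about once per batch per CPU:

	atomic_long_t written_shared;		/* one shared cacheline,
						   bounced on every remote add */
	struct percpu_counter written_pcpu;	/* CPU-local deltas, folded
						   into the shared count
						   once per batch */

	static void account_written_shared(long nr)
	{
		atomic_long_add(nr, &written_shared);	/* always bounces */
	}

	static void account_written_pcpu(long nr)
	{
		/* hits shared state ~once per BDI_STAT_BATCH updates */
		__percpu_counter_add(&written_pcpu, nr, BDI_STAT_BATCH);
	}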

> > > > Otherwise all the page/mapping/zone
> > > > cachelines covered by test_set_page_writeback()/test_clear_page_writeback()
> > > > (and other functions) will also be bounced.
> > > 
> > > Yes, but when the flusher thread is approaching being CPU bound for
> > > high throughput IO, bouncing cachelines to another CPU during
> > > completion costs far less in throughput than reducing the amount
> > > of time available to issue IO on that CPU.
> > 
> > Yes, reasonable for reads.
> 
> I was talking about writes - the flusher threads don't do any reading ;)

Ah, sorry for missing the words "flusher thread" :)

Thanks,
Fengguang