Re: [PATCH RFC] mm: Implement balance_dirty_pages() through waiting for flusher thread

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 23 Jun 2010 08:29:32 +1000

On Tue, Jun 22, 2010 at 04:02:59PM +0200, Jan Kara wrote:
> On Tue 22-06-10 21:52:34, Wu Fengguang wrote:
> > >   On the other hand I think we will have to come up with something
> > > more clever than what I do now because for some huge machines with
> > > nr_cpu_ids == 256, the error of the counter is 256*9*8 = 18432 so that's
> > > already unacceptable given the amounts we want to check (like 1536) -
> > > already for nr_cpu_ids == 32, the error is the same as the difference we
> > > want to check.  I think we'll have to come up with some scheme whose error
> > > is not dependent on the number of cpus or if it is dependent, it's only a
> > > weak dependency (like a logarithm or so).
> > >   Or we could rely on the fact that IO completions for a bdi won't happen on
> > > all CPUs and thus the error would be much more bounded. But I'm not sure
> > > how much that is true or not.
> > 
> > Yes the per CPU counter seems tricky. How about plain atomic operations? 
> > 
> > This test shows that atomic_dec_and_test() is about 4.5 times slower
> > than plain i-- in a 4-core CPU. Not bad.

It's not how fast an uncontended operation runs that matters - it's
what happens when it is contended by lots of CPUs. In my experience,
atomics in writeback paths scale to medium sized machines (say
16-32p) but bottleneck on larger configurations due to the increased
cost of cacheline propagation on larger machines.

> > Note that
> > 1) we can avoid the atomic operations when there are no active waiters

Under heavy IO load there will always be waiters.

> > 2) most writeback will be submitted by one per-bdi-flusher, so no worry
> >    of cache bouncing (this also means the per CPU counter error is
> >    normally bounded by the batch size)
>   Yes, writeback will be submitted by one flusher thread but the question
> is rather where the writeback will be completed. And that depends on which
> CPU that particular irq is handled. As far as my weak knowledge of HW goes,
> this very much depends on the system configuration (i.e., irq affinity and
> other things).

And how many paths to the storage you are using, how threaded the
underlying driver is, whether it is using MSI to direct interrupts to
multiple CPUs instead of just one, etc.

As we scale up we're more likely to see multiple CPUs doing IO
completion for the same BDI because the storage configs are more
complex in high end machines. Hence IMO preventing cacheline
bouncing between submission and completion is a significant
scalability concern.
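
To put numbers on the per-CPU counter error Jan quotes above, here is a
rough userspace sketch of the arithmetic. It assumes the per-CPU batch
is 8 * (1 + ilog2(nr_cpu_ids)), which is what the 256*9*8 figure
implies; the helper names are made up for illustration and are not
kernel functions.

```c
#include <assert.h>

/* Integer log2, rounded down; a stand-in for the kernel's ilog2(). */
static unsigned int ilog2_u(unsigned int n)
{
	unsigned int log = 0;

	assert(n > 0);	/* log2(0) is undefined */
	while (n >>= 1)
		log++;
	return log;
}

/* Assumed per-CPU batch size: 8 * (1 + ilog2(nr_cpus)). */
static unsigned int stat_batch(unsigned int nr_cpus)
{
	return 8 * (1 + ilog2_u(nr_cpus));
}

/*
 * Worst case: every CPU is holding back up to one full batch that has
 * not yet been folded into the global count.
 */
static unsigned int worst_case_error(unsigned int nr_cpus)
{
	return nr_cpus * stat_batch(nr_cpus);
}
```

With these assumptions worst_case_error(256) is 18432 and
worst_case_error(32) is 1536, i.e. the two figures from the quoted
mail. The error grows as n * log(n) in the number of CPUs, which is why
it gets out of hand on large machines.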

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
