Re: [RFC PATCH 0/2] more raid456 thread pool experimentation

On Wed, Mar 24, 2010 at 11:06:40AM -0700, Dan Williams wrote:
> On Wed, Mar 24, 2010 at 8:51 AM, Chris Mason <chris.mason@xxxxxxxxxx> wrote:
> > On Wed, Mar 24, 2010 at 07:53:10AM -0700, Dan Williams wrote:
> >> The current implementation with the async thread pool ends up spreading
> >> the work over too many threads.  The btrfs workqueue is targeted at high
> >> cpu utilization workloads and has a threshold mechanism to limit thread
> >> spawning.  Unfortunately it still ends up increasing cpu utilization
> >> without a comparable improvement in throughput.  Here are the numbers
> >> relative to the multicore-disabled case:
> >>
> >> idle_thresh   throughput      cycles
> >> 4             +0%             +102%
> >> 64            +4%             +63%
> >> 128           +1%             +45%
> >
> > Interesting, do the btrfs workqueues improve things?  Or do you think they are
> > just a better base for more tuning?
> 
> Both, throughput falls off a cliff with the async thread pool, and
> there are more knobs to turn in this implementation.

Ok, it's really interesting that it's only getting you 4% faster throughput
at 63% more cycles.  That looks like something more than just cache
affinity, but at least for btrfs I always had bigger problems than
affinity, so I haven't dug into that part yet.
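
Just so the table reads on its own: idle_thresh is the only knob changing
between those three runs.  Going from memory of the btrfs-side names here
(the btr_queue patch renames things, and raid_pool_init below is made up
for the sketch), the setup is roughly:

static struct btrfs_workers helper;	/* always-running start helper */
static struct btrfs_workers pool;

static int raid_pool_init(int nr_threads)
{
	/* the helper pool exists so thread starts can be deferred to it */
	btrfs_init_workers(&helper, "raid-helper", 1, NULL);
	btrfs_init_workers(&pool, "raid456", nr_threads, &helper);

	/*
	 * don't spawn another thread until the busiest worker has this
	 * many items queued -- the 4/64/128 column in the table above
	 */
	pool.idle_thresh = 64;

	btrfs_start_workers(&helper, 1);
	return btrfs_start_workers(&pool, 1);
}

Work is then handed in with btrfs_queue_worker() on a struct btrfs_work
whose ->func does the actual stripe handling.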

> 
> > I had always hoped to find more users for the work queues and tried to
> > keep btrfs specific features out of them.  The place I didn't entirely
> > succeed was in the spin locks: the ordered queues take regular spin
> > locks to avoid turning off irqs, since btrfs always does things outside
> > of interrupt time.  Doesn't look like raid needs the ordered queues, so
> > this should work pretty well.
> >
> >>
> >> This appears to show that something more fundamental needs to happen to
> >> take advantage of percpu raid processing.  More profiling is needed, but
> >> the suspects in my mind are conf->device_lock contention and the fact
> >> that all work is serialized through conf->handle_list with no method for
> >> encouraging stripe_head-to-thread affinity.
> >
> > The big place I'd suggest looking at inside the btrfs async-thread.c for
> > optimization is worker_loop().  For work that tends to be bursty and
> > relatively short, we can have worker threads finish their work fairly
> > quickly and go to sleep, only to be woken up very quickly again with
> > another big batch of work.  The worker_loop code tries to wait around
> > for a while, but the tuning here was btrfs specific.
> >
> > It might also help to tune the find_worker and next_worker code to prefer
> > giving work to threads that are running but almost done with their
> > queue.  Maybe they can be put onto a special hot list as they get near
> > the end of their queue.
> >
> 
> Thanks, I'll take a look at these suggestions.  For these optimizations
> to have a chance I think we need stripes to maintain affinity with the
> first core that picks up the work.  Currently all stripes take a trip
> through the single-threaded raid5d when their reference count drops to
> zero, only to be immediately reissued to the thread pool potentially
> on a different core (but I need to back this assumption up with more
> profiling).

Well, I'd definitely say the pool is going to give you a random
core.
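
If you want a cheap affinity experiment, one option is to remember which
cpu last ran handle_stripe() for a given stripe and hand it back to a
worker pinned there, instead of whatever the round-robin pick gives you.
Completely hypothetical sketch -- sh->cpu, conf->workers, sh->work and a
btr_queue_worker_on() helper would all have to be added first:

/* hypothetical: none of these fields or helpers exist yet */
static void raid5_queue_stripe(raid5_conf_t *conf, struct stripe_head *sh)
{
	if (sh->cpu < 0)
		sh->cpu = raw_smp_processor_id();

	/* prefer a worker already bound to the cpu that last touched
	 * this stripe over the default round-robin pick */
	btr_queue_worker_on(&conf->workers, sh->cpu, &sh->work);
}

That at least keeps a stripe's cache lines on one core once it has been
picked up, without touching the device_lock side of the problem.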

> 
> > There's a rule somewhere that patches renaming things must have replies
> > questioning the new name.  The first reply isn't actually allowed to
> > suggest a better name, which is good because I'm not very good at
> > that kind of thing.
> >
> > Really though, btr_queue is fine by me, but don't feel obligated to keep
> > some variation of btrfs in the name ;)
> 
> btr_queue seemed to make sense since it's spreading work like "butter" :-).

Grin, it doesn't take many butter jokes to sell me.  Careful though, you
can't have just one buttery thread pool.  Just wait, you'll think of
good reasons to add another real soon.

Speaking of which, you want to use the atomic thread-starting helper.
Basically kthread_run() does GFP allocations that can wait on the disk, and
if you're the disk it is waiting on, things can go badly in a hurry.
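
The shape of it in the btrfs copy is roughly the following (field names
from memory, maybe_spawn_worker() is a made-up name, the real logic is in
async-thread.c): the queueing path never calls kthread_run() itself when
the pool was set up with a helper, it just flags that a start is pending
and lets the helper, which is already running, do the spawn from its own
context where blocking in an allocation is harmless.

/* approximate sketch -- see async-thread.c for the real thing */
static void maybe_spawn_worker(struct btrfs_workers *workers)
{
	if (workers->atomic_worker_start) {
		/*
		 * can't kthread_run() from here: its allocations can
		 * end up waiting on writeback to the very device this
		 * pool is servicing.  Flag it for the helper instead.
		 */
		workers->atomic_start_pending = 1;
	} else {
		btrfs_start_workers(workers, 1);
	}
}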

We should probably integrate a single kthread_run helper thread into the
pools instead of forcing people to make their own.  I'll take a look.

-chris
