On Wed, 23 May 2012 15:26:21 +0800 Shaohua Li <shli@xxxxxxxxxx> wrote:

> In raid1, all write requests are dispatched in raid1d thread. In fast storage,
> the raid1d thread is a bottleneck, because it dispatches request too slow. Also
> raid1d thread migrates freely, which makes request completion cpu not match
> with submission cpu even driver/block layer has such capability. This will
> cause bad cache issue.
>
> If bitmap support is enabled, write requests can only be dispatched after dirty
> bitmap is flushed out. After bitmap is flushed, how write requests are
> dispatched doesn't impact correctness. A natural idea is to distribute request
> dispatch to several threads. With this patch, requests are added to a percpu
> list first. After bitmap is flushed, then the percpu list requests will
> dispatched in a workqueue. In this way, above bottleneck is removed.
>
> In a 4k randwrite test with a 2 disks setup, below patch can provide 10% ~ 50%
> performance improvements depending on numa binding.

Those numbers are quite impressive, so there is certainly room for improvement
here.

I'm not sure that I'm entirely comfortable with the approach though.
Passing the request to a per-cpu thread does make sense, but a generic per-cpu
thread feels dangerous, as we don't know what else might be queued to that
thread, and because of the potential for deadlocks between memory allocation
and generic_make_request (as mentioned in a previous email) I find it hard to
convince myself that this approach is entirely safe.

I wonder if we might take a very different approach and try to do everything
in the one process, i.e. not hand tasks off to other threads at all - at least
in the common case.

So we could change plugger_unplug (in md.c) so that:

 - if current->bio_list is NULL (meaning all requests have been submitted and
   there is no risk of deadlock), we call bitmap_unplug and then submit all
   the queued writes.
 - if current->bio_list is not NULL, we just wake up the md thread to do the
   work.

In the common case bio_list should be NULL (I think) and everything is done in
the one process. It is only when memory is tight and we reschedule while
waiting for memory that the md thread will be called on, and hopefully the
slower performance then won't be so noticeable.

We probably want separate queues of pending writes for each CPU (or each
thread) so that each "plugger_unplug" only submits the writes that were queued
by that process - so that if multiple threads are submitting writes at the
same time, they don't end up handling each other's requests.

This might end up flushing the bitmap slightly more often - the current code
allocates one "md_plug_cb" for each different thread that is concurrently
writing to a given md device, and doesn't trigger the flush until all of them
unplug. My proposal above would trigger it when any thread performs the final
unplug. I wonder if that makes any difference in practice.

Anyway, could you explore the possibility of submitting the requests in
plugger_unplug when current->bio_list is NULL? Possibly you could:

 - get mddev_check_plugged to return the md_plug_cb
 - link all the md_plug_cb's for one mddev together
 - store the list of queued write requests in that md_plug_cb

Make sense?

Thanks,
NeilBrown
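
For concreteness, here is a minimal sketch of what the plugger_unplug change
described above might look like, assuming the 2012-era md.c structures. The
per-callback write list (the "writes" field) and the md_requeue_writes()
helper are hypothetical names introduced for illustration; they are not part
of the existing code.

struct md_plug_cb {
	struct blk_plug_cb cb;
	struct mddev *mddev;
	struct bio_list writes;	/* hypothetical: writes queued by this process */
};

static void plugger_unplug(struct blk_plug_cb *cb)
{
	struct md_plug_cb *mdcb = container_of(cb, struct md_plug_cb, cb);
	struct mddev *mddev = mdcb->mddev;

	if (current->bio_list == NULL) {
		/* All bios from this process have been submitted, so there
		 * is no deadlock risk: flush the bitmap and submit the
		 * writes this process queued, right here in this context. */
		bitmap_unplug(mddev->bitmap);
		while (!bio_list_empty(&mdcb->writes))
			generic_make_request(bio_list_pop(&mdcb->writes));
	} else {
		/* Still inside generic_make_request (e.g. we rescheduled
		 * while waiting for memory).  Hand the queued writes back
		 * to the device so the md thread can flush them, then wake
		 * it up.  md_requeue_writes() is a hypothetical helper. */
		md_requeue_writes(mddev, &mdcb->writes);
		md_wakeup_thread(mddev->thread);
	}
	kfree(mdcb);
}

The other half of the change - having mddev_check_plugged return the
md_plug_cb so the caller can add each write bio to mdcb->writes instead of
the shared pending list - is not shown here.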