On Wed, 23 May 2012 15:26:21 +0800 Shaohua Li <shli@xxxxxxxxxx> wrote:

> In raid1, all write requests are dispatched in raid1d thread. In fast storage,
> the raid1d thread is a bottleneck, because it dispatches request too slow. Also
> raid1d thread migrates freely, which makes request completion cpu not match
> with submission cpu even driver/block layer has such capability. This will
> cause bad cache issue.
>
> If bitmap support is enabled, write requests can only be dispatched after dirty
> bitmap is flushed out. After bitmap is flushed, how write requests are
> dispatched doesn't impact correctness. A natural idea is to distribute request
> dispatch to several threads. With this patch, requests are added to a percpu
> list first. After bitmap is flushed, then the percpu list requests will
> dispatched in a workqueue. In this way, above bottleneck is removed.
>
> In a 4k randwrite test with a 2 disks setup, below patch can provide 10% ~ 50%
> performance improvements depending on numa binding.

Those numbers are quite impressive, so there is certainly room for improvement
here.

I'm not sure that I'm entirely comfortable with the approach though.
Passing the request to a per-cpu thread does make sense, but a generic per-cpu
thread feels dangerous, as we don't know what else might be queued to that
thread, and because of the potential for deadlocks between memory allocation
and generic_make_request (as mentioned in a previous email) I find it hard to
convince myself that this approach is entirely safe.

I wonder if we might take a very different approach and try to do everything
in the one process, i.e. not hand tasks off to other threads at all - at least
in the common case.

So we could change plugger_unplug (in md.c) so that:

 - if current->bio_list is NULL (meaning all requests have been submitted and
   there is no risk of deadlock), we call bitmap_unplug and then submit all
   the queued writes.
 - if current->bio_list is not NULL, we just wake up the md thread to do the
   work.

In the common case bio_list should be NULL (I think) and everything is done in
the one process. It is only when memory is tight and we reschedule while
waiting for memory that the md thread will be called on, and hopefully the
slower performance then won't be so noticeable.

We probably want separate queues of pending writes for each CPU (or each
thread) so that each "plugger_unplug" only submits the writes that were queued
by that process - so that if multiple threads are submitting writes at the
same time, they don't end up handling each other's requests.

This might end up flushing the bitmap slightly more often - the current code
allocates one "md_plug_cb" for each different thread that is concurrently
writing to a given md device, and doesn't trigger the flush until all of them
unplug. My proposal above would trigger it when any thread performs the final
unplug. I wonder if that makes any difference in practice.

Anyway, could you explore the possibility of submitting the requests in
plugger_unplug when current->bio_list is NULL? Possibly you could:

 - get mddev_check_plugged to return the md_plug_cb
 - link all the md_plug_cb's for one mddev together
 - store the list of queued write requests in that md_plug_cb

Make sense?

Thanks,
NeilBrown
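
For concreteness, here is a minimal sketch of what the plugger_unplug change
described above might look like, assuming the 2012-era md.c structures. The
per-callback write list (the "writes" field) and the md_requeue_writes()
helper are hypothetical names introduced for illustration; they are not part
of the existing code.

struct md_plug_cb {
	struct blk_plug_cb cb;
	struct mddev *mddev;
	struct bio_list writes;	/* hypothetical: writes queued by this process */
};

static void plugger_unplug(struct blk_plug_cb *cb)
{
	struct md_plug_cb *mdcb = container_of(cb, struct md_plug_cb, cb);
	struct mddev *mddev = mdcb->mddev;

	if (current->bio_list == NULL) {
		/* All bios from this process have been submitted, so there
		 * is no deadlock risk: flush the bitmap and submit the
		 * writes this process queued, right here in this context. */
		bitmap_unplug(mddev->bitmap);
		while (!bio_list_empty(&mdcb->writes))
			generic_make_request(bio_list_pop(&mdcb->writes));
	} else {
		/* Still inside generic_make_request (e.g. we rescheduled
		 * while waiting for memory).  Hand the queued writes back
		 * to the device so the md thread can flush them, then wake
		 * it up.  md_requeue_writes() is a hypothetical helper. */
		md_requeue_writes(mddev, &mdcb->writes);
		md_wakeup_thread(mddev->thread);
	}
	kfree(mdcb);
}

The other half of the change - having mddev_check_plugged return the
md_plug_cb so the caller can add each write bio to mdcb->writes instead of
the shared pending list - is not shown here.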