Re: Question about RAID1 plug/unplug code

Alexander Lyakas <alex.bolshoy@xxxxxxxxx> · Wed, 10 Sep 2014 11:01:30 +0300

Hello Neil,

On Tue, Sep 9, 2014 at 12:45 PM, NeilBrown <neilb@xxxxxxx> wrote:
> On Tue, 9 Sep 2014 11:33:13 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Hi Neil,
>>
>>
>> On Tue, Sep 9, 2014 at 4:45 AM, NeilBrown <neilb@xxxxxxx> wrote:
>> > On Mon, 8 Sep 2014 16:55:52 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
>> > wrote:
>> >
>> >> Hi Neil,
>> >> We have been seeing high latency on the md/raid1 block device, due to
>> >> the fact that all WRITEs are handed off to raid1d thread. This thread
>> >> also calls bitmap_unplug(), which writes the bitmap synchronously.
>> >> While it waits for the bitmap, it cannot trigger other WRITEs waiting
>> >> in its pending_bio_list. This is especially seen with SSDs: MD's
>> >> latency is much higher that SSD latency (I have been stoned by Peter
>> >> Grandi when I brought up this issue previously for raid5).
>> >>
>> >> Then I have noticed the commit:
>> >>
>> >> commit f54a9d0e59c4bea3db733921ca9147612a6f292c
>> >> Author: NeilBrown <neilb@xxxxxxx>
>> >> Date:   Thu Aug 2 08:33:20 2012 +1000
>> >>
>> >>     md/raid1: submit IO from originating thread instead of md thread.
>> >>
>> >> Looking at the code, I learned that to avoid switching into raid1d,
>> >> the caller has to use blk_start_plug/blk_finish_plug. So I added these
>> >> calls in our kernel module, which submits bios to MD. Results were
>> >> awesome, MD latency got down significantly.
>> >
>> > That's good to hear.
>> >
>> >>
>> >> So I have several questions about this plug/unplug thing.
>> >>
>> >> 1/ Originally this infrastructure was supposed to help IO schedulers
>> >> in merging requests. It is useful when one has a bunch of requests to
>> >> submit in one shot.
>> >
>> > That is exactly the whole point of plugging:  allow the device to handle a
>> > batch of requests together instead of one at a time.
>> >
>> >> But in MD case, thus infrastructure is used for a different purpose:
>> >> not to merge requests (which may help bandwidth, but probably not
>> >> latency), but to avoid making raid1d a bottleneck, to be able to
>> >> submit requests from multiple threads in parallel, which brings down
>> >> latency significantly in our case. Indeed "struct blk_plug" has a
>> >> special "cb_list", which is used only by MD.
>> >
>> > I don't think the way md uses plugging is conceptually different from any
>> > other use: it is always about gathering a batch together.
>> > "cb_list" is handled by blk_check_plugged() which is also used by
>> > block/umem.c and btrfs.
>> >
>> > The base plugging code assumes that it is only gathering a batch of requests
>> > for a single device - if the target device changes then the batch is flushed.
>> > It also assumed that it was "struct request" that was batched.
>> > Devices like md that want to queue 'struct bio', something else was needed.
>> > Also with layered devices it can be useful to gather multiple batches for
>> > multiple layers.
>> > So I created "cb_list" etc and a more generic interface.
>> >
>> >> In my case I have only individual bios (not a bunch of bios), and I
>> >> after wrap them with plug/unplug, MD latency gets better. So we are
>> >> using the plug infrastructure for a different purpose.
>> >> Is my understanding correct? Was this your intention?
>> >
>> > I don't really understand what you are doing.  There is no point in using
>> > plugging for individual bios.  The  main point for raid1 writes is to gather
>> > a lot of writes together so that all multiple bitmap bits can be set all at
>> > once.
>> > It should be possible to submit individual bios directly from make_request
>> > without passing them to raid1d and without using plugging.
>> Can you pls explain how it is possible?
>> You have this code for WRITEs:
>>         cb = blk_check_plugged(raid1_unplug, mddev, sizeof(*plug));
>>         if (cb)
>>             plug = container_of(cb, struct raid1_plug_cb, cb);
>>         else
>>             plug = NULL;
>>         spin_lock_irqsave(&conf->device_lock, flags);
>>         if (plug) {
>>             bio_list_add(&plug->pending, mbio);
>>             plug->pending_cnt++;
>>         } else {
>>             bio_list_add(&conf->pending_bio_list, mbio);
>>             conf->pending_count++;
>>         }
>>         spin_unlock_irqrestore(&conf->device_lock, flags);
>>
>> If the thread blk_check_plugged returns NULL, then you always hand the
>> WRITE to raid1d. So the only option to avoid handoff to raid1d is for
>> the caller to plug. Otherwise, all WRITEs are handed off to raid1d and
>> latency becomes terrible.
>> So in my case, I use plug/unplug for individual bios only to avoid the
>> handoff to raid1d.
>> What am I missing in this analysis?
>
> if blk_check_plugged succeeds then it has arranged for raid1_unplug to be
> called a little later by that same process.
> So there is nothing to stop you calling raid1_unplug immediately.
>
> raid1_unplug essentially does:
>   bitmap_unplug()
>   generic_make_request()
>
> so you can very nearly just do that, without any plugging.
I am sorry, but I did not understand your reply. Maybe I did not
explain myself, I will try again.

I am not changing raid1.c code. I just want to avoid the handoff to
raid1d on WRITEs. According to your code, there are only two possible
flows:

Flow 1 - with plugging
# caller calls blk_start_plug
# caller calls submit_bio
# blk_check_plugged succeeds, and bio is put onto plug->pending list
# caller calls blk_finish_plug
# raid1_unplug is called in the same caller's thread, so it does
bitmap_unplug and generic_make_request

Flow 2 - without plugging
# caller calls submit_bio
# blk_check_plugged fails, and bio is put onto conf->pending_bio_list,
which means it will be submitted by raid1d

My conclusion from that: to avoid the handoff to raid1, caller always
need to plug, even if it has a single bio to submit. But you said "it
should be possible to submit individual bios directly from
make_request without passing them to raid1d and without using
plugging". So can you explain how it is possible? I prefer not to
change raid1.c code.

>
> There is a bit of extra subtlety but I can't really know how relevant that
> might be to you without actually seeing you code.
My code (in a different kernel module, not in raid1.c) is simply doing
submit_bio. I want to wrap this with plug/unplug to avoid the handoff
to raid1d and improve raid1 latency.

Thanks,
Alex.

>
> NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html