On Thu, Mar 28 2019, Song Liu wrote:

> On Thu, Mar 28, 2019 at 9:04 AM Xiao Ni <xni@xxxxxxxxxx> wrote:
>>
>> On 03/28/2019 01:20 PM, Xiao Ni wrote:
>> >
>> > On 03/28/2019 06:28 AM, NeilBrown wrote:
>> >> On Wed, Mar 27 2019, Xiao Ni wrote:
>> >>
>> >>> ----- Original Message -----
>> >>>> From: "NeilBrown" <neilb@xxxxxxxx>
>> >>>> To: "Xiao Ni" <xni@xxxxxxxxxx>, "Ming Lei" <ming.lei@xxxxxxxxxx>
>> >>>> Cc: linux-raid@xxxxxxxxxxxxxxx
>> >>>> Sent: Wednesday, March 27, 2019 8:19:28 AM
>> >>>> Subject: [PATCH 0/2] md: change approach to parallel flush requests.
>> >>>>
>> >>>> Hi Xiao,
>> >>>> I was looking at the issue of parallel flush requests being slow, as a
>> >>>> customer reported it against a kernel of ours that doesn't have your
>> >>>> patch.
>> >>>>
>> >>>> I reviewed your patch more carefully and noticed a problem.  It calls
>> >>>> submit_bio() multiple times from within a make_request_fn function.
>> >>>> Calling submit_bio() once is perfectly safe.  Calling it twice
>> >>>> requires care, and usually involves one of the calls submitting part
>> >>>> of the original, and the other resubmitting the original with that
>> >>>> part removed.  Calling it more often is rarely safe.  The problem is
>> >>>> that all these bios get queued on current->bio_list and don't even
>> >>>> start being processed until the top-level make_request_fn completes.
>> >>>> This can result in deadlocks.
>> >>>
>> >>> Hi Neil
>> >>>
>> >>> Thanks for pointing this out.
>> >>> A: The flush bio from the file system is waiting at
>> >>>    ret = q->make_request_fn(q, bio);
>> >>> B: Function md_flush_request submits flush bios to the raid member
>> >>>    disks.  These bios are added to bio_list_on_stack[0], and it
>> >>>    returns.
>> >>> A: make_request_fn returns, and generic_make_request handles those
>> >>>    bios on bio_list_on_stack[0] via q->make_request_fn(q, bio); after
>> >>>    handling those bios it returns.
>> >>>
>> >>> There is no deadlock, right?
>> >>
>> >> The deadlock would happen when md_flush_request() calls mempool_alloc()
>> >> or bio_alloc_mddev().
>> >> If the kmalloc(GFP_NOIO) fails because there is no free memory, and if
>> >> all the objects pre-allocated to the mempool are trapped on the
>> >> bio_list_on_stack queue, then this can block indefinitely.
>> >
>> > Thanks for pointing this out.
>> >
>> >>>> Also, with your patch there can be lots of flush requests all being
>> >>>> sent at much the same time, which all do essentially the same thing.
>> >>>> This is fairly wasteful.
>> >>>>
>> >>>> I came up with a different approach.  I restored the old behaviour of
>> >>>> flush requests being serialized, but now a caller will not bother
>> >>>> sending a flush if some other caller submitted a flush request while
>> >>>> it waited.  This should result in many fewer flush requests, and much
>> >>>> less delay.
>> >>>
>> >>> It can avoid sending flush bios which arrive at the same time.  The
>> >>> one which gets the lock can do the flush job.  It's really a good
>> >>> idea.  But if most flush bios are sent to the md device sequentially,
>> >>> mddev->last_flush is usually before start.  In this case there is
>> >>> still lock contention.
>> >>
>> >> I'm not convinced that the problem is lock contention.  The problem
>> >> that you originally described was excessive latency.
>> >> This was caused by all the requests being serialized.
>> >> You fixed it by allowing requests to run in parallel.
>> >>
>> >> I'm suggesting that we fix it by not running most of the requests.
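For readers skimming the archive, that last idea -- keep flushes serialized, but let a waiter skip its own flush when someone else's flush completed after it arrived -- can be sketched roughly as below. This is only an illustration of the logic under discussion, not the actual patch: the thread mentions mddev->last_flush and the per-request start time, but the flush_state struct, field names, and locking choreography here are assumptions made for the sketch.

#include <linux/ktime.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Illustrative state, standing in for fields on struct mddev. */
struct flush_state {
	spinlock_t lock;
	wait_queue_head_t wait;
	bool flush_in_flight;	/* a flush is currently being issued */
	ktime_t last_flush;	/* when the most recent flush finished */
};

/*
 * Returns true if the caller must issue a flush itself, false if a
 * flush that completed after this request arrived already covers it.
 */
static bool flush_needed(struct flush_state *fs)
{
	ktime_t start = ktime_get_boottime();	/* this request's arrival */

	spin_lock_irq(&fs->lock);
	/* Sleep until no flush is running, or a newer flush has finished. */
	wait_event_lock_irq(fs->wait,
			    !fs->flush_in_flight ||
			    ktime_after(fs->last_flush, start),
			    fs->lock);
	if (ktime_after(fs->last_flush, start)) {
		/* Someone else flushed after we arrived: nothing to do. */
		spin_unlock_irq(&fs->lock);
		return false;
	}
	fs->flush_in_flight = true;	/* we will issue the flush */
	spin_unlock_irq(&fs->lock);
	return true;
}

/* Called when our flush completes: record the time, wake the waiters. */
static void flush_done(struct flush_state *fs)
{
	spin_lock_irq(&fs->lock);
	fs->last_flush = ktime_get_boottime();
	fs->flush_in_flight = false;
	spin_unlock_irq(&fs->lock);
	wake_up(&fs->wait);
}

A caller that gets false back can complete its REQ_PREFLUSH bio at once (or strip the flush flag and carry on with any data payload), which is how a burst of concurrent flushes collapses into a single one.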
>> >> The caller will have to wait for the flush to complete.  It rarely
>> >> matters if it waits in generic_make_request() (as my solution will
>> >> often do), or if it waits after generic_make_request() has completed
>> >> (as your solution does).
>> >>
>> >>> Is it better to use this patch together with my method?
>> >>>
>> >>>> Would you be able to run your same test on this version to confirm
>> >>>> that it fixes the performance issue, and if so review the code to
>> >>>> see if you are happy with it.
>> >>>
>> >>> I'll do a test with this patch.  But it's a very simple environment.
>> >>> Can you ask the customer to do the test too?
>> >>
>> >> My customer has been testing this code for about 6 weeks, but they
>> >> don't have a reliable reproducer so are not yet certain that it makes
>> >> a difference.  We do know that it doesn't make things a lot worse.
>> >> As you had a reproducer, even if it is rather simple, testing against
>> >> that would increase my confidence in that code.
>> >
>> > I'll give the test result later.
>>
>> Hi Neil
>>
>> I did a test and your patch works well in my environment.  The results are:
>>
>> 1. The latest linux stable kernel
>>
>>  Operation                  Count    AvgLat    MaxLat
>>  ----------------------------------------------------
>>  Deltree                      768    10.509    78.305
>>  Flush                    2078376     0.013    10.094
>>  Close                   21787697     0.019    18.821
>>  LockX                      96580     0.007     3.184
>>  Mkdir                        384     0.008     0.062
>>  Rename                   1255883     0.191    23.534
>>  ReadX                   46495589     0.020    14.230
>>  WriteX                  14790591     7.123    60.706
>>  Unlink                   5989118     0.440    54.551
>>  UnlockX                    96580     0.005     2.736
>>  FIND_FIRST              10393845     0.042    12.079
>>  SET_FILE_INFORMATION     2415558     0.129    10.088
>>  QUERY_FILE_INFORMATION   4711725     0.005     8.462
>>  QUERY_PATH_INFORMATION  26883327     0.032    21.715
>>  QUERY_FS_INFORMATION     4929409     0.010     8.238
>>  NTCreateX               29660080     0.100    53.268
>>
>> Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs
>> max_latency=60.712 ms
>>
>> 2. With patch1 "Revert "MD: fix lock contention for flush bios""
>>
>>  Operation                  Count    AvgLat    MaxLat
>>  ----------------------------------------------------
>>  Deltree                      256     8.326    36.761
>>  Flush                     693291     3.974   180.269
>>  Close                    7266404     0.009    36.929
>>  LockX                      32160     0.006     0.840
>>  Mkdir                        128     0.008     0.021
>>  Rename                    418755     0.063    29.945
>>  ReadX                   15498708     0.007     7.216
>>  WriteX                   4932310    22.482   267.928
>>  Unlink                   1997557     0.109    47.553
>>  UnlockX                    32160     0.004     1.110
>>  FIND_FIRST               3465791     0.036     7.320
>>  SET_FILE_INFORMATION      805825     0.015     1.561
>>  QUERY_FILE_INFORMATION   1570950     0.005     2.403
>>  QUERY_PATH_INFORMATION   8965483     0.013    14.277
>>  QUERY_FS_INFORMATION     1643626     0.009     3.314
>>  NTCreateX                9892174     0.061    41.278
>>
>> Throughput 345.009 MB/sec (sync open) 128 clients 128 procs
>> max_latency=267.939 ms
>>
>> 3. With patch1 and patch2
>>
>>  Operation                  Count    AvgLat    MaxLat
>>  ----------------------------------------------------
>>  Deltree                      768     9.570    54.588
>>  Flush                    2061354     0.666    15.102
>>  Close                   21604811     0.012    25.697
>>  LockX                      95770     0.007     1.424
>>  Mkdir                        384     0.008     0.053
>>  Rename                   1245411     0.096    12.263
>>  ReadX                   46103198     0.011    12.116
>>  WriteX                  14667988     7.375    60.069
>>  Unlink                   5938936     0.173    30.905
>>  UnlockX                    95770     0.005     4.147
>>  FIND_FIRST              10306407     0.041    11.715
>>  SET_FILE_INFORMATION     2395987     0.048     7.640
>>  QUERY_FILE_INFORMATION   4672371     0.005     9.291
>>  QUERY_PATH_INFORMATION  26656735     0.018    19.719
>>  QUERY_FS_INFORMATION    4887940      0.010     7.654
>>  NTCreateX               29410811     0.059    28.551
>>
>> Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs
>> max_latency=60.075 ms
>>
>> I did the test on a raid10 device created from 4 SSDs.  The tool I used
>> is dbench.
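Stepping back to the deadlock Neil raised at the top of the thread, since it is what motivates returning to serialized flushes: the hazardous shape, in a completely made-up driver, looks roughly like the sketch below. Every name here is invented for illustration; this is not code from md or from either patch.

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Invented driver state; dev->bs is assumed to be initialized
 * elsewhere with a small pool of pre-allocated bios. */
struct example_dev {
	struct bio_set bs;		/* mempool-backed bio allocator */
	int nr_members;
	struct block_device **members;
};

/*
 * A make_request_fn that submits one flush bio per member device.
 * submit_bio() here does NOT process the child; it only queues it on
 * current->bio_list, to be handled after this function returns.
 */
static blk_qc_t example_make_request(struct request_queue *q, struct bio *bio)
{
	struct example_dev *dev = q->queuedata;
	int i;

	for (i = 0; i < dev->nr_members; i++) {
		/*
		 * Under memory pressure this falls back to the mempool.
		 * If every pre-allocated bio is already parked on
		 * current->bio_list (queued by earlier iterations), this
		 * sleeps waiting for a completion that can only happen
		 * after we return -- which we never do.  Deadlock.
		 */
		struct bio *child = bio_alloc_bioset(GFP_NOIO, 0, &dev->bs);

		bio_set_dev(child, dev->members[i]);
		child->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
		bio_chain(child, bio);		/* parent completes last */
		submit_bio(child);		/* queued, not executed */
	}
	return BLK_QC_T_NONE;
}

This is why Neil notes above that calling submit_bio() once from inside a make_request_fn is safe while calling it many times rarely is: with a single submission the mempool objects can still circulate, so the allocation cannot starve itself.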
>>
>> Regards
>> Xiao
>
> Thanks Neil and Xiao! I will process the patch.

Yes, thanks a lot Xiao for testing.  The numbers do look good.
And thanks Song for processing the patches.

NeilBrown

>
> Xiao, I will add your Tested-by tag. Would you like to attach
> Reviewed-by or Acked-by as well?
>
> Song