Re: [PATCH 0/2] md: change approach to parallel flush requests.

On Thu, Mar 28, 2019 at 9:04 AM Xiao Ni <xni@xxxxxxxxxx> wrote:
>
>
>
> On 03/28/2019 01:20 PM, Xiao Ni wrote:
> >
> >
> > On 03/28/2019 06:28 AM, NeilBrown wrote:
> >> On Wed, Mar 27 2019, Xiao Ni wrote:
> >>
> >>> ----- Original Message -----
> >>>> From: "NeilBrown" <neilb@xxxxxxxx>
> >>>> To: "Xiao Ni" <xni@xxxxxxxxxx>, "Ming Lei" <ming.lei@xxxxxxxxxx>
> >>>> Cc: linux-raid@xxxxxxxxxxxxxxx
> >>>> Sent: Wednesday, March 27, 2019 8:19:28 AM
> >>>> Subject: [PATCH 0/2] md: change approach to parallel flush requests.
> >>>>
> >>>> Hi Xiao,
> >>>>   I was looking at the issue of parallel flush requests being slow
> >>>> as a
> >>>>   customer reported it against a kernel of ours that doesn't have your
> >>>>   patch.
> >>>>
> >>>>   I reviewed your patch more carefully and noticed a problem.  It calls
> >>>>   submit_bio() multiple times from within a make_request_fn function.
> >>>>   Calling submit_bio once is perfectly safe.  Calling it twice
> >>>> requires
> >>>>   care and usually involves one of the calls submitting part of the
> >>>>   original, and the other resubmitting the original with that part
> >>>>   removed.
> >>>>   Calling it more often is rarely safe.  The problem is that all these
> >>>>   bios get queued on current->biolist and don't even start being
> >>>>   processed until the top level make_request_fn completes. This can
> >>>>   result in deadlocks.
> >>> Hi Neil
> >>>
> >>> Thanks for pointing this out.
> >>> A: The flush bio from the file system waits at ret = q->make_request_fn(q, bio);
> >>> B: md_flush_request() submits flush bios to the raid member disks. These
> >>> bios are added to bio_list_on_stack[0] and the call returns.
> >>> A: make_request_fn returns, and generic_make_request handles the bios on
> >>> bio_list_on_stack[0] by calling q->make_request_fn(q, bio) for each; after
> >>> handling those bios it returns.
> >>>
> >>> There is no deadlock, right?
> >> The deadlock would happen when md_flush_request() calls mempool_alloc()
> >> or bio_alloc_mddev().
> >> If the kmalloc(GFP_NOIO) fails because there is no free memory, and if
> >> all the objects pre-allocated to the mempool are trapped on the
> >> bio_list_on_stack queue, then this can block indefinitely.
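
For illustration, a minimal userspace C sketch of that failure pattern. The
pool size, helper names and the 4-disk count below are invented, and this is
not the kernel code; it only models objects being "allocated" inside
make_request_fn() and parked on a local list until that function returns:

  #include <stdio.h>

  #define POOL_SIZE 2            /* pretend mempool with 2 pre-allocated bios */
  static int pool_free = POOL_SIZE;
  static int parked;             /* bios queued on current->bio_list */

  static int alloc_from_pool(void)
  {
      if (pool_free == 0) {
          /* mempool_alloc() would sleep here waiting for a free object,
           * but every object is parked and cannot complete until we
           * return: deadlock. */
          printf("pool empty, %d bios parked -> would block forever\n", parked);
          return -1;
      }
      pool_free--;
      return 0;
  }

  static void flush_all_members(int nr_disks)
  {
      /* md_flush_request()-like loop: one flush bio per member disk */
      for (int i = 0; i < nr_disks; i++) {
          if (alloc_from_pool())
              return;
          parked++;              /* submit_bio() only queues it for later */
      }
  }

  int main(void)
  {
      flush_all_members(4);      /* e.g. a 4-device raid10 */
      return 0;
  }

Run as a single program it just prints where the real code would sleep; in the
kernel, the sleep in mempool_alloc() would never end because the parked bios
are only processed after make_request_fn() returns.
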
> >
> > Thanks for pointing this out.
> >>
> >>>>   Also, with your patch there can be lots of flush requests all being
> >>>>   sent at much the same time, which all do essentially the same
> >>>>   thing.  This is fairly wasteful.
> >>>>
> >>>>   I came up with a different approach.  I restored the old behaviour of
> >>>>   flush requests being serialized, but now a caller will not bother
> >>>>   sending a flush if some other caller submitted a flush request while
> >>>>   it waited.  This should result in many fewer flush requests, and
> >>>>   much less delay.
> >>> It can avoid sending flush bios which arrive at the same time. The one
> >>> which gets the lock does the flush job. It's really a good idea. But if
> >>> most flush bios are sent to the md device sequentially, mddev->last_flush
> >>> is usually before start, so in that case there is still lock contention.
> >> I'm not convinced that the problem is lock contention.  The problem that
> >> you originally described was excessive latency.
> >> This was caused by all the requests being serialized.
> >> You fixed it by allowing requests to run in parallel.
> >>
> >> I'm suggesting that we fix it by not running most of the requests.
> >>
> >> The caller will have to wait for the flush to complete.  It rarely
> >> matters if it waits in generic_make_request() (as my solution will often
> >> do), or if it waits after generic_make_request() has completed (as your
> >> solution does).
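
For illustration, a rough userspace sketch of that "skip redundant flushes"
idea. It mirrors the mddev->last_flush-versus-start comparison discussed
above, but every identifier is invented, and a pthread mutex and
CLOCK_MONOTONIC stand in for the md lock and ktime; it is not the actual
patch:

  #include <pthread.h>
  #include <stdbool.h>
  #include <time.h>

  static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
  static struct timespec last_flush;   /* completion time of the last flush */

  static bool after(struct timespec a, struct timespec b)
  {
      return a.tv_sec > b.tv_sec ||
             (a.tv_sec == b.tv_sec && a.tv_nsec > b.tv_nsec);
  }

  static void flush_member_disks(void)
  {
      /* stand-in for sending FLUSH to every member disk and waiting */
  }

  void flush_request(void)
  {
      struct timespec start;

      clock_gettime(CLOCK_MONOTONIC, &start);

      pthread_mutex_lock(&flush_lock);
      if (!after(last_flush, start)) {
          /* no flush has completed since we arrived, so do it ourselves */
          flush_member_disks();
          clock_gettime(CLOCK_MONOTONIC, &last_flush);
      }
      /* else: a flush finished after we started waiting, so everything
       * written before our request is already on stable storage; skip. */
      pthread_mutex_unlock(&flush_lock);
  }

  int main(void)
  {
      flush_request();   /* does the real work and records the time */
      return 0;
  }

The point is the check under the lock: a caller that finds last_flush newer
than its own arrival time returns without touching the member disks at all.
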
> >>
> >>> Is it better to use this patch with my method together?
> >>>>   Would you be able to run the same test on this version to confirm
> >>>>   that it fixes the performance issue, and if so review the code to
> >>>> see
> >>>>   if you are happy with it.
> >>> I'll do a test with this patch, but it's a very simple environment. Can
> >>> you ask the customer to do the test too?
> >> My customer has been testing this code for about 6 weeks, but they don't
> >> have a reliable reproducer so are not yet certain that it makes a
> >> difference.  We do know that it doesn't make things a lot worse.
> >> As you had a reproducer, even if it is rather simple, testing against
> >> that would increase my confidence in that code.
> >
> > I'll give the test result later.
>
> Hi Neil
>
> I did a test and your patch works well in my environment. The results are:
>
> 1. The latest linux stable kernel
>   Operation                Count    AvgLat    MaxLat
>   --------------------------------------------------
>   Deltree                    768    10.509    78.305
>   Flush                  2078376     0.013    10.094
>   Close                  21787697     0.019    18.821
>   LockX                    96580     0.007     3.184
>   Mkdir                      384     0.008     0.062
>   Rename                 1255883     0.191    23.534
>   ReadX                  46495589     0.020    14.230
>   WriteX                 14790591     7.123    60.706
>   Unlink                 5989118     0.440    54.551
>   UnlockX                  96580     0.005     2.736
>   FIND_FIRST             10393845     0.042    12.079
>   SET_FILE_INFORMATION   2415558     0.129    10.088
>   QUERY_FILE_INFORMATION 4711725     0.005     8.462
>   QUERY_PATH_INFORMATION 26883327     0.032    21.715
>   QUERY_FS_INFORMATION   4929409     0.010     8.238
>   NTCreateX              29660080     0.100    53.268
>
> Throughput 1034.88 MB/sec (sync open)  128 clients  128 procs
> max_latency=60.712 ms
>
> 2. With patch1 "Revert "MD: fix lock contention for flush bios""
>   Operation                Count    AvgLat    MaxLat
>   --------------------------------------------------
>   Deltree                    256     8.326    36.761
>   Flush                   693291     3.974   180.269
>   Close                  7266404     0.009    36.929
>   LockX                    32160     0.006     0.840
>   Mkdir                      128     0.008     0.021
>   Rename                  418755     0.063    29.945
>   ReadX                  15498708     0.007     7.216
>   WriteX                 4932310    22.482   267.928
>   Unlink                 1997557     0.109    47.553
>   UnlockX                  32160     0.004     1.110
>   FIND_FIRST             3465791     0.036     7.320
>   SET_FILE_INFORMATION    805825     0.015     1.561
>   QUERY_FILE_INFORMATION 1570950     0.005     2.403
>   QUERY_PATH_INFORMATION 8965483     0.013    14.277
>   QUERY_FS_INFORMATION   1643626     0.009     3.314
>   NTCreateX              9892174     0.061    41.278
>
> Throughput 345.009 MB/sec (sync open)  128 clients  128 procs
> max_latency=267.939 ms
>
> 3. With patch1 and patch2
>   Operation                Count    AvgLat    MaxLat
>   --------------------------------------------------
>   Deltree                    768     9.570    54.588
>   Flush                  2061354     0.666    15.102
>   Close                  21604811     0.012    25.697
>   LockX                    95770     0.007     1.424
>   Mkdir                      384     0.008     0.053
>   Rename                 1245411     0.096    12.263
>   ReadX                  46103198     0.011    12.116
>   WriteX                 14667988     7.375    60.069
>   Unlink                 5938936     0.173    30.905
>   UnlockX                  95770     0.005     4.147
>   FIND_FIRST             10306407     0.041    11.715
>   SET_FILE_INFORMATION   2395987     0.048     7.640
>   QUERY_FILE_INFORMATION 4672371     0.005     9.291
>   QUERY_PATH_INFORMATION 26656735     0.018    19.719
>   QUERY_FS_INFORMATION   4887940     0.010     7.654
>   NTCreateX              29410811     0.059    28.551
>
> Throughput 1026.21 MB/sec (sync open)  128 clients  128 procs
> max_latency=60.075 ms
>
> I did the test on a raid10 device created from 4 SSDs. The tool I used is
> dbench.
>
> Regards
> Xiao

Thanks Neil and Xiao! I will process the patch.

Xiao, I will add your Tested-by tag. Would you like to attach
Reviewed-by or Acked-by as well?

Song
