On 03/28/2019 01:20 PM, Xiao Ni wrote:
On 03/28/2019 06:28 AM, NeilBrown wrote:
On Wed, Mar 27 2019, Xiao Ni wrote:
----- Original Message -----
From: "NeilBrown" <neilb@xxxxxxxx>
To: "Xiao Ni" <xni@xxxxxxxxxx>, "Ming Lei" <ming.lei@xxxxxxxxxx>
Cc: linux-raid@xxxxxxxxxxxxxxx
Sent: Wednesday, March 27, 2019 8:19:28 AM
Subject: [PATCH 0/2] md: change approach to parallel flush requests.
Hi Xiao,
I was looking at the issue of parallel flush requests being slow, as a
customer reported it against a kernel of ours that doesn't have your
patch.
I reviewed your patch more carefully and noticed a problem. It calls
submit_bio() multiple times from within a make_request_fn function.
Calling submit_bio() once is perfectly safe. Calling it twice requires
care and usually involves one of the calls submitting part of the
original bio, and the other resubmitting the original with that part
removed. Calling it more often is rarely safe. The problem is that all
these bios get queued on current->bio_list and don't even start being
processed until the top-level make_request_fn completes. This can
result in deadlocks.
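For reference, a much-simplified sketch of that deferral behaviour. The
real generic_make_request() in block/blk-core.c does considerably more
(queue entering, sorting the deferred bios into same-level and
lower-level lists), so treat this only as an illustration of why bios
submitted from inside a make_request_fn sit on current->bio_list until
the outermost call returns:

/* Simplified illustration only -- not the actual block-layer code. */
blk_qc_t generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack[2];

	if (current->bio_list) {
		/* Recursive submission: just queue the bio and return. */
		bio_list_add(&current->bio_list[0], bio);
		return BLK_QC_T_NONE;
	}

	bio_list_init(&bio_list_on_stack[0]);
	current->bio_list = bio_list_on_stack;
	do {
		struct request_queue *q = bio->bi_disk->queue;

		/*
		 * Any submit_bio() issued in here only lands on
		 * bio_list_on_stack[0]; nothing is processed yet.
		 */
		q->make_request_fn(q, bio);

		bio = bio_list_pop(&bio_list_on_stack[0]);
	} while (bio);
	current->bio_list = NULL;

	return BLK_QC_T_NONE;
}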
Hi Neil
Thanks for pointing this out.
A: The flush bio from the file system is waiting at ret =
q->make_request_fn(q, bio);
B: md_flush_request() submits flush bios to the raid member disks.
These bios are added to bio_list_on_stack[0] and it returns.
A: make_request_fn returns, and the bios on bio_list_on_stack[0] are
then handled by q->make_request_fn(q, bio); after handling those bios
it returns.
There is no deadlock, right?
The deadlock would happen when md_flush_request() calls mempool_alloc()
or bio_alloc_mddev().
If the kmalloc(GFP_NOIO) fails because there is no free memory, and if
all the objects pre-allocated to the mempool are trapped on the
bio_list_on_stack queue, then this can block indefinitely.
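A hedged sketch of the pattern being described, loosely modelled on the
flush code under discussion (flush_members_sketch is a made-up name,
and the end_io wiring and error handling are omitted; the point is only
the mempool-allocation-inside-make_request_fn trap):

static void flush_members_sketch(struct mddev *mddev, struct bio *parent)
{
	struct md_rdev *rdev;

	rdev_for_each(rdev, mddev) {
		/*
		 * bio_alloc_mddev() is backed by a mempool.  If the
		 * underlying kmalloc(GFP_NOIO) fails, mempool_alloc()
		 * sleeps until a bio from the same pool is freed ...
		 */
		struct bio *fb = bio_alloc_mddev(GFP_NOIO, 0, mddev);

		bio_set_dev(fb, rdev->bdev);
		fb->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
		fb->bi_private = parent;

		/*
		 * ... but because we are still inside ->make_request_fn,
		 * submit_bio() only adds fb to current->bio_list.  None
		 * of the earlier flush bios can be issued, complete and
		 * go back to the pool until we return, so once the pool
		 * is exhausted we sleep in bio_alloc_mddev() forever.
		 */
		submit_bio(fb);
	}
}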
Thanks for pointing this out.
Also, with your patch there can be lots of flush requests all being
sent at much the same time, which all do essentially the same
thing. This is fairly wasteful.
I came up with a different approach. I restored the old behaviour of
serializing flush requests, but now a caller will not bother sending a
flush if some other caller submitted a flush request while it waited.
This should result in many fewer flush requests, and much less delay.
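A minimal sketch of that idea, assuming a ktime-based mddev->last_flush
field that records when the most recent flush was started. This is my
reading of the approach, not necessarily the exact patch:

void md_flush_request(struct mddev *mddev, struct bio *bio)
{
	ktime_t req_start = ktime_get_boottime();

	spin_lock_irq(&mddev->lock);
	/* Wait until no flush is pending, or a newer flush was started. */
	wait_event_lock_irq(mddev->sb_wait,
			    !mddev->flush_bio ||
			    ktime_after(mddev->last_flush, req_start),
			    mddev->lock);
	if (!ktime_after(mddev->last_flush, req_start)) {
		/* Nobody flushed on our behalf, so we issue the flush. */
		mddev->flush_bio = bio;
		bio = NULL;
	}
	spin_unlock_irq(&mddev->lock);

	if (!bio) {
		/* Flush the member devices from a workqueue as before. */
		INIT_WORK(&mddev->flush_work, submit_flushes);
		queue_work(md_wq, &mddev->flush_work);
		return;
	}

	/* A flush started after we arrived already covers our data. */
	if (bio->bi_iter.bi_size == 0)
		bio_endio(bio);			/* empty flush: all done */
	else {
		bio->bi_opf &= ~REQ_PREFLUSH;	/* just write the payload */
		mddev->pers->make_request(mddev, bio);
	}
}

Presumably the completion path then stores the time that flush was
started into mddev->last_flush and wakes sb_wait, so a waiter whose
req_start is older can tell it has already been covered.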
It can avoid sending flush bios that arrive at the same time; the one
that gets the lock can do the flush job. It's really a good idea. But
if most flush bios are sent to the md device sequentially,
mddev->last_flush is usually before start. In that case there is still
lock contention.
I'm not convinced that the problem is lock contention. The problem that
you originally described was excessive latency.
This was caused by all the requests being serialized.
You fixed it by allowing requests to run in parallel.
I'm suggesting that we fix it by not running most of the requests.
The caller will have to wait for the flush to complete. It rarely
matters whether it waits in generic_make_request() (as my solution will
often do), or whether it waits after generic_make_request() has
completed (as your solution does).
Would it be better to use this patch together with my method?
Would you be able to run your same test on this version to confirm
that it fixes the performance issue, and if so review the code to see
if you are happy with it?
I'll do a test with this patch. But it's a very simple environment.
Can you ask the customer to do the test too?
My customer has been testing this code for about 6 weeks, but they don't
have a reliable reproducer so are not yet certain that it makes a
difference. We do know that it doesn't make things a lot worse.
As you had a reproducer, even if it is rather simple, testing against
that would increase my confidence in that code.
I'll give the test result later.
Hi Neil
I did a test and your patch works well in my environment. The results are:
1. The latest linux stable kernel
Operation                  Count    AvgLat    MaxLat
----------------------------------------------------
Deltree                      768    10.509    78.305
Flush                    2078376     0.013    10.094
Close                   21787697     0.019    18.821
LockX                      96580     0.007     3.184
Mkdir                        384     0.008     0.062
Rename                   1255883     0.191    23.534
ReadX                   46495589     0.020    14.230
WriteX                  14790591     7.123    60.706
Unlink                   5989118     0.440    54.551
UnlockX                    96580     0.005     2.736
FIND_FIRST              10393845     0.042    12.079
SET_FILE_INFORMATION     2415558     0.129    10.088
QUERY_FILE_INFORMATION   4711725     0.005     8.462
QUERY_PATH_INFORMATION  26883327     0.032    21.715
QUERY_FS_INFORMATION     4929409     0.010     8.238
NTCreateX               29660080     0.100    53.268

Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs  max_latency=60.712 ms
2. With patch1 "Revert "MD: fix lock contention for flush bios""
Operation                  Count    AvgLat    MaxLat
----------------------------------------------------
Deltree                      256     8.326    36.761
Flush                     693291     3.974   180.269
Close                    7266404     0.009    36.929
LockX                      32160     0.006     0.840
Mkdir                        128     0.008     0.021
Rename                    418755     0.063    29.945
ReadX                   15498708     0.007     7.216
WriteX                   4932310    22.482   267.928
Unlink                   1997557     0.109    47.553
UnlockX                    32160     0.004     1.110
FIND_FIRST               3465791     0.036     7.320
SET_FILE_INFORMATION      805825     0.015     1.561
QUERY_FILE_INFORMATION   1570950     0.005     2.403
QUERY_PATH_INFORMATION   8965483     0.013    14.277
QUERY_FS_INFORMATION     1643626     0.009     3.314
NTCreateX                9892174     0.061    41.278

Throughput 345.009 MB/sec (sync open) 128 clients 128 procs  max_latency=267.939 ms
3. With patch1 and patch2
Operation                  Count    AvgLat    MaxLat
----------------------------------------------------
Deltree                      768     9.570    54.588
Flush                    2061354     0.666    15.102
Close                   21604811     0.012    25.697
LockX                      95770     0.007     1.424
Mkdir                        384     0.008     0.053
Rename                   1245411     0.096    12.263
ReadX                   46103198     0.011    12.116
WriteX                  14667988     7.375    60.069
Unlink                   5938936     0.173    30.905
UnlockX                    95770     0.005     4.147
FIND_FIRST              10306407     0.041    11.715
SET_FILE_INFORMATION     2395987     0.048     7.640
QUERY_FILE_INFORMATION   4672371     0.005     9.291
QUERY_PATH_INFORMATION  26656735     0.018    19.719
QUERY_FS_INFORMATION     4887940     0.010     7.654
NTCreateX               29410811     0.059    28.551

Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs  max_latency=60.075 ms
I did the test on a raid10 device created from 4 SSDs. The tool I
used is dbench.
Regards
Xiao