On 03/28/2019 01:20 PM, Xiao Ni wrote:
On 03/28/2019 06:28 AM, NeilBrown wrote:
On Wed, Mar 27 2019, Xiao Ni wrote:
----- Original Message -----
From: "NeilBrown" <neilb@xxxxxxxx>
To: "Xiao Ni" <xni@xxxxxxxxxx>, "Ming Lei" <ming.lei@xxxxxxxxxx>
Cc: linux-raid@xxxxxxxxxxxxxxx
Sent: Wednesday, March 27, 2019 8:19:28 AM
Subject: [PATCH 0/2] md: change approach to parallel flush requests.
Hi Xiao,
I was looking at the issue of parallel flush requests being slow, as a
customer reported it against a kernel of ours that doesn't have your
patch.
I reviewed your patch more carefully and noticed a problem. It calls
submit_bio() multiple times from within a make_request_fn function.
Calling submit_bio() once is perfectly safe. Calling it twice requires
care and usually involves one of the calls submitting part of the
original bio, and the other resubmitting the original with that part
removed. Calling it more often is rarely safe. The problem is that all
these bios get queued on current->bio_list and don't even start being
processed until the top-level make_request_fn completes. This can
result in deadlocks.
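For reference, a much-simplified sketch of that deferral behaviour. The
real generic_make_request() in block/blk-core.c does considerably more
(queue entering, sorting the deferred bios into same-level and
lower-level lists), so treat this only as an illustration of why bios
submitted from inside a make_request_fn sit on current->bio_list until
the outermost call returns:

/* Simplified illustration only -- not the actual block-layer code. */
blk_qc_t generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack[2];

	if (current->bio_list) {
		/* Recursive submission: just queue the bio and return. */
		bio_list_add(&current->bio_list[0], bio);
		return BLK_QC_T_NONE;
	}

	bio_list_init(&bio_list_on_stack[0]);
	current->bio_list = bio_list_on_stack;
	do {
		struct request_queue *q = bio->bi_disk->queue;

		/*
		 * Any submit_bio() issued in here only lands on
		 * bio_list_on_stack[0]; nothing is processed yet.
		 */
		q->make_request_fn(q, bio);

		bio = bio_list_pop(&bio_list_on_stack[0]);
	} while (bio);
	current->bio_list = NULL;

	return BLK_QC_T_NONE;
}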
Hi Neil
Thanks for pointing this out.
A: The flush bio from the file system is waiting at ret =
q->make_request_fn(q, bio);
B: md_flush_request() submits flush bios to the raid member disks.
These bios are added to bio_list_on_stack[0] and it returns.
A: make_request_fn returns, and the bios on bio_list_on_stack[0] are
then handled by q->make_request_fn(q, bio); after handling those bios
it returns.
There is no deadlock, right?
The deadlock would happen when md_flush_request() calls mempool_alloc()
or bio_alloc_mddev().
If the kmalloc(GFP_NOIO) fails because there is no free memory, and if
all the objects pre-allocated to the mempool are trapped on the
bio_list_on_stack queue, then this can block indefinitely.
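A hedged sketch of the pattern being described, loosely modelled on the
flush code under discussion (flush_members_sketch is a made-up name,
and the end_io wiring and error handling are omitted; the point is only
the mempool-allocation-inside-make_request_fn trap):

static void flush_members_sketch(struct mddev *mddev, struct bio *parent)
{
	struct md_rdev *rdev;

	rdev_for_each(rdev, mddev) {
		/*
		 * bio_alloc_mddev() is backed by a mempool.  If the
		 * underlying kmalloc(GFP_NOIO) fails, mempool_alloc()
		 * sleeps until a bio from the same pool is freed ...
		 */
		struct bio *fb = bio_alloc_mddev(GFP_NOIO, 0, mddev);

		bio_set_dev(fb, rdev->bdev);
		fb->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
		fb->bi_private = parent;

		/*
		 * ... but because we are still inside ->make_request_fn,
		 * submit_bio() only adds fb to current->bio_list.  None
		 * of the earlier flush bios can be issued, complete and
		 * go back to the pool until we return, so once the pool
		 * is exhausted we sleep in bio_alloc_mddev() forever.
		 */
		submit_bio(fb);
	}
}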
Thanks for pointing this out.
Also, with your patch there can be lots of flush requests all being
sent at much the same time, which all do essentially the same
thing. This is fairly wasteful.
I came up with a different approach. I restored the old behaviour of
serializing flush requests, but now a caller will not bother sending a
flush if some other caller submitted a flush request while it waited.
This should result in many fewer flush requests, and much less delay.
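A minimal sketch of that idea, assuming a ktime-based mddev->last_flush
field that records when the most recent flush was started. This is my
reading of the approach, not necessarily the exact patch:

void md_flush_request(struct mddev *mddev, struct bio *bio)
{
	ktime_t req_start = ktime_get_boottime();

	spin_lock_irq(&mddev->lock);
	/* Wait until no flush is pending, or a newer flush was started. */
	wait_event_lock_irq(mddev->sb_wait,
			    !mddev->flush_bio ||
			    ktime_after(mddev->last_flush, req_start),
			    mddev->lock);
	if (!ktime_after(mddev->last_flush, req_start)) {
		/* Nobody flushed on our behalf, so we issue the flush. */
		mddev->flush_bio = bio;
		bio = NULL;
	}
	spin_unlock_irq(&mddev->lock);

	if (!bio) {
		/* Flush the member devices from a workqueue as before. */
		INIT_WORK(&mddev->flush_work, submit_flushes);
		queue_work(md_wq, &mddev->flush_work);
		return;
	}

	/* A flush started after we arrived already covers our data. */
	if (bio->bi_iter.bi_size == 0)
		bio_endio(bio);			/* empty flush: all done */
	else {
		bio->bi_opf &= ~REQ_PREFLUSH;	/* just write the payload */
		mddev->pers->make_request(mddev, bio);
	}
}

Presumably the completion path then stores the time that flush was
started into mddev->last_flush and wakes sb_wait, so a waiter whose
req_start is older can tell it has already been covered.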
It can avoid sending flush bios that arrive at the same time; the one
that gets the lock can do the flush job. It's really a good idea. But
if most flush bios are sent to the md device sequentially,
mddev->last_flush is usually before start. In that case there is still
lock contention.
I'm not convinced that the problem is lock contention. The problem that
you originally described was excessive latency.
This was caused by all the requests being serialized.
You fixed it by allowing requests to run in parallel.
I'm suggesting that we fix it by not running most of the requests.
The caller will have to wait for the flush to complete. It rarely
matters whether it waits in generic_make_request() (as my solution will
often do), or whether it waits after generic_make_request() has
completed (as your solution does).
Would it be better to use this patch together with my method?
Would you be able to run your same test on this version to confirm
that it fixes the performance issue, and if so review the code to see
if you are happy with it?
I'll do a test with this patch. But it's a very simple environment.
Can you ask the customer to do the test too?
My customer has been testing this code for about 6 weeks, but they don't
have a reliable reproducer so are not yet certain that it makes a
difference. We do know that it doesn't make things a lot worse.
As you had a reproducer, even if it is rather simple, testing against
that would increase my confidence in that code.
I'll give the test result later.
Hi Neil
I did a test and your patch works well in my environment. The results are:
1. The latest linux stable kernel
Operation                  Count    AvgLat    MaxLat
----------------------------------------------------
Deltree                      768    10.509    78.305
Flush                    2078376     0.013    10.094
Close                   21787697     0.019    18.821
LockX                      96580     0.007     3.184
Mkdir                        384     0.008     0.062
Rename                   1255883     0.191    23.534
ReadX                   46495589     0.020    14.230
WriteX                  14790591     7.123    60.706
Unlink                   5989118     0.440    54.551
UnlockX                    96580     0.005     2.736
FIND_FIRST              10393845     0.042    12.079
SET_FILE_INFORMATION     2415558     0.129    10.088
QUERY_FILE_INFORMATION   4711725     0.005     8.462
QUERY_PATH_INFORMATION  26883327     0.032    21.715
QUERY_FS_INFORMATION     4929409     0.010     8.238
NTCreateX               29660080     0.100    53.268

Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs  max_latency=60.712 ms
2. With patch1 "Revert "MD: fix lock contention for flush bios""
Operation                  Count    AvgLat    MaxLat
----------------------------------------------------
Deltree                      256     8.326    36.761
Flush                     693291     3.974   180.269
Close                    7266404     0.009    36.929
LockX                      32160     0.006     0.840
Mkdir                        128     0.008     0.021
Rename                    418755     0.063    29.945
ReadX                   15498708     0.007     7.216
WriteX                   4932310    22.482   267.928
Unlink                   1997557     0.109    47.553
UnlockX                    32160     0.004     1.110
FIND_FIRST               3465791     0.036     7.320
SET_FILE_INFORMATION      805825     0.015     1.561
QUERY_FILE_INFORMATION   1570950     0.005     2.403
QUERY_PATH_INFORMATION   8965483     0.013    14.277
QUERY_FS_INFORMATION     1643626     0.009     3.314
NTCreateX                9892174     0.061    41.278

Throughput 345.009 MB/sec (sync open) 128 clients 128 procs  max_latency=267.939 ms
3. With patch1 and patch2
Operation                  Count    AvgLat    MaxLat
----------------------------------------------------
Deltree                      768     9.570    54.588
Flush                    2061354     0.666    15.102
Close                   21604811     0.012    25.697
LockX                      95770     0.007     1.424
Mkdir                        384     0.008     0.053
Rename                   1245411     0.096    12.263
ReadX                   46103198     0.011    12.116
WriteX                  14667988     7.375    60.069
Unlink                   5938936     0.173    30.905
UnlockX                    95770     0.005     4.147
FIND_FIRST              10306407     0.041    11.715
SET_FILE_INFORMATION     2395987     0.048     7.640
QUERY_FILE_INFORMATION   4672371     0.005     9.291
QUERY_PATH_INFORMATION  26656735     0.018    19.719
QUERY_FS_INFORMATION     4887940     0.010     7.654
NTCreateX               29410811     0.059    28.551

Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs  max_latency=60.075 ms
I did the test on a raid10 device created from 4 SSDs. The tool I
used is dbench.
Regards
Xiao