On Thu, Mar 28, 2019 at 9:04 AM Xiao Ni <xni@xxxxxxxxxx> wrote:
>
> On 03/28/2019 01:20 PM, Xiao Ni wrote:
> >
> > On 03/28/2019 06:28 AM, NeilBrown wrote:
> >> On Wed, Mar 27 2019, Xiao Ni wrote:
> >>
> >>> ----- Original Message -----
> >>>> From: "NeilBrown" <neilb@xxxxxxxx>
> >>>> To: "Xiao Ni" <xni@xxxxxxxxxx>, "Ming Lei" <ming.lei@xxxxxxxxxx>
> >>>> Cc: linux-raid@xxxxxxxxxxxxxxx
> >>>> Sent: Wednesday, March 27, 2019 8:19:28 AM
> >>>> Subject: [PATCH 0/2] md: change approach to parallel flush requests.
> >>>>
> >>>> Hi Xiao,
> >>>> I was looking at the issue of parallel flush requests being slow, as a
> >>>> customer reported it against a kernel of ours that doesn't have your
> >>>> patch.
> >>>>
> >>>> I reviewed your patch more carefully and noticed a problem.  It calls
> >>>> submit_bio() multiple times from within a make_request_fn function.
> >>>> Calling submit_bio() once is perfectly safe.  Calling it twice requires
> >>>> care and usually involves one of the calls submitting part of the
> >>>> original, and the other resubmitting the original with that part
> >>>> removed.  Calling it more often is rarely safe.  The problem is that
> >>>> all these bios get queued on current->bio_list and don't even start
> >>>> being processed until the top-level make_request_fn completes.  This
> >>>> can result in deadlocks.
> >>>
> >>> Hi Neil
> >>>
> >>> Thanks for pointing this out.
> >>> A: The flush bio from the file system is waiting at
> >>>    ret = q->make_request_fn(q, bio);
> >>> B: md_flush_request() submits flush bios to the raid member disks.
> >>>    These bios are added to bio_list_on_stack[0] and it returns.
> >>> A: make_request_fn returns, and those bios on bio_list_on_stack[0] are
> >>>    handled by q->make_request_fn(q, bio); after handling those bios it
> >>>    returns.
> >>>
> >>> There is no deadlock, right?
> >> The deadlock would happen when md_flush_request() calls mempool_alloc()
> >> or bio_alloc_mddev().
> >> If the kmalloc(GFP_NOIO) fails because there is no free memory, and if
> >> all the objects pre-allocated to the mempool are trapped on the
> >> bio_list_on_stack queue, then this can block indefinitely.
> >
> > Thanks for pointing this out.
> >>
> >>>> Also, with your patch there can be lots of flush requests all being
> >>>> sent at much the same time, which all do essentially the same thing.
> >>>> This is fairly wasteful.
> >>>>
> >>>> I came up with a different approach.  I restored the old behaviour of
> >>>> flush requests being serialized, but now a caller will not bother
> >>>> sending a flush if some other caller submitted a flush request while
> >>>> it waited.  This should result in many fewer flush requests, and much
> >>>> less delay.
> >>> It can avoid sending flush bios which arrive at the same time.  The one
> >>> which gets the lock can do the flush job.  It's really a good idea.
> >>> But if most flush bios are sent to the md device sequentially,
> >>> mddev->last_flush is usually before start.  In this case there is still
> >>> lock contention.
> >> I'm not convinced that the problem is lock contention.  The problem that
> >> you originally described was excessive latency.
> >> This was caused by all the requests being serialized.
> >> You fixed it by allowing requests to run in parallel.
> >>
> >> I'm suggesting that we fix it by not running most of the requests.
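As a rough illustration of this coalescing idea (a user-space sketch only,
not the md patch itself; flush_lock, last_flush, do_flush() and the thread
setup are invented for the example): flushes stay serialized behind one
lock, but a caller that finds a flush completed after it arrived simply
returns instead of issuing its own.

    /*
     * User-space model of "skip the flush if a newer one already
     * completed".  Not kernel code; names are illustrative only.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct timespec last_flush;   /* completion time of the most recent flush */

    static int ts_after(const struct timespec *a, const struct timespec *b)
    {
            return a->tv_sec > b->tv_sec ||
                   (a->tv_sec == b->tv_sec && a->tv_nsec > b->tv_nsec);
    }

    static void do_flush(void)
    {
            /* stand-in for sending flush bios to every member disk */
            printf("flushing member devices\n");
    }

    static void flush_request(void)
    {
            struct timespec start;

            clock_gettime(CLOCK_MONOTONIC, &start);
            pthread_mutex_lock(&flush_lock);
            if (ts_after(&last_flush, &start)) {
                    /*
                     * Someone else completed a flush after we arrived,
                     * so it already covers our earlier writes; don't
                     * send another one.
                     */
                    pthread_mutex_unlock(&flush_lock);
                    return;
            }
            do_flush();
            clock_gettime(CLOCK_MONOTONIC, &last_flush);
            pthread_mutex_unlock(&flush_lock);
    }

    static void *worker(void *arg)
    {
            (void)arg;
            flush_request();
            return NULL;
    }

    int main(void)
    {
            pthread_t t[4];
            int i;

            for (i = 0; i < 4; i++)
                    pthread_create(&t[i], NULL, worker, NULL);
            for (i = 0; i < 4; i++)
                    pthread_join(t[i], NULL);
            return 0;
    }

With this shape, a burst of concurrent flush requests results in only one
(or a few) flushes actually being sent to the member disks.
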
> >> The caller will have to wait for the flush to complete.  It rarely
> >> matters if it waits in generic_make_request() (as my solution will
> >> often do), or if it waits after generic_make_request() has completed
> >> (as your solution does).
> >>
> >>> Is it better to use this patch together with my method?
> >>>> Would you be able to run your same test on this version to confirm
> >>>> that it fixes the performance issue, and if so review the code to
> >>>> see if you are happy with it?
> >>> I'll do a test with this patch.  But it's a very simple environment.
> >>> Can you ask the customer to do the test too?
> >> My customer has been testing this code for about 6 weeks, but they don't
> >> have a reliable reproducer so are not yet certain that it makes a
> >> difference.  We do know that it doesn't make things a lot worse.
> >> As you had a reproducer, even if it is rather simple, testing against
> >> that would increase my confidence in that code.
> >
> > I'll give the test result later.
>
> Hi Neil
>
> I did a test and your patch works well in my environment.  The results are:
>
> 1. The latest linux stable kernel
>  Operation                 Count    AvgLat    MaxLat
>  --------------------------------------------------
>  Deltree                     768    10.509    78.305
>  Flush                   2078376     0.013    10.094
>  Close                  21787697     0.019    18.821
>  LockX                     96580     0.007     3.184
>  Mkdir                       384     0.008     0.062
>  Rename                  1255883     0.191    23.534
>  ReadX                  46495589     0.020    14.230
>  WriteX                 14790591     7.123    60.706
>  Unlink                  5989118     0.440    54.551
>  UnlockX                   96580     0.005     2.736
>  FIND_FIRST             10393845     0.042    12.079
>  SET_FILE_INFORMATION    2415558     0.129    10.088
>  QUERY_FILE_INFORMATION  4711725     0.005     8.462
>  QUERY_PATH_INFORMATION 26883327     0.032    21.715
>  QUERY_FS_INFORMATION    4929409     0.010     8.238
>  NTCreateX              29660080     0.100    53.268
>
> Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs
> max_latency=60.712 ms
>
> 2. With patch1 "Revert "MD: fix lock contention for flush bios""
>  Operation                 Count    AvgLat    MaxLat
>  --------------------------------------------------
>  Deltree                     256     8.326    36.761
>  Flush                    693291     3.974   180.269
>  Close                   7266404     0.009    36.929
>  LockX                     32160     0.006     0.840
>  Mkdir                       128     0.008     0.021
>  Rename                   418755     0.063    29.945
>  ReadX                  15498708     0.007     7.216
>  WriteX                  4932310    22.482   267.928
>  Unlink                  1997557     0.109    47.553
>  UnlockX                   32160     0.004     1.110
>  FIND_FIRST              3465791     0.036     7.320
>  SET_FILE_INFORMATION     805825     0.015     1.561
>  QUERY_FILE_INFORMATION  1570950     0.005     2.403
>  QUERY_PATH_INFORMATION  8965483     0.013    14.277
>  QUERY_FS_INFORMATION    1643626     0.009     3.314
>  NTCreateX               9892174     0.061    41.278
>
> Throughput 345.009 MB/sec (sync open) 128 clients 128 procs
> max_latency=267.939 ms
>
> 3. With patch1 and patch2
>  Operation                 Count    AvgLat    MaxLat
>  --------------------------------------------------
>  Deltree                     768     9.570    54.588
>  Flush                   2061354     0.666    15.102
>  Close                  21604811     0.012    25.697
>  LockX                     95770     0.007     1.424
>  Mkdir                       384     0.008     0.053
>  Rename                  1245411     0.096    12.263
>  ReadX                  46103198     0.011    12.116
>  WriteX                 14667988     7.375    60.069
>  Unlink                  5938936     0.173    30.905
>  UnlockX                   95770     0.005     4.147
>  FIND_FIRST             10306407     0.041    11.715
>  SET_FILE_INFORMATION    2395987     0.048     7.640
>  QUERY_FILE_INFORMATION  4672371     0.005     9.291
>  QUERY_PATH_INFORMATION 26656735     0.018    19.719
>  QUERY_FS_INFORMATION    4887940     0.010     7.654
>  NTCreateX              29410811     0.059    28.551
>
> Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs
> max_latency=60.075 ms
>
> I did the test on a raid10 device created from 4 SSDs.  The tool I used
> is dbench.
>
> Regards
> Xiao
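
To approximate this kind of flush-heavy load without a full dbench setup,
a minimal stand-in is a few threads that rewrite a small file and fsync()
it in a loop; each fsync normally reaches the md flush path.  This is only
a sketch: /mnt/md0, NWORKERS and ROUNDS are placeholders, not values taken
from the test above.

    /* Not dbench -- just a tiny fsync-heavy load generator. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NWORKERS 8
    #define ROUNDS   1000

    static void *worker(void *arg)
    {
            char path[64], buf[4096];
            int fd, i;

            /* one scratch file per worker on the md-backed filesystem */
            snprintf(path, sizeof(path), "/mnt/md0/flush-test.%ld", (long)arg);
            memset(buf, 'x', sizeof(buf));
            fd = open(path, O_CREAT | O_WRONLY, 0644);
            if (fd < 0) {
                    perror("open");
                    return NULL;
            }
            for (i = 0; i < ROUNDS; i++) {
                    if (pwrite(fd, buf, sizeof(buf), 0) != sizeof(buf))
                            break;
                    fsync(fd);  /* typically ends up as a flush on the array */
            }
            close(fd);
            return NULL;
    }

    int main(void)
    {
            pthread_t t[NWORKERS];
            long i;

            for (i = 0; i < NWORKERS; i++)
                    pthread_create(&t[i], NULL, worker, (void *)i);
            for (i = 0; i < NWORKERS; i++)
                    pthread_join(t[i], NULL);
            return 0;
    }
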
Thanks Neil and Xiao! I will process the patch.

Xiao, I will add your Tested-by tag. Would you like to attach Reviewed-by
or Acked-by as well?

Song