On Thu, Mar 28 2019, Song Liu wrote:

> On Thu, Mar 28, 2019 at 9:04 AM Xiao Ni <xni@xxxxxxxxxx> wrote:
>>
>> On 03/28/2019 01:20 PM, Xiao Ni wrote:
>> >
>> > On 03/28/2019 06:28 AM, NeilBrown wrote:
>> >> On Wed, Mar 27 2019, Xiao Ni wrote:
>> >>
>> >>> ----- Original Message -----
>> >>>> From: "NeilBrown" <neilb@xxxxxxxx>
>> >>>> To: "Xiao Ni" <xni@xxxxxxxxxx>, "Ming Lei" <ming.lei@xxxxxxxxxx>
>> >>>> Cc: linux-raid@xxxxxxxxxxxxxxx
>> >>>> Sent: Wednesday, March 27, 2019 8:19:28 AM
>> >>>> Subject: [PATCH 0/2] md: change approach to parallel flush requests.
>> >>>>
>> >>>> Hi Xiao,
>> >>>> I was looking at the issue of parallel flush requests being slow, as a
>> >>>> customer reported it against a kernel of ours that doesn't have your
>> >>>> patch.
>> >>>>
>> >>>> I reviewed your patch more carefully and noticed a problem.  It calls
>> >>>> submit_bio() multiple times from within a make_request_fn function.
>> >>>> Calling submit_bio() once is perfectly safe.  Calling it twice
>> >>>> requires care, and usually involves one of the calls submitting part
>> >>>> of the original, and the other resubmitting the original with that
>> >>>> part removed.  Calling it more often is rarely safe.  The problem is
>> >>>> that all these bios get queued on current->bio_list and don't even
>> >>>> start being processed until the top-level make_request_fn completes.
>> >>>> This can result in deadlocks.
>> >>>
>> >>> Hi Neil
>> >>>
>> >>> Thanks for pointing this out.
>> >>> A: The flush bio from the file system is waiting at
>> >>>    ret = q->make_request_fn(q, bio);
>> >>> B: Function md_flush_request submits flush bios to the raid member
>> >>>    disks.  These bios are added to bio_list_on_stack[0], and it
>> >>>    returns.
>> >>> A: make_request_fn returns, and generic_make_request handles those
>> >>>    bios on bio_list_on_stack[0] via q->make_request_fn(q, bio); after
>> >>>    handling those bios it returns.
>> >>>
>> >>> There is no deadlock, right?
>> >>
>> >> The deadlock would happen when md_flush_request() calls mempool_alloc()
>> >> or bio_alloc_mddev().
>> >> If the kmalloc(GFP_NOIO) fails because there is no free memory, and if
>> >> all the objects pre-allocated to the mempool are trapped on the
>> >> bio_list_on_stack queue, then this can block indefinitely.
>> >
>> > Thanks for pointing this out.
>> >
>> >>>> Also, with your patch there can be lots of flush requests all being
>> >>>> sent at much the same time, which all do essentially the same thing.
>> >>>> This is fairly wasteful.
>> >>>>
>> >>>> I came up with a different approach.  I restored the old behaviour of
>> >>>> flush requests being serialized, but now a caller will not bother
>> >>>> sending a flush if some other caller submitted a flush request while
>> >>>> it waited.  This should result in many fewer flush requests, and much
>> >>>> less delay.
>> >>>
>> >>> It can avoid sending flush bios which arrive at the same time.  The
>> >>> one which gets the lock can do the flush job.  It's really a good
>> >>> idea.  But if most flush bios are sent to the md device sequentially,
>> >>> mddev->last_flush is usually before start.  In this case there is
>> >>> still lock contention.
>> >>
>> >> I'm not convinced that the problem is lock contention.  The problem
>> >> that you originally described was excessive latency.
>> >> This was caused by all the requests being serialized.
>> >> You fixed it by allowing requests to run in parallel.
>> >>
>> >> I'm suggesting that we fix it by not running most of the requests.
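For readers skimming the archive, that last idea -- keep flushes serialized, but let a waiter skip its own flush when someone else's flush completed after it arrived -- can be sketched roughly as below. This is only an illustration of the logic under discussion, not the actual patch: the thread mentions mddev->last_flush and the per-request start time, but the flush_state struct, field names, and locking choreography here are assumptions made for the sketch.

#include <linux/ktime.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

/* Illustrative state, standing in for fields on struct mddev. */
struct flush_state {
	spinlock_t lock;
	wait_queue_head_t wait;
	bool flush_in_flight;	/* a flush is currently being issued */
	ktime_t last_flush;	/* when the most recent flush finished */
};

/*
 * Returns true if the caller must issue a flush itself, false if a
 * flush that completed after this request arrived already covers it.
 */
static bool flush_needed(struct flush_state *fs)
{
	ktime_t start = ktime_get_boottime();	/* this request's arrival */

	spin_lock_irq(&fs->lock);
	/* Sleep until no flush is running, or a newer flush has finished. */
	wait_event_lock_irq(fs->wait,
			    !fs->flush_in_flight ||
			    ktime_after(fs->last_flush, start),
			    fs->lock);
	if (ktime_after(fs->last_flush, start)) {
		/* Someone else flushed after we arrived: nothing to do. */
		spin_unlock_irq(&fs->lock);
		return false;
	}
	fs->flush_in_flight = true;	/* we will issue the flush */
	spin_unlock_irq(&fs->lock);
	return true;
}

/* Called when our flush completes: record the time, wake the waiters. */
static void flush_done(struct flush_state *fs)
{
	spin_lock_irq(&fs->lock);
	fs->last_flush = ktime_get_boottime();
	fs->flush_in_flight = false;
	spin_unlock_irq(&fs->lock);
	wake_up(&fs->wait);
}

A caller that gets false back can complete its REQ_PREFLUSH bio at once (or strip the flush flag and carry on with any data payload), which is how a burst of concurrent flushes collapses into a single one.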
>> >> The caller will have to wait for the flush to complete.  It rarely
>> >> matters if it waits in generic_make_request() (as my solution will
>> >> often do), or if it waits after generic_make_request() has completed
>> >> (as your solution does).
>> >>
>> >>> Is it better to use this patch together with my method?
>> >>>
>> >>>> Would you be able to run your same test on this version to confirm
>> >>>> that it fixes the performance issue, and if so review the code to
>> >>>> see if you are happy with it.
>> >>>
>> >>> I'll do a test with this patch.  But it's a very simple environment.
>> >>> Can you ask the customer to do the test too?
>> >>
>> >> My customer has been testing this code for about 6 weeks, but they
>> >> don't have a reliable reproducer so are not yet certain that it makes
>> >> a difference.  We do know that it doesn't make things a lot worse.
>> >> As you had a reproducer, even if it is rather simple, testing against
>> >> that would increase my confidence in that code.
>> >
>> > I'll give the test result later.
>>
>> Hi Neil
>>
>> I did a test and your patch works well in my environment.  The results are:
>>
>> 1. The latest linux stable kernel
>>
>>  Operation                  Count    AvgLat    MaxLat
>>  ----------------------------------------------------
>>  Deltree                      768    10.509    78.305
>>  Flush                    2078376     0.013    10.094
>>  Close                   21787697     0.019    18.821
>>  LockX                      96580     0.007     3.184
>>  Mkdir                        384     0.008     0.062
>>  Rename                   1255883     0.191    23.534
>>  ReadX                   46495589     0.020    14.230
>>  WriteX                  14790591     7.123    60.706
>>  Unlink                   5989118     0.440    54.551
>>  UnlockX                    96580     0.005     2.736
>>  FIND_FIRST              10393845     0.042    12.079
>>  SET_FILE_INFORMATION     2415558     0.129    10.088
>>  QUERY_FILE_INFORMATION   4711725     0.005     8.462
>>  QUERY_PATH_INFORMATION  26883327     0.032    21.715
>>  QUERY_FS_INFORMATION     4929409     0.010     8.238
>>  NTCreateX               29660080     0.100    53.268
>>
>> Throughput 1034.88 MB/sec (sync open) 128 clients 128 procs
>> max_latency=60.712 ms
>>
>> 2. With patch1 "Revert "MD: fix lock contention for flush bios""
>>
>>  Operation                  Count    AvgLat    MaxLat
>>  ----------------------------------------------------
>>  Deltree                      256     8.326    36.761
>>  Flush                     693291     3.974   180.269
>>  Close                    7266404     0.009    36.929
>>  LockX                      32160     0.006     0.840
>>  Mkdir                        128     0.008     0.021
>>  Rename                    418755     0.063    29.945
>>  ReadX                   15498708     0.007     7.216
>>  WriteX                   4932310    22.482   267.928
>>  Unlink                   1997557     0.109    47.553
>>  UnlockX                    32160     0.004     1.110
>>  FIND_FIRST               3465791     0.036     7.320
>>  SET_FILE_INFORMATION      805825     0.015     1.561
>>  QUERY_FILE_INFORMATION   1570950     0.005     2.403
>>  QUERY_PATH_INFORMATION   8965483     0.013    14.277
>>  QUERY_FS_INFORMATION     1643626     0.009     3.314
>>  NTCreateX                9892174     0.061    41.278
>>
>> Throughput 345.009 MB/sec (sync open) 128 clients 128 procs
>> max_latency=267.939 ms
>>
>> 3. With patch1 and patch2
>>
>>  Operation                  Count    AvgLat    MaxLat
>>  ----------------------------------------------------
>>  Deltree                      768     9.570    54.588
>>  Flush                    2061354     0.666    15.102
>>  Close                   21604811     0.012    25.697
>>  LockX                      95770     0.007     1.424
>>  Mkdir                        384     0.008     0.053
>>  Rename                   1245411     0.096    12.263
>>  ReadX                   46103198     0.011    12.116
>>  WriteX                  14667988     7.375    60.069
>>  Unlink                   5938936     0.173    30.905
>>  UnlockX                    95770     0.005     4.147
>>  FIND_FIRST              10306407     0.041    11.715
>>  SET_FILE_INFORMATION     2395987     0.048     7.640
>>  QUERY_FILE_INFORMATION   4672371     0.005     9.291
>>  QUERY_PATH_INFORMATION  26656735     0.018    19.719
>>  QUERY_FS_INFORMATION    4887940      0.010     7.654
>>  NTCreateX               29410811     0.059    28.551
>>
>> Throughput 1026.21 MB/sec (sync open) 128 clients 128 procs
>> max_latency=60.075 ms
>>
>> I did the test on a raid10 device created from 4 SSDs.  The tool I used
>> is dbench.
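Stepping back to the deadlock Neil raised at the top of the thread, since it is what motivates returning to serialized flushes: the hazardous shape, in a completely made-up driver, looks roughly like the sketch below. Every name here is invented for illustration; this is not code from md or from either patch.

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Invented driver state; dev->bs is assumed to be initialized
 * elsewhere with a small pool of pre-allocated bios. */
struct example_dev {
	struct bio_set bs;		/* mempool-backed bio allocator */
	int nr_members;
	struct block_device **members;
};

/*
 * A make_request_fn that submits one flush bio per member device.
 * submit_bio() here does NOT process the child; it only queues it on
 * current->bio_list, to be handled after this function returns.
 */
static blk_qc_t example_make_request(struct request_queue *q, struct bio *bio)
{
	struct example_dev *dev = q->queuedata;
	int i;

	for (i = 0; i < dev->nr_members; i++) {
		/*
		 * Under memory pressure this falls back to the mempool.
		 * If every pre-allocated bio is already parked on
		 * current->bio_list (queued by earlier iterations), this
		 * sleeps waiting for a completion that can only happen
		 * after we return -- which we never do.  Deadlock.
		 */
		struct bio *child = bio_alloc_bioset(GFP_NOIO, 0, &dev->bs);

		bio_set_dev(child, dev->members[i]);
		child->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
		bio_chain(child, bio);		/* parent completes last */
		submit_bio(child);		/* queued, not executed */
	}
	return BLK_QC_T_NONE;
}

This is why Neil notes above that calling submit_bio() once from inside a make_request_fn is safe while calling it many times rarely is: with a single submission the mempool objects can still circulate, so the allocation cannot starve itself.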
>>
>> Regards
>> Xiao
>
> Thanks Neil and Xiao! I will process the patch.

Yes, thanks a lot Xiao for testing.  The numbers do look good.
And thanks Song for processing the patches.

NeilBrown

>
> Xiao, I will add your Tested-by tag. Would you like to attach
> Reviewed-by or Acked-by as well?
>
> Song