On 4/25/22 16:05, yukuai (C) wrote: > 在 2022/04/25 14:50, Damien Le Moal 写道: >> On 4/25/22 15:47, yukuai (C) wrote: >>> 在 2022/04/25 14:23, Damien Le Moal 写道: >>>> On 4/25/22 15:14, yukuai (C) wrote: >>>>> 在 2022/04/25 11:24, Damien Le Moal 写道: >>>>>> On 4/24/22 11:43, yukuai (C) wrote: >>>>>>> friendly ping ... >>>>>>> >>>>>>> 在 2022/04/15 18:10, Yu Kuai 写道: >>>>>>>> Changes in v3: >>>>>>>> - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait() >>>>>>>> in patch 1, in case __sbq_wake_up() see 'ws_active > 0' while >>>>>>>> 'waiters_cnt' are all 0, which will cause deap loop. >>>>>>>> - don't add 'wait_index' during each loop in patch 2 >>>>>>>> - fix that 'wake_index' might mismatch in the first wake up in patch 3, >>>>>>>> also improving coding for the patch. >>>>>>>> - add a detection in patch 4 in case io hung is triggered in corner >>>>>>>> cases. >>>>>>>> - make the detection, free tags are sufficient, more flexible. >>>>>>>> - fix a race in patch 8. >>>>>>>> - fix some words and add some comments. >>>>>>>> >>>>>>>> Changes in v2: >>>>>>>> - use a new title >>>>>>>> - add patches to fix waitqueues' unfairness - path 1-3 >>>>>>>> - delete patch to add queue flag >>>>>>>> - delete patch to split big io thoroughly >>>>>>>> >>>>>>>> In this patchset: >>>>>>>> - patch 1-3 fix waitqueues' unfairness. >>>>>>>> - patch 4,5 disable tag preemption on heavy load. >>>>>>>> - patch 6 forces tag preemption for split bios. >>>>>>>> - patch 7,8 improve large random io for HDD. We do meet the problem and >>>>>>>> I'm trying to fix it at very low cost. However, if anyone still thinks >>>>>>>> this is not a common case and not worth to optimize, I'll drop them. >>>>>>>> >>>>>>>> There is a defect for blk-mq compare to blk-sq, specifically split io >>>>>>>> will end up discontinuous if the device is under high io pressure, while >>>>>>>> split io will still be continuous in sq, this is because: >>>>>>>> >>>>>>>> 1) new io can preempt tag even if there are lots of threads waiting. >>>>>>>> 2) split bio is issued one by one, if one bio can't get tag, it will go >>>>>>>> to wail. >>>>>>>> 3) each time 8(or wake batch) requests is done, 8 waiters will be woken up. >>>>>>>> Thus if a thread is woken up, it will unlikey to get multiple tags. >>>>>>>> >>>>>>>> The problem was first found by upgrading kernel from v3.10 to v4.18, >>>>>>>> test device is HDD with 256 'max_sectors_kb', and test case is issuing 1m >>>>>>>> ios with high concurrency. >>>>>>>> >>>>>>>> Noted that there is a precondition for such performance problem: >>>>>>>> There is a certain gap between bandwidth for single io with >>>>>>>> bs=max_sectors_kb and disk upper limit. >>>>>>>> >>>>>>>> During the test, I found that waitqueues can be extremly unbalanced on >>>>>>>> heavy load. This is because 'wake_index' is not set properly in >>>>>>>> __sbq_wake_up(), see details in patch 3. >>>>>>>> >>>>>>>> Test environment: >>>>>>>> arm64, 96 core with 200 BogoMIPS, test device is HDD. The default >>>>>>>> 'max_sectors_kb' is 1280(Sorry that I was unable to test on the machine >>>>>>>> where 'max_sectors_kb' is 256).>> >>>>>>>> The single io performance(randwrite): >>>>>>>> >>>>>>>> | bs | 128k | 256k | 512k | 1m | 1280k | 2m | 4m | >>>>>>>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- | >>>>>>>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7 | 82.9 | 82.9 | >>>>>> >>>>>> These results are extremely strange, unless you are running with the >>>>>> device write cache disabled ? If you have the device write cache enabled, >>>>>> the problem you mention above would be most likely completely invisible, >>>>>> which I guess is why nobody really noticed any issue until now. >>>>>> >>>>>> Similarly, with reads, the device side read-ahead may hide the problem, >>>>>> albeit that depends on how "intelligent" the drive is at identifying >>>>>> sequential accesses. >>>>>> >>>>>>>> >>>>>>>> It can be seen that 1280k io is already close to upper limit, and it'll >>>>>>>> be hard to see differences with the default value, thus I set >>>>>>>> 'max_sectors_kb' to 128 in the following test. >>>>>>>> >>>>>>>> Test cmd: >>>>>>>> fio \ >>>>>>>> -filename=/dev/$dev \ >>>>>>>> -name=test \ >>>>>>>> -ioengine=psync \ >>>>>>>> -allow_mounted_write=0 \ >>>>>>>> -group_reporting \ >>>>>>>> -direct=1 \ >>>>>>>> -offset_increment=1g \ >>>>>>>> -rw=randwrite \ >>>>>>>> -bs=1024k \ >>>>>>>> -numjobs={1,2,4,8,16,32,64,128,256,512} \ >>>>>>>> -runtime=110 \ >>>>>>>> -ramp_time=10 >>>>>>>> >>>>>>>> Test result: MiB/s >>>>>>>> >>>>>>>> | numjobs | v5.18-rc1 | v5.18-rc1-patched | >>>>>>>> | ------- | --------- | ----------------- | >>>>>>>> | 1 | 67.7 | 67.7 | >>>>>>>> | 2 | 67.7 | 67.7 | >>>>>>>> | 4 | 67.7 | 67.7 | >>>>>>>> | 8 | 67.7 | 67.7 | >>>>>>>> | 16 | 64.8 | 65.6 | >>>>>>>> | 32 | 59.8 | 63.8 | >>>>>>>> | 64 | 54.9 | 59.4 | >>>>>>>> | 128 | 49 | 56.9 | >>>>>>>> | 256 | 37.7 | 58.3 | >>>>>>>> | 512 | 31.8 | 57.9 | >>>>>> >>>>>> Device write cache disabled ? >>>>>> >>>>>> Also, what is the max QD of this disk ? >>>>>> >>>>>> E.g., if it is SATA, it is 32, so you will only get at most 64 scheduler >>>>>> tags. So for any of your tests with more than 64 threads, many of the >>>>>> threads will be waiting for a scheduler tag for the BIO before the >>>>>> bio_split problem you explain triggers. Given that the numbers you show >>>>>> are the same for before-after patch with a number of threads <= 64, I am >>>>>> tempted to think that the problem is not really BIO splitting... >>>>>> >>>>>> What about random read workloads ? What kind of results do you see ? >>>>> >>>>> Hi, >>>>> >>>>> Sorry about the misleading of this test case. >>>>> >>>>> This testcase is high concurrency huge randwrite, it's just for the >>>>> problem that split bios won't be issued continuously, which is the >>>>> root cause of the performance degradation as the numjobs increases. >>>>> >>>>> queue_depth is 32, and numjobs is 64, thus when numjobs is not greater >>>>> than 8, performance is fine, because the ratio of sequential io should >>>>> be 7/8. However, as numjobs increases, performance is worse because >>>>> the ratio is lower. For example, when numjobs is 512, the ratio of >>>>> sequential io is about 20%. >>>> >>>> But with 512 jobs, you will get only 64 jobs only with IOs in the queue. >>>> All other jobs will be waiting for a scheduler tag before being able to >>>> issue their large BIO. No ? >>> >>> Hi, >>> >>> It's right. >>> >>> In fact, after this patchset, since each large io will need total 8 >>> tags, only 8 jobs can be in the queue while others are waiting for >>> scheduler tag. >>> >>>> >>>> It sounds like the set of scheduler tags should be a bit more elastic: >>>> always allow BIOs from a split of a large BIO to be submitted (that is to >>>> get a scheduler tag) even if that causes a temporary excess of the number >>>> of requests beyond the default number of scheduler tags. Doing so, all >>>> fragments of a large BIOs can be queued immediately. From there, if the >>>> scheduler operates correctly, all the requests from the large BIOs split >>>> would be issued in sequence to the device. >>> >>> This solution sounds feasible in theory, however, I'm not sure yet how >>> to implement that 'temporary excess'. >> >> It should not be too hard. > > I'll try to figure out a proper way, in the meantime, any suggestions > would be appreciated. >> >> By the way, did you check that doing something like: >> >> echo 2048 > /sys/block/sdX/queue/nr_requests >> >> improves performance for your high number of jobs test case ? > > Yes, performance will not degrade when numjobs is not greater than 256 > in this case. That is my thinking as well. I am asking if did check that (did you run it ?). >> >>> >>> Thanks, >>> Kuai >>>> >>>> >>>>> >>>>> patch 6-8 will let split bios still be issued continuously under high >>>>> pressure. >>>>> >>>>> Thanks, >>>>> Kuai >>>>> >>>> >>>> >> >> -- Damien Le Moal Western Digital Research