On 4/24/22 11:43, yukuai (C) wrote:
> friendly ping ...
> 
> On 2022/04/15 18:10, Yu Kuai wrote:
>> Changes in v3:
>>  - update 'waiters_cnt' before 'ws_active' in sbitmap_prepare_to_wait()
>>    in patch 1, in case __sbq_wake_up() sees 'ws_active > 0' while
>>    'waiters_cnt' is 0 for all waitqueues, which would cause a dead loop.
>>  - don't add 'wait_index' during each loop in patch 2
>>  - fix that 'wake_index' might mismatch on the first wake-up in patch 3,
>>    and improve the coding of the patch.
>>  - add a detection in patch 4 in case an io hang is triggered in corner
>>    cases.
>>  - make the detection that free tags are sufficient more flexible.
>>  - fix a race in patch 8.
>>  - fix some words and add some comments.
>>
>> Changes in v2:
>>  - use a new title
>>  - add patches to fix waitqueues' unfairness - patch 1-3
>>  - delete patch to add queue flag
>>  - delete patch to split big io thoroughly
>>
>> In this patchset:
>>  - patch 1-3 fix waitqueues' unfairness.
>>  - patch 4,5 disable tag preemption on heavy load.
>>  - patch 6 forces tag preemption for split bios.
>>  - patch 7,8 improve large random io for HDD. We do meet the problem
>>    and I'm trying to fix it at very low cost. However, if anyone still
>>    thinks this is not a common case and not worth optimizing, I'll
>>    drop them.
>>
>> There is a defect in blk-mq compared to blk-sq: split io will end up
>> discontinuous if the device is under high io pressure, while split io
>> will still be continuous in sq. This is because:
>>
>> 1) new io can preempt a tag even if there are lots of threads waiting.
>> 2) split bios are issued one by one; if one bio can't get a tag, it
>>    will go to wait.
>> 3) each time 8 (or wake_batch) requests are done, 8 waiters will be
>>    woken up. Thus if a thread is woken up, it is unlikely to get
>>    multiple tags.
>>
>> The problem was first found by upgrading the kernel from v3.10 to
>> v4.18; the test device is an HDD with 256 'max_sectors_kb', and the
>> test case is issuing 1m ios with high concurrency.
>>
>> Note that there is a precondition for this performance problem: there
>> must be a certain gap between the bandwidth of a single io with
>> bs=max_sectors_kb and the disk's upper limit.
>>
>> During the test, I found that waitqueues can be extremely unbalanced
>> under heavy load. This is because 'wake_index' is not set properly in
>> __sbq_wake_up(), see details in patch 3.
>>
>> Test environment:
>> arm64, 96 cores with 200 BogoMIPS, test device is an HDD. The default
>> 'max_sectors_kb' is 1280 (sorry that I was unable to test on the
>> machine where 'max_sectors_kb' is 256).
>>
>> The single io performance (randwrite):
>>
>> | bs       | 128k | 256k | 512k | 1m   | 1280k | 2m   | 4m   |
>> | -------- | ---- | ---- | ---- | ---- | ----- | ---- | ---- |
>> | bw MiB/s | 20.1 | 33.4 | 51.8 | 67.1 | 74.7  | 82.9 | 82.9 |

These results are extremely strange, unless you are running with the
device write cache disabled? If you have the device write cache
enabled, the problem you mention above would most likely be completely
invisible, which I guess is why nobody really noticed any issue until
now.

Similarly, with reads, the device-side read-ahead may hide the problem,
although that depends on how "intelligent" the drive is at identifying
sequential accesses.

>> It can be seen that 1280k io is already close to the upper limit, and
>> it will be hard to see differences with the default value, thus I set
>> 'max_sectors_kb' to 128 in the following test.
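For reference, this tunable can be changed at runtime through sysfs; a
minimal sketch, where $dev is a placeholder for the disk under test:

    # cap the maximum io size so that 1m bios are split into 128k pieces
    echo 128 > /sys/block/$dev/queue/max_sectors_kb
    cat /sys/block/$dev/queue/max_sectors_kb    # verify the new limit

The value is capped by the read-only 'max_hw_sectors_kb' limit reported
by the driver.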
>> Test cmd:
>> fio \
>>  -filename=/dev/$dev \
>>  -name=test \
>>  -ioengine=psync \
>>  -allow_mounted_write=0 \
>>  -group_reporting \
>>  -direct=1 \
>>  -offset_increment=1g \
>>  -rw=randwrite \
>>  -bs=1024k \
>>  -numjobs={1,2,4,8,16,32,64,128,256,512} \
>>  -runtime=110 \
>>  -ramp_time=10
>>
>> Test result: MiB/s
>>
>> | numjobs | v5.18-rc1 | v5.18-rc1-patched |
>> | ------- | --------- | ----------------- |
>> | 1       | 67.7      | 67.7              |
>> | 2       | 67.7      | 67.7              |
>> | 4       | 67.7      | 67.7              |
>> | 8       | 67.7      | 67.7              |
>> | 16      | 64.8      | 65.6              |
>> | 32      | 59.8      | 63.8              |
>> | 64      | 54.9      | 59.4              |
>> | 128     | 49        | 56.9              |
>> | 256     | 37.7      | 58.3              |
>> | 512     | 31.8      | 57.9              |

Device write cache disabled? Also, what is the max QD of this disk?
E.g., if it is SATA, it is 32, so you will only get at most 64
scheduler tags. So for any of your tests with more than 64 threads,
many of the threads will be waiting for a scheduler tag for the BIO
before the bio_split problem you explain triggers. Given that the
numbers you show are the same before and after the patch for a number
of threads <= 64, I am tempted to think that the problem is not really
BIO splitting...

What about random read workloads? What kind of results do you see?

>> Yu Kuai (8):
>>   sbitmap: record the number of waiters for each waitqueue
>>   blk-mq: call 'bt_wait_ptr()' later in blk_mq_get_tag()
>>   sbitmap: make sure waitqueues are balanced
>>   blk-mq: don't preempt tag under heavy load
>>   sbitmap: force tag preemption if free tags are sufficient
>>   blk-mq: force tag preemption for split bios
>>   blk-mq: record how many tags are needed for splited bio
>>   sbitmap: wake up the number of threads based on required tags
>>
>>  block/blk-merge.c         |   8 +-
>>  block/blk-mq-tag.c        |  49 +++++++++----
>>  block/blk-mq.c            |  54 +++++++++++++-
>>  block/blk-mq.h            |   4 +
>>  include/linux/blk_types.h |   4 +
>>  include/linux/sbitmap.h   |   9 +++
>>  lib/sbitmap.c             | 149 +++++++++++++++++++++++++++-----------
>>  7 files changed, 216 insertions(+), 61 deletions(-)

-- 
Damien Le Moal
Western Digital Research
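For reference, the device-side parameters asked about above (write
cache state, device queue depth, and scheduler tag count) can be
checked from sysfs; a minimal sketch for a SCSI/SATA disk, where $dev
is a placeholder for the disk under test:

    cat /sys/block/$dev/queue/write_cache     # "write back" or "write through"
    cat /sys/block/$dev/device/queue_depth    # device max QD (32 for SATA NCQ)
    cat /sys/block/$dev/queue/nr_requests     # scheduler tags per queue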