Hi folks,

We noticed a thirty percent drop in performance on one of our RAID arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like striping to balance (by time) access to a pair of hardware RAID-6 arrays. The underlying issue is also present in the native raid0 driver, so herewith the gory details; I'd appreciate your thoughts.

--

blkdev_direct_IO() calls submit_bio(), which calls an outermost generic_make_request() (aka submit_bio_noacct()).

md_make_request() calls blk_queue_split(), which cuts an incoming request into two parts, with the first no larger than get_max_io_size() sectors (which, in the case of raid0, is the chunk size):

    R -> AB

blk_queue_split() gives the second part 'B' to generic_make_request() to worry about later, and returns the first part 'A'. md_make_request() then passes 'A' to a more specific request handler, in this case raid0_make_request().

raid0_make_request() cuts its incoming request into two parts at the next chunk boundary:

    A -> ab

It then fixes up the device (chooses a physical device) for 'a', and gives both parts, separately, to generic_make_request().

This is where things go awry: 'b' is still targeted at the original device (the same as 'B'), but 'B' was queued before 'b'. So we end up with:

    R -> Bab

The outermost generic_make_request() then cuts 'B' at get_max_io_size(), and the process repeats. ASCII art follows:

    /---------------------------------------------------/           incoming rq
    /--------/--------/--------/--------/--------/------/           max_io_size
|--------|--------|--------|--------|--------|--------|--------|    chunks
|...=====|---=====|---=====|---=====|---=====|---=====|--......|    rq out
      a    b   c    d   e    f   g    h   i    j   k   l

Actual submission order for a two-disk raid0: 'aeilhd' and 'cgkjfb'.

--

There are several potential fixes:

- simplest is to set the raid0 blk_queue_max_hw_sectors() limit to UINT_MAX instead of the chunk size, so that raid0_make_request() receives the entire transfer and cuts it up at chunk boundaries itself (a rough sketch follows below my sig);

- neatest is for raid0_make_request() to recognise that 'b' doesn't cross a chunk boundary, so it can be sent directly to the physical device;

- and correct is for blk_queue_split() to requeue 'A' before 'B'.

--

There's also a second issue: with a large raid0 chunk size (256K), the segments submitted to the physical device are at least 128K, which triggers the early-unplug code in blk_mq_make_request(), so the requests are never merged. There are legitimate reasons for a large chunk size, so this seems unhelpful.

--

As I said, I'd appreciate your thoughts.

--
Roger
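
P.S. For concreteness, the first option amounts to a one-line change in raid0_run(). This is only a sketch against my reading of drivers/md/raid0.c (context lines from memory, not a tested patch):

--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ raid0_run() @@
 	if (mddev->queue) {
-		blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
+		/*
+		 * Don't advertise chunk_sectors as the hw limit: let
+		 * blk_queue_split() pass the whole transfer through, so
+		 * raid0_make_request() peels off chunks, in order, itself.
+		 */
+		blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);

With that in place blk_queue_split() never generates the out-of-order 'B' fragment, at the cost of raid0_make_request() resubmitting the tail of the transfer once per chunk.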