CC'ing Song Liu (md-raid maintainer) and the linux-raid mailing list.

On Fri, 21 Jan 2022 16:38:03 +0000
Roger Willcocks <roger@xxxxxxxxxxxxxxxx> wrote:

> Hi folks,
>
> we noticed a thirty percent drop in performance on one of our raid
> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
> striping to balance (by time) access to a pair of hardware raid-6
> arrays. The underlying issue is also present in the native raid0
> driver so herewith the gory details; I'd appreciate your thoughts.
>
> --
>
> blkdev_direct_IO() calls submit_bio() which calls an outermost
> generic_make_request() (aka submit_bio_noacct()).
>
> md_make_request() calls blk_queue_split() which cuts an incoming
> request into two parts with the first no larger than get_max_io_size()
> bytes (which in the case of raid0, is the chunk size):
>
> R -> AB
>
> blk_queue_split() gives the second part 'B' to generic_make_request()
> to worry about later and returns the first part 'A'.
>
> md_make_request() then passes 'A' to a more specific request handler,
> in this case raid0_make_request().
>
> raid0_make_request() cuts its incoming request into two parts at the
> next chunk boundary:
>
> A -> ab
>
> it then fixes up the device (chooses a physical device) for 'a', and
> gives both parts, separately, to generic_make_request().
>
> This is where things go awry, because 'b' is still targeted to the
> original device (same as 'B'), but 'B' was queued before 'b'. So we
> end up with:
>
> R -> Bab
>
> The outermost generic_make_request() then cuts 'B' at
> get_max_io_size(), and the process repeats. Ascii art follows:
>
>
>     /---------------------------------------------------/   incoming rq
>
>     /--------/--------/--------/--------/--------/------/   max_io_size
>
> |--------|--------|--------|--------|--------|--------|--------| chunks
>
> |...=====|---=====|---=====|---=====|---=====|---=====|--......| rq out
>       a    b   c    d   e    f   g    h   i    j   k   l
>
> Actual submission order for two-disk raid0: 'aeilhd' and 'cgkjfb'
>
> --
>
> There are several potential fixes -
>
> simplest is to set raid0 blk_queue_max_hw_sectors() to UINT_MAX
> instead of chunk_size, so that raid0_make_request() receives the
> entire transfer length and cuts it up at chunk boundaries;
>
> neatest is for raid0_make_request() to recognise that 'b' doesn't
> cross a chunk boundary so it can be sent directly to the physical
> device;
>
> and correct is for blk_queue_split to requeue 'A' before 'B'.
>
> --
>
> There's also a second issue - with large raid0 chunk size (256K), the
> segments submitted to the physical device are at least 128K and
> trigger the early unplug code in blk_mq_make_request(), so the
> requests are never merged. There are legitimate reasons for a large
> chunk size so this seems unhelpful.
>
> --
>
> As I said, I'd appreciate your thoughts.
>
> --
>
> Roger
>
> --
> dm-devel mailing list
> dm-devel@xxxxxxxxxx
> https://listman.redhat.com/mailman/listinfo/dm-devel
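
For anyone who wants to see the ordering without a test box, here is a
minimal Python sketch of the dispatch loop Roger describes. It is a toy
model under stated assumptions, not kernel code. The geometry (chunk
size of 8 sectors, a request starting at sector 3 and 46 sectors long,
two disks) is read off the ASCII art. The queueing rule, where bios
newly submitted for a lower-level device are handled first, then new
bios for the same device, then whatever was already pending, is my
reading of generic_make_request() in kernels of this vintage and should
be treated as an assumption. The names simulate() and as_letters() are
made up for illustration.

```python
from collections import deque

def simulate(start, length, chunk=8, ndisks=2, max_io=None):
    """Toy model: return, per disk, the (start, end) ranges in issue order."""
    if max_io is None:
        max_io = chunk                  # raid0 advertises chunk size as max_io
    pending = deque([("md", start, start + length)])
    issued = [[] for _ in range(ndisks)]
    while pending:
        dev, lo, hi = pending.popleft()
        if dev != "md":                 # bio for a physical disk: issue it
            issued[dev].append((lo, hi))
            continue
        new = []                        # bios submitted while handling this one
        # blk_queue_split(): keep the first max_io sectors, queue the rest ('B')
        if hi - lo > max_io:
            new.append(("md", lo + max_io, hi))
            hi = lo + max_io
        # raid0_make_request(): cut at the next chunk boundary, requeue 'b'
        boundary = (lo // chunk + 1) * chunk
        if hi > boundary:
            new.append(("md", boundary, hi))
            hi = boundary
        # 'a' is remapped to a physical disk and submitted last
        new.append(((lo // chunk) % ndisks, lo, hi))
        # assumed ordering: lower-level bios, then same-level, then old pending
        lower = [b for b in new if b[0] != "md"]
        same = [b for b in new if b[0] == "md"]
        pending.extendleft(reversed(lower + same))
    return issued

def as_letters(issued):
    """Name the pieces a, b, c, ... by ascending start sector, as in the art."""
    pieces = sorted(p for disk in issued for p in disk)
    name = {p: chr(ord("a") + i) for i, p in enumerate(pieces)}
    return ["".join(name[p] for p in disk) for disk in issued]

print(as_letters(simulate(start=3, length=46)))   # ['aeilhd', 'cgkjfb']
```

Passing a very large max_io, which corresponds to the "simplest" fix
quoted above, makes the same model issue whole chunks to each disk in
ascending order, e.g. simulate(start=3, length=46, max_io=2**32).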