Re: Raid0 performance regression

Lukas Straub <lukasstraub2@xxxxxx> · Sun, 23 Jan 2022 18:00:58 +0000

CC'ing Song Liu (md-raid maintainer) and linux-raid mailing list.

On Fri, 21 Jan 2022 16:38:03 +0000
Roger Willcocks <roger@xxxxxxxxxxxxxxxx> wrote:

> Hi folks,
> 
> we noticed a thirty percent drop in performance on one of our raid
> arrays when switching from CentOS 6.5 to 8.4; it uses raid0-like
> striping to balance (by time) access to a pair of hardware raid-6
> arrays. The underlying issue is also present in the native raid0
> driver so herewith the gory details; I'd appreciate your thoughts.
> 
> --
> 
> blkdev_direct_IO() calls submit_bio() which calls an outermost
> generic_make_request() (aka submit_bio_noacct()).
> 
> md_make_request() calls blk_queue_split() which cuts an incoming
> request into two parts with the first no larger than get_max_io_size()
> bytes (which in the case of raid0, is the chunk size):
> 
>   R -> AB
>   
> blk_queue_split() gives the second part 'B' to generic_make_request()
> to worry about later and returns the first part 'A'.
> 
> md_make_request() then passes 'A' to a more specific request handler,
> in this case raid0_make_request().
> 
> raid0_make_request() cuts its incoming request into two parts at the
> next chunk boundary:
> 
>   A -> ab
> 
> it then fixes up the device (chooses a physical device) for 'a', and
> gives both parts, separately, to generic_make_request().
> 
> This is where things go awry, because 'b' is still targeted at the
> original device (same as 'B'), but 'B' was queued before 'b'. So we
> end up with:
> 
>   R -> Bab
> 
> The outermost generic_make_request() then cuts 'B' at
> get_max_io_size(), and the process repeats. Ascii art follows:
> 
> 
>     /---------------------------------------------------/   incoming rq
> 
>     /--------/--------/--------/--------/--------/------/   max_io_size
>       
> |--------|--------|--------|--------|--------|--------|--------| chunks
> 
> |...=====|---=====|---=====|---=====|---=====|---=====|--......| rq out
>       a    b  c     d  e     f  g     h  i     j  k     l
> 
> Actual submission order for two-disk raid0: 'aeilhd' and 'cgkjfb'
> 
> --
> 
> There are several potential fixes -
> 
> simplest is to set raid0 blk_queue_max_hw_sectors() to UINT_MAX
> instead of chunk_size, so that raid0_make_request() receives the
> entire transfer length and cuts it up at chunk boundaries;
> 
> neatest is for raid0_make_request() to recognise that 'b' doesn't
> cross a chunk boundary so it can be sent directly to the physical
> device;
> 
> and correct is for blk_queue_split() to requeue 'A' before 'B'.
> 
> --
> 
> There's also a second issue - with large raid0 chunk size (256K), the
> segments submitted to the physical device are at least 128K and
> trigger the early unplug code in blk_mq_make_request(), so the
> requests are never merged. There are legitimate reasons for a large
> chunk size so this seems unhelpful.
> 
> --
> 
> As I said, I'd appreciate your thoughts.
> 
> --
> 
> Roger
> 
> --
> dm-devel mailing list
> dm-devel@xxxxxxxxxx
> https://listman.redhat.com/mailman/listinfo/dm-devel
> 
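
For anyone who wants to play with the ordering described above, here is
a rough toy model in Python. It is a sketch under assumptions, not the
kernel code: it assumes that bios submitted from inside a make_request
handler are placed ahead of older pending bios, with bios for the lower
(physical) device first. The offsets (start 3, length 47, chunk 8) are
arbitrary units read approximately off the ASCII art, and the names
simulate() and labelled() are made up for this example.

```python
from collections import deque

CHUNK = 8  # chunk size in arbitrary units, as in the ASCII art


def simulate(start, length, max_io=CHUNK, ndisks=2):
    """Toy model of the split/requeue order; returns per-disk (start, len) lists."""
    disks = [[] for _ in range(ndisks)]
    pending = deque([("md", start, length)])
    while pending:
        dev, s, n = pending.popleft()
        if dev != "md":
            # Already remapped: record the order it reaches the disk.
            disks[dev].append((s, n))
            continue
        lower, same = [], []
        # Model of blk_queue_split(): keep the first max_io units,
        # hand the remainder ('B') back to the md device.
        if n > max_io:
            same.append(("md", s + max_io, n - max_io))
            n = max_io
        # Model of raid0_make_request(): cut at the next chunk boundary,
        # hand the remainder ('b') back to the md device.
        to_boundary = CHUNK - s % CHUNK
        if n > to_boundary:
            same.append(("md", s + to_boundary, n - to_boundary))
            n = to_boundary
        # The first part ('a') is remapped to a physical disk.
        lower.append(((s // CHUNK) % ndisks, s, n))
        # Assumption: new bios go ahead of older pending ones,
        # lower-device bios first, then same-device bios.
        pending.extendleft(reversed(lower + same))
    return disks


def labelled(disks):
    """Name the pieces a, b, c, ... by ascending start offset."""
    pieces = sorted(p for d in disks for p in d)
    names = {p: chr(ord("a") + i) for i, p in enumerate(pieces)}
    return ["".join(names[p] for p in d) for d in disks]


# max_io == chunk size: the reported behaviour.
print(labelled(simulate(3, 47)))
# max_io effectively unlimited: the "simplest" fix from the list above.
print(labelled(simulate(3, 47, max_io=10**9)))
```

In this model the first call gives 'aeilhd' and 'cgkjfb' for the two
disks, matching the order reported above. With max_io made effectively
unlimited, each disk receives its pieces in ascending order (the pieces
are then whole chunks, so the letters do not map onto the art).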
