On Thu, 19 Mar 2015 13:41:16 -0600 Eric Mei <meijia@xxxxxxxxx> wrote:

> On 2015-03-19 12:02 AM, NeilBrown wrote:
> > On Wed, 18 Mar 2015 23:39:11 -0600 Eric Mei <meijia@xxxxxxxxx> wrote:
> >
> >> From: Eric Mei <eric.mei@xxxxxxxxxxx>
> >>
> >> When the array is degraded, reads that land on failed drives result
> >> in reading the rest of the data in the stripe, so a single sequential
> >> read ends up reading the same data twice.
> >>
> >> This patch avoids chunk-aligned reads for degraded arrays. The
> >> downside is that it involves the stripe cache, which means associated
> >> CPU overhead and an extra memory copy.
> >>
> >> Signed-off-by: Eric Mei <eric.mei@xxxxxxxxxxx>
> >> ---
> >>  drivers/md/raid5.c | 15 ++++++++++++---
> >>  1 files changed, 12 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> index cd2f96b..763c64a 100644
> >> --- a/drivers/md/raid5.c
> >> +++ b/drivers/md/raid5.c
> >> @@ -4180,8 +4180,12 @@ static int raid5_mergeable_bvec(struct mddev *mddev,
> >>      unsigned int chunk_sectors = mddev->chunk_sectors;
> >>      unsigned int bio_sectors = bvm->bi_size >> 9;
> >>
> >> -    if ((bvm->bi_rw & 1) == WRITE)
> >> -        return biovec->bv_len; /* always allow writes to be mergeable */
> >> +    /*
> >> +     * always allow writes to be mergeable, read as well if array
> >> +     * is degraded as we'll go through stripe cache anyway.
> >> +     */
> >> +    if ((bvm->bi_rw & 1) == WRITE || mddev->degraded)
> >> +        return biovec->bv_len;
> >>
> >>      if (mddev->new_chunk_sectors < mddev->chunk_sectors)
> >>          chunk_sectors = mddev->new_chunk_sectors;
> >> @@ -4656,7 +4660,12 @@ static void make_request(struct mddev *mddev, struct bio * bi)
> >>
> >>      md_write_start(mddev, bi);
> >>
> >> -    if (rw == READ &&
> >> +    /*
> >> +     * If array is degraded, better not do chunk aligned read because
> >> +     * later we might have to read it again in order to reconstruct
> >> +     * data on failed drives.
> >> +     */
> >> +    if (rw == READ && mddev->degraded == 0 &&
> >>          mddev->reshape_position == MaxSector &&
> >>          chunk_aligned_read(mddev,bi))
> >>          return;
> >
> > Thanks for the patch.
> >
> > However this sort of patch really needs to come with some concrete
> > performance numbers, preferably for both sequential reads and random
> > reads.
> >
> > I agree that sequential reads are likely to be faster, but how much
> > faster are they?
> > I imagine that this might make random reads a little slower. Does it?
> > By how much?
> >
> > Thanks,
> > NeilBrown
>
> Hi Neil,
>
> Sorry, I should have done the tests in the first place.
>
> The following tests were done on an enterprise storage node with
> Seagate 6 TB SAS drives and a Xeon E5-2648L CPU (10 cores, 1.9 GHz),
> using a 10-disk MD RAID6 (8+2) with a 128 KiB chunk size.
>
> I used FIO with direct I/O, various block sizes, and enough queue
> depth, and tested sequential and 100% random reads against three array
> configurations: 1) optimal, as the baseline; 2) degraded; 3) degraded
> with this patch. The kernel version is 4.0-rc3.
>
> Each individual test was run only once, so there may be some variation,
> but the focus here is the big trend.
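>
> A job of roughly this shape captures the setup (an illustrative sketch,
> not the exact job file; the device name, iodepth, and runtime below are
> placeholders):
>
>   [seqread]
>   ioengine=libaio
>   direct=1
>   rw=read
>   bs=128k
>   iodepth=32
>   runtime=60
>   time_based
>   filename=/dev/md0
>
> The random-read jobs differ only in rw=randread and the bs value.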
>
> Sequential Read:
>
> bs (KiB)   optimal (MiB/s)   degraded (MiB/s)   degraded-with-patch (MiB/s)
> 1024       1608              656                995
> 512        1624              710                956
> 256        1635              728                980
> 128        1636              771                983
> 64         1612              1119               1000
> 32         1580              1420               1004
> 16         1368              688                986
> 8          768               647                953
> 4          411               413                850
>
> Random Read:
>
> bs (KiB)   optimal (IOPS)   degraded (IOPS)   degraded-with-patch (IOPS)
> 1024       163              160               156
> 512        274              273               272
> 256        426              428               424
> 128        576              592               591
> 64         726              724               726
> 32         849              848               837
> 16         900              970               971
> 8          927              940               929
> 4          948              940               955
>
> Some notes:
> * In sequential + optimal, as bs gets smaller, the FIO thread becomes
>   CPU bound.
> * In sequential + degraded, there is a big jump when bs is 64 KiB and
>   32 KiB; I don't have an explanation for that.
> * In sequential + degraded-with-patch, the MD thread mostly becomes
>   CPU bound.
>
> If you want, we can discuss specific data points in those results. But
> in general it seems that with this patch we get more predictable and in
> most cases significantly better sequential read performance when the
> array is degraded, and almost no noticeable impact on random reads.
>
> Performance is a complicated thing; the patch works well for this
> particular configuration but may not be universal. For example, I
> imagine testing on an all-SSD array could give very different results.
> But I personally think that in most cases I/O bandwidth is a scarcer
> resource than CPU.
>
> Eric

Thanks. That is reasonably convincing.

I've added that text to the commit message, fixed up all the whitespace
damage in the patch (tabs were converted to spaces, etc. ... if you are
going to be sending more patches, please find a way to convince your
mailer that spaces are important), and applied it. It should be included
in my pull request for 4.1.

Thanks,
NeilBrown
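As a footnote to the discussion above, the "read twice" effect that the
commit message describes can be made concrete with a toy calculation.
The standalone C sketch below models a full-stripe sequential read on
the 8+2 RAID6 from the test setup, with one failed data drive. It is a
simplification (the stripe cache actually works in page-sized units,
and this is not kernel code), and the numbers are illustrative
assumptions, not measurements from the thread.

    #include <stdio.h>

    int main(void)
    {
        /* Assumed layout, matching the test setup in the thread:
         * 8+2 RAID6, 128 KiB chunks, one failed data drive, and a
         * sequential read covering one full stripe of user data.
         */
        const unsigned chunk_kib  = 128;
        const unsigned data_disks = 8;  /* data chunks per stripe    */
        const unsigned surviving  = 7;  /* one data drive has failed */

        unsigned wanted = data_disks * chunk_kib;

        /* Without the patch: the 7 surviving data chunks are served
         * by chunk-aligned reads, then reconstructing the missing
         * chunk re-reads those 7 chunks plus the P parity chunk.
         */
        unsigned aligned = surviving * chunk_kib
                         + (surviving + 1) * chunk_kib;

        /* With the patch: the whole read goes through the stripe
         * cache, so each surviving data chunk and the P parity chunk
         * are read from disk exactly once.
         */
        unsigned cached = (surviving + 1) * chunk_kib;

        printf("user data requested:      %4u KiB\n", wanted);
        printf("disk reads, aligned path: %4u KiB\n", aligned);
        printf("disk reads, stripe cache: %4u KiB\n", cached);
        return 0;
    }

Built and run with "cc model.c && ./a.out", it reports 1024 KiB
requested, 1920 KiB read from disk via the chunk-aligned path, and
1024 KiB via the stripe cache: roughly the doubling the commit message
describes, and consistent in direction with the degraded vs.
degraded-with-patch sequential numbers above.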