Re: RAID and LVM alignment when expanding PVs

I have already tried linear mode; performance drops significantly.
To give you an idea of how big the impact is: with sequential reads at
64 threads (sysbench seqrd) I get about 2.5GB/s on the striped LVM
setup, but only around 800MB/s in linear mode.
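
To be concrete, the test was along these lines (the file size and run
time here are only illustrative, not my exact values):

$ sysbench --test=fileio --file-total-size=64G --file-test-mode=seqrd prepare
$ sysbench --test=fileio --file-total-size=64G --file-test-mode=seqrd \
    --num-threads=64 --max-time=60 --max-requests=0 run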

I don't want to be rude, but please, before calling what I have below a
damn mess, consider that it is the result of hours of benchmarking to
arrive at the correct numbers.
I assume you suggested this because these are SSDs and you expect me to
saturate the PCIe bus before I can saturate the disks; that assumption
is usually wrong, especially since this LSI controller is PCIe Gen3.

I also don't need you to explain this to me, as I understand exactly
why it's slow. I came here to ask a simple question, which can be
summarized as: "If I have a PV that is based on an mdadm array and I
then expand the mdadm array, do I lose the data alignment in LVM?"
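
Put differently: after the grow, is the 1st-PE offset reported by
something like the command below still a multiple of the (new) full
stripe width? (pe_start is the field I care about here.)

$ pvs --units k -o pv_name,pe_start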

I always treat the mailing list as a last resort, and posting here was
also suggested to me on the IRC channel.


On Sun, Oct 28, 2012 at 4:20 AM, Stan Hoeppner <stan@xxxxxxxxxxxxxxxxx> wrote:
> My eyes are bleeding and I have a headache after reading this.  I
> despise top posting, but there's no other way here.
>
> What you have below is a damn mess.  Throw it all out and do this the
> easy, correct way:
>
> 1.  Create an mdadm linear array containing the LSI logical devices
> 2.  mkfs.xfs /dev/md0
> 3.  Done.
>
> Wasn't that easy?
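>
> In commands, that's roughly the following (device names taken from your
> post below; adjust them if they differ):
>
>   mdadm --create /dev/md0 --level=linear --raid-devices=5 /dev/sd[defgh]
>   mkfs.xfs /dev/md0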
>
> With SSDs and BBWC, latency is zero for practical purposes, so you have
> no RMW penalty, making XFS journal/data stripe alignment unnecessary.
>
> The only XFS mount option you need in fstab is nobarrier.  Relatime, the
> XFS default, is equivalent to noatime and nodiratime.  The rest of what
> you manually specified below are current default values.
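>
> In other words, an fstab entry as simple as something like this (the
> mount point here is only an example):
>
>   /dev/md0   /data   xfs   defaults,nobarrier   0 0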
>
> Expanding capacity with XFS and a concat (linear array) is simplicity
> itself:
>
> Add the next new LSI logical device to the linear array and then grow
> the XFS.  Since you haven't specified su/sw, your new RAID5 geometry
> doesn't have to match the previous ones, i.e. adding a 3 drive or 9
> drive RAID5 is fine.  You could even add a RAID10 or RAID6 array and it
> would work just fine.
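>
> In commands that's roughly (the new device name below is only an
> example; the xfs_growfs argument is whatever mount point the filesystem
> is on):
>
>   mdadm --grow /dev/md0 --add /dev/sdi
>   xfs_growfs /mountpoint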
>
> Given your post below, it's almost assured that you will want me to
> expend another post or 3 explaining in minute detail how/why the above
> configuration works and why you should use it.  I will not do so here
> again.  I've explained the virtues of this setup on this and the Dovecot
> list many times and those posts are in multiple internet archives.  I've
> given you what you need to know to set this up in a sane and simple high
> performance manner.  You can test it yourself or simply ignore my
> advice.  That's up to you.
>
> The commands to create and grow an md linear array are in the mdadm man
> page.  Those to grow an XFS are in the xfs_growfs man page.  If you need
> minor clarification on some point I'll be glad to respond, but I'm not
> going to write another thesis on this.
>
> Best of luck.
>
> --
> Stan
>
>
> On 10/27/2012 3:49 PM, Erez Zarum wrote:
>> Hey,
>> Before posting here I spent some time searching for answers, including
>> on the LVM IRC channel. I don't actually use Linux software RAID
>> (mdadm), but a question about the behavior of mdadm in conjunction with
>> LVM may answer mine.
>> The server is going to run MySQL with InnoDB. I am not going to change
>> InnoDB's default block size, so the block size it uses is 16KiB.
>> I have an LSI controller with a BBU and 25 SSDs (28 in total at the
>> moment, 3 of which are hot spares).
>> I created 5 logical drives (LDs), each a RAID5 (4+1) with a 256KiB
>> stripe size, and presented them to the OS.
>> These show up as /dev/sdd, /dev/sde, /dev/sdf, /dev/sdg and /dev/sdh
>> (/dev/sd[defgh]).
>> What I want is full utilization of those LDs as a single block device,
>> with a way to expand in the future.
>> At first I opted for SW RAID (mdadm) with RAID0 and LVM on top of it,
>> but at the moment a RAID0 array can't be expanded (I know that will be
>> possible in the future), so I decided to go with LVM striping.
>> So I created 5 PVs with the data area aligned at 1024KiB (4*256KiB),
>> which happens to be LVM's default in RHEL 6 anyway.
>>
>> $ pvcreate -M2 --pvmetadatacopies 2 --dataalignment=1024k /dev/sdd
>> $ pvcreate -M2 --pvmetadatacopies 2 --dataalignment=1024k /dev/sde
>> $ pvcreate -M2 --pvmetadatacopies 2 --dataalignment=1024k /dev/sdf
>> $ pvcreate -M2 --pvmetadatacopies 2 --dataalignment=1024k /dev/sdg
>> $ pvcreate -M2 --pvmetadatacopies 2 --dataalignment=1024k /dev/sdh
>>
>> Then I created the VG:
>> $ vgcreate -M2 --vgmetadatacopies 2 vg1 /dev/sdd /dev/sde /dev/sdf
>> /dev/sdg /dev/sdh
>>
>> Then I created a striped LV with a 256KiB stripe size and left 5% free
>> for snapshots (I won't take many, perhaps one per week for a backup
>> that is deleted afterwards):
>> $ lvcreate -i 5 -I 256k -n lv1 -l 95%FREE vg1 /dev/sdd /dev/sde
>> /dev/sdf /dev/sdg /dev/sdh
>>
>> As I am using XFS, I created the filesystem with the following parameters:
>> $ mkfs.xfs -d su=256k,sw=20 /dev/vg1/lv1
>>
>> And I used the following mount options:
>> noatime,nodiratime,nobarrier,logbufs=8,logbsize=256k
>>
>> So everything should be aligned now.
>>
>> When the time comes to expand the capacity of the LV, I will not be
>> able to add more PVs to it because it is a striped LV (unless I double
>> the number of PVs, but I won't grow at that rate).
>> My only option in this setup is to expand the underlying RAID5 LDs by
>> adding more disks; I will grow each LD by two disks.
>> I will then have 5 LDs, each a RAID5 (6+1) with a 256KiB stripe size.
>> I was under the impression this would be simple: expand the PVs, then
>> the VG, then the LV. That will work, but now the PVs won't be aligned
>> (or maybe I'm wrong here?).
>> The full stripe width becomes 1536KiB (6*256KiB), while the PVs were
>> created with a data alignment of 1024KiB; since 1024KiB is not a
>> multiple of 1536KiB, the data area no longer starts on a full-stripe
>> boundary, so after adding disks to the underlying RAID5 LDs I am no
>> longer aligned to stripe boundaries.
>>
>> As you know, it is not possible to change a PV's data alignment after
>> the PV has been created.
>> So I checked whether I could create those PVs (/dev/sd[defgh]) with no
>> metadata copies at all (--pvmetadatacopies 0) and add two small devices
>> to the VG just for the metadata (I would never expand those devices, so
>> they would stay aligned).
>> But even with --pvmetadatacopies 0 I saw that a small amount of
>> metadata (PV UUID, LABEL, etc.) is still written, for obvious reasons.
>> I also searched and saw that it is not possible to place the PV
>> metadata at the end of the block device, so if I grow the LDs (PVs) the
>> data will no longer be aligned.
>>
>> Because I know I can't extend the LV by adding just one or two PVs,
>> this is my only option. My estimated growth this way is up to 160
>> disks, and I will probably migrate from RAID5 to RAID6 (RAID5 with that
>> many disks is not reliable enough), so I will have 5 x RAID6 (30+2) at
>> most.
>>
>> The other option I see that might work is creating a linear LV from
>> those 5 LDs. That would let me grow by adding more PVs (creating more
>> RAID5 4+1 LDs) to the VG and then extending the LV.
>> Since my MySQL InnoDB block size is the default 16KiB (and I have no
>> plans to change it), in either setup a single read or write will go to
>> one disk.
>> The problem I have with this setup is that I couldn't make it work. I
>> know I need to align the XFS allocation groups with the boundaries of
>> the underlying devices in the LV, but I couldn't find the right way to
>> do it; during my benchmarks only one disk was utilized and I didn't get
>> much parallel I/O (regardless of the thread count).
>> My other concern is whether it is possible, after extending the LV, to
>> tweak the XFS AGs so that they still fall on those boundaries.
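>>
>> What I had in mind was something along these lines, though I'm not sure
>> it is the right approach (this assumes the 5 PVs contribute equal space
>> so a whole number of AGs lands on each PV):
>>
>> $ mkfs.xfs -d agcount=20 /dev/vg1/lv1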
>>
>> I am asking the question here because I think that either way, be it
>> HW RAID or SW RAID (mdadm), expanding the block device underneath a PV
>> puts me in the same situation, and since I know mdadm and LVM have some
>> integration (LVM reads the mdadm sysfs attributes for alignment, etc.),
>> perhaps I am missing something.
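>>
>> As far as I understand, that detection is controlled by the lvm.conf
>> settings below, so I would expect pvcreate to pick a suitable offset on
>> an md-backed PV by itself (please correct me if I am reading this
>> wrong):
>>
>> $ lvm dumpconfig devices/md_chunk_alignment
>> $ lvm dumpconfig devices/data_alignment_detection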
>>
>> Thanks for any help!
>

