Re: XFS on top RAID10 with odd drives count and 2 near copies

On Mon, Feb 13, 2012 at 12:50 PM, David Brown <david@xxxxxxxxxxxxxxx> wrote:
>
> Comments at the bottom, as they are too mixed to put inline.
>
>
> On 12/02/2012 21:16, CoolCold wrote:
>>
>> First of all, Stan, thanks for such a detailed answer - I greatly
>> appreciate it!
>>
>> On Sat, Feb 11, 2012 at 8:05 AM, Stan Hoeppner<stan@xxxxxxxxxxxxxxxxx>
>>  wrote:
>>>
>>> On 2/10/2012 9:17 AM, CoolCold wrote:
>>>>
>>>> I've got server with 7 SATA drives ( Hetzner's XS13 to be precise )
>>>> and created mdadm's raid10 with two near copies, then put LVM on it.
>>>> Now I'm planning to create xfs filesystem, but a bit confused about
>>>> stripe width/stripe unit values.
>>>
>>>
>>> Why use LVM at all?  Snapshots?  The XS13 has no option for more drives
>>> so it can't be for expansion flexibility.  If you don't 'need' LVM don't
>>> use it.  It unnecessarily complicates your setup and can degrade
>>> performance.
>>
>> There are several reasons for this: 1) I've made the decision to use LVM
>> for all "data" volumes (i.e. everything except /, /boot, /home, etc.);
>> 2) there will be a mysql database which will need snapshot-based
>> backups; 3) I often have several (0-3) virtual environments (OpenVZ
>> based) which live on ext3/ext4 (because extensive metadata updates on
>> xfs make the whole machine slow), each on its own LV because of this.
>>
>>>
>>>> As the drive count is 7 and the copies count is 2, a simple calculation
>>>> gives me a data-drive count of "3.5", which looks ugly. If I understand
>>>> the whole idea of sunit/swidth right, it should fill (or buffer) the
>>>> full stripe (sunit * data disks) and then do the write, so the
>>>> optimization takes place and all disks work at once.
>>>
>>>
>>> Pretty close.  Stripe alignment is only applicable to allocation, i.e.
>>> new file creation and log journal writes, but not to file re-writes or
>>> read ops.  Note that stripe alignment will gain you nothing if your
>>> allocation workload doesn't match the stripe alignment.  For example,
>>> writing a 32KB file every 20 seconds: it'll take too long to fill the
>>> buffer before it's flushed, and it's a tiny file, so you'll end up with
>>> many partial stripe width writes.
>>
>> Okay, got it - I've been thinking along similar lines.
>>>
>>>
>>>> My read load is going to be nearly random reads (sending pictures over
>>>> http), and it looks like it doesn't matter much how sunit/swidth are set.
>>>
>>>
>>> ~13TB of "pictures" to serve eh?  Average JPG file size will be
>>> relatively small, correct?  Less than 1MB?  No, stripe alignment won't
>>> really help this workload at all, unless you upload a million files in
>>> one shot to populate the server.  In that case alignment will make the
>>> process complete more quickly.
>>
>> Based on current storage, estimates (df -h / df -i) show the average
>> file size is ~200kb. The inode count is near 15 million and it will
>> grow. I've just thought that maybe I should change the chunk size to
>> 256kb, just to let one file be read from one disk; this may increase
>> latency but also increase throughput.
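>> Roughly: with 64kb chunks a ~200kb file spans about four chunks, so a
>> single read can touch up to 4 spindles, while with 256kb chunks it
>> would normally sit in a single chunk on one disk. If I try it, I
>> believe recreating the array would look something like this (untested
>> sketch, same devices as in my mdstat below):
>>
>>     mdadm --create /dev/md3 --level=10 --layout=n2 --chunk=256 \
>>           --raid-devices=7 /dev/sd[a-g]5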
>>
>>>
>>>>     root@datastor1:~# cat /proc/mdstat
>>>>     Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
>>>>     md3 : active raid10 sdg5[6] sdf5[5] sde5[4] sdd5[3] sdc5[2] sdb5[1] sda5[0]
>>>>           10106943808 blocks super 1.2 64K chunks 2 near-copies [7/7] [UUUUUUU]
>>>>           [>....................]  resync =  0.8% (81543680/10106943808) finish=886.0min speed=188570K/sec
>>>>           bitmap: 76/76 pages [304KB], 65536KB chunk
>>>
>>>
>>>> Almost default mkfs.xfs creating options produced:
>>>>
>>>>     root@datastor1:~# mkfs.xfs -l lazy-count=1 /dev/data/db -f
>>>>     meta-data=/dev/data/db          isize=256    agcount=32, agsize=16777216 blks
>>>>              =                      sectsz=512   attr=2, projid32bit=0
>>>>     data     =                      bsize=4096   blocks=536870912, imaxpct=5
>>>>              =                      sunit=16     swidth=112 blks
>>>>     naming   =version 2             bsize=4096   ascii-ci=0
>>>>     log      =internal log          bsize=4096   blocks=262144, version=2
>>>>              =                      sectsz=512   sunit=16 blks, lazy-count=1
>>>>     realtime =none                  extsz=4096   blocks=0, rtextents=0
>>>>
>>>>
>>>> As I can see, it created a swidth of 112/16 = 7 chunks, which
>>>> correlates with my version b), and I guess I will leave it this way.
>>>
>>>
>>> The default mkfs.xfs algorithms don't seem to play well with the
>>> mdraid10 near/far copy layouts.  The above configuration is doing a 7
>>> spindle stripe of 64KB, for a 448KB total stripe size.  This doesn't
>>> seem correct, as I don't believe a 7 drive RAID10 near is giving you 7
>>> spindles of stripe width.  I'm no expert on the near/far layouts, so I
>>> could be wrong here.  If a RAID0 stripe would yield a 7 spindle stripe
>>> width, I don't see how a RAID10/near would also be 7.  A straight RAID10
>>> with 8 drives would give a 4 spindle stripe width.
>>
>>
>> I drew a nice picture from my head in my original post; it was:
>>
>> A1 A1 A2 A2 A3 A3 A4
>> A4 A5 A5 A6 A6 A7 A7
>> A8 A8 A9 A9 A10 A10 A11
>> A11 ...
>>
>> Here A{X} is the chunk number, laid out on top of 7 disks. As you can
>> see, writing 7 chunks (A1 - A7) will fill two rows, so a full-stripe
>> write takes 2 disk head movements, though those moves may be very near
>> to each other. The real situation may differ of course, and I'm not
>> expert enough to make a bet either.
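>> (Noting for myself: if the detected geometry ever turns out to be
>> wrong, I believe it can be pinned by hand at mkfs time with something
>> along the lines of
>>
>>     mkfs.xfs -d su=64k,sw=7 -l lazy-count=1 /dev/data/db
>>
>> where su is the md chunk size and sw the number of data chunks per
>> stripe.)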
>>
>>>
>>>> So, I'll be glad if anyone can review my thoughts and share yours.
>>>
>>>
>>> To provide you with any kind of concrete real world advice we need more
>>> details about your write workload/pattern.  In absence of that, and
>>> given what you've already stated, that the application is "sending
>>> pictures over http", then this seems to be a standard static web server
>>> workload.  In that case disk access, especially write throughput, is
>>> mostly irrelevant, as memory capacity becomes the performance limiting
>>> factor.  Given that you have 12GB of RAM for Apache/nginx/Lighty and
>>> buffer cache, how you setup the storage probably isn't going to make a
>>> big difference from a performance standpoint.
>>
>> Yes, this is a standard static webserver workload, with nginx as the
>> frontend and almost only reads.
>>
>>
>>>
>>> That said, for this web server workload, you'll be better off if you
>>> avoid any kind of striping altogether, especially if using XFS.  You'll
>>> be dealing with millions of small picture files I assume, in hundreds or
>>> thousands of directories?  In that case play to XFS' strengths.  Here's
>>> how you do it:
>>
>> Hundreds of directories at least, yes.
>> After reading your ideas and refinements, I'm coming to the conclusion
>> that I need to push the other team members harder to remove the mysql
>> instances from the static-file-serving boxes altogether, to free RAM at
>> least for dcache entries.
>>
>> About avoiding striping - later in the text.
>>
>>>
>>> 1.  You chose mdraid10/near strictly because you have 7 disks and wanted
>>> to use them all.  You must eliminate that mindset.  Redo the array with
>>> 6 disks leaving the 7th as a spare (a smart thing to do anyway).  What
>>> can you really do with 10.5TB that you can't with 9TB?
>>
>> Hetzner's guys were pretty fast at changing failed disks (one to two
>> days after the claim) so I may try without spares, I guess... I just
>> want to use more independent spindles here, but I'll think about your
>> suggestion one more time, thanks.
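>> If I do go the 6 disks + spare way, I suppose it would be roughly
>> (untested, same partitions as in the current array):
>>
>>     mdadm --create /dev/md3 --level=10 --layout=n2 --chunk=64 \
>>           --raid-devices=6 --spare-devices=1 /dev/sd[a-g]5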
>>
>>>
>>> 2.  Take your 6 disks and create 3 mdraid1 mirror pairs--don't use
>>> partitions as these are surely Advanced Format drives.  Now take those 3
>>> mdraid mirror devices and create a layered mdraid --linear array of the
>>> three.  The result will be a ~9TB mdraid device.
>>>
>>> 3.  Using a linear concat of 3 mirrors with XFS will yield some
>>> advantages over a striped array for this picture serving workload.
>>> Format the array with:
>>>
>>> /$ mkfs.xfs -d agcount=12 /dev/mdx
>>>
>>> That will give you 12 allocation groups of 750GB each, 4 AGs per
>>> effective spindle.  Using too many AGs will cause excessive head seeking
>>> under load, especially with a low disk count in the array.  The mkfs.xfs
>>> agcount default is 4 for this reason.  As a general rule you want a
>>> lower agcount when using low RPM drives (5.9k, 7.2k) and a higher
>>> agcount with fast drives (10k, 15k).
>>
>> Good to know such details!
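>> If I follow the layered setup correctly, it would be roughly something
>> like this (my own sketch; whole disks as members per your advice, md
>> numbers made up):
>>
>>     mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sda /dev/sdb
>>     mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
>>     mdadm --create /dev/md12 --level=1 --raid-devices=2 /dev/sde /dev/sdf
>>     mdadm --create /dev/md13 --level=linear --raid-devices=3 \
>>           /dev/md10 /dev/md11 /dev/md12
>>     mkfs.xfs -d agcount=12 /dev/md13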
>>
>>>
>>> Directories drive XFS parallelism, with each directory being created in
>>> a different AG, allowing XFS to write/read 12 files in parallel (far in
>>> excess of the IO capabilities of the 3 drives) without having to worry
>>> about stripe alignment.  Since your file layout will have many hundreds
>>> or thousands of directories and millions of files, you'll get maximum
>>> performance from this setup.
>>
>>
>> So, as I understand it, you are suggesting that XFS's "internal
>> striping" via AGs will be better than MD/LVM striping here? I never
>> thought of XFS in this way and it's an interesting point.
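>> (I guess I could later check how files actually spread over the AGs
>> with something like
>>
>>     xfs_bmap -v /path/to/some/picture.jpg
>>
>> which, as far as I know, reports the AG each extent lives in.)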
>>
>>>
>>> As I said, if I understand your workload correctly, the
>>> array/filesystem layout probably doesn't make much difference.  But if
>>> you're after something optimal and less complicated, for peace of mind,
>>> etc., this is a better solution than the 7 disk RAID10 near layout with
>>> XFS.
>>>
>>> Oh, and don't forget to mount the XFS filesystem with the inode64
>>> option in any case, lest performance be much less than optimal; you may
>>> also run out of directory inodes as the FS fills up.
>>
>> Okay.
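>> So in fstab something like this, I assume (mount point is just an
>> example):
>>
>>     /dev/data/db  /srv/pictures  xfs  inode64  0  0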
>>
>>>
>>> Hope this information was helpful.
>>
>> Yes, very helpful and refreshing, thanks for your comments!
>>
>> P.S. As I've got a 2nd server with the same config, maybe I'll have
>> time to do fast & dirty tests of stripes vs AGs.
>>>
>>>
>>> --
>>> Stan
>>
>>
>
> Here are a few general points:
>
> XFS has a unique (AFAIK) feature of spreading allocation groups across the
> (logical) disk, and letting these AG's work almost independently. So if you
> have multiple disks (or raid arrays, such as raid1/raid10 pairs), and the
> number of AG's is divisible by the number of disks, then a linear
> concatenation of the disks will work well with XFS.  Each access to a file
> will be handled within one AG, and therefore within one disk (or pair).
>  This means you don't get striping or other multiple-spindle benefits for
> that access - but it also means the access is almost entirely independent of
> other accesses to AG's on other disks.  In comparison, if you had a RAID6
> setup, a single write would use /all/ the disks and mean that every other
> access is blocked for a bit.
>
> But there are caveats.
>
> Top level directories are spread out among the AG's, so it only works well
> if you have balanced access through a range of directories, such as a /home
> with a subdirectory per user, or a /var/mail with a subdirectory per email
> account.  If you have a /var/www with two subdirectories "main" and
> "testsite", it will be terrible.  And you must also remember that you don't
> get multi-spindle benefits for large streamed reads and writes - you need
> multiple concurrent access to see any benefits.
>
> If you have several filesystems on the same array (via LVM or other
> partitioning), you will lose most of the elegance and benefits of this type
> of XFS arrangement.  You really want to use it on a dedicated array.
>
> It is also far from clear whether a linear concat XFS is better than a
> normal XFS on a raid0 of the same drives (or raid1 pairs).  I think it will
> have lower average latencies on small accesses if you also have big
> reads/writes mixed in, but you will also have lower throughput for larger
> accesses.  For some uses, this sort of XFS arrangement is ideal - a
> particular favourite is for mail servers.  But I suspect in many other cases
> you will stray enough from the ideal access patterns to lose any benefits it
> might have.
>
> Stan is the expert on this, and can give advice on getting the best out of
> XFS.  But personally I don't think a linear concat there is the best way to
> go - especially when you want LVM and multiple filesystems on the array.
>
>
> As another point, since you have mostly read accesses, you should probably
> use raid10,f2 far layout rather than near layout.  It's a bit slower for
> writes, but can be much faster for reads.
>
> Best regards,
>
> David
David, thank you too - you have formalized and written down what I had
jumbled up in my head. Though I'm not going to have large sequential
writes/reads, the info about "far" layouts is useful and I may use it
later as a reference.
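As far as I understand, the far layout is just --layout=f2 instead of n2
at array creation time, e.g. something like

    mdadm --create /dev/mdX --level=10 --layout=f2 --chunk=64 \
          --raid-devices=6 /dev/sd[a-f]5

(device names and count here are only illustrative).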




-- 
Best regards,
[COOLCOLD-RIPN]

