On 8/19/2013 12:49 AM, Ian Pilcher wrote:
> On 08/18/2013 08:40 PM, Stan Hoeppner wrote:
>> Can you elaborate on your workload that demonstrates this?  Different
>> workloads behave differently with different chunk sizes.
>
> dd ... at block sizes between 4KiB and 1MiB, on RAID-5 and -6 arrays
> with chunk sizes in the same range.
>
> Hardware is 5 7200 RPM SATA drives in a NAS (Thecus N5550) with an
> Atom D2550 processor and an ICH10R chipset.  The drives are all
> connected to the chipset's built-in AHCI controller.
>
>> If you can see it, then please demonstrate this read penalty with
>> numbers.  You obviously have test data from the same set of disks
>> with two different RAID5s of different chunk sizes.  This is required
>> to see such a difference in performance.  Please share this data with
>> us.
>
> I've uploaded the data (in OpenDocument spreadsheet form) to Dropbox.
> I think that it's accessible at this link:
>
> https://www.dropbox.com/s/4dq93th4wu5rr2y/nas_benchmarks.ods
>
> (This is my first attempt at sharing anything via Dropbox, so let me
> know if it doesn't work.)
>
> I actually find your response really interesting.  From my Interweb
> searching, the "small stripe size read penalty" seems to be pretty
> widely accepted, much as the "large stripe size write penalty" is.  It
> certainly does show up in my data; as the chunk size increases, reads
> of even small blocks get faster.

Everything in the world of storage performance depends on the workload.
The statements above assume an unstated workload, and are so general as
to not be worth repeating, and certainly not worth putting any stock in.

The former is true of large streaming workloads.  If your workload deals
with small IO reads, such as mail serving, then a small stripe is not
detrimental, as the mail file you're reading is almost always smaller
than the stripe size, and often smaller than the chunk size.  Using a
large chunk/stripe with such a workload can create hotspots on some
disks in the array, increasing latency and decreasing throughput.

However, in this scenario, the big win is in write latency.  A large
chunk/stripe size will generate a huge amount of unnecessary read IO
during RMW cycles to recalculate parity when you write a new mail
message into an existing stripe.  With an optimal chunk/stripe for this
workload, you read few extra sectors during RMW.  It's often very
difficult to get this balance right.  And even if you do, mail workloads
are still many times slower on parity RAID than on mirrors or striped
mirrors (RAID10).

This obviously depends on load.  Even "low end" modern server hardware
with md RAID6 and a handful of disks can easily handle a few hundred
active mail users.  Once you get into the thousands you'll need
mirror-based RAID, as RMW latency will grind you to a halt.  The same
hardware is plenty; you simply change the RAID level.  You'll need a
couple more disks to maintain total capacity, but simply changing to
mirror-based RAID will increase throughput 5-15 fold and decrease
latency substantially.

Any "large stripe size write penalty" will be a function of mismatching
the workload to the RAID stripe and/or array/drive hardware.  Using a
large stripe with a mail workload will yield poor performance indeed,
due to large RMW bandwidth/latency.  Large stripe with this workload
typically means >32-64KB.  Yes, that's stripe, not chunk.  For this
workload, using a 6 drive RAID6, you'd want an 8-16KB chunk for a
32-64KB stripe.  This is the opposite of the meme you quote above.
Again, workload dependent.
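To put rough numbers on that RMW read overhead, here's a quick
back-of-the-envelope sketch in Python.  It assumes a simplified
reconstruct-write model (read the untouched data chunks in the stripe,
recompute parity, write new data plus parity), which is not exactly
what md does on every write; the 8KiB message size and the chunk sizes
are just illustrative:

    # Rough model of the read overhead of a parity-RAID partial-stripe
    # write under reconstruct-write: read the untouched data chunks,
    # recompute parity, write the new data plus parity.
    # Simplified and illustrative only.

    def rmw_read_bytes(write_size, chunk, n_drives, n_parity=2):
        """Approximate extra bytes read to service one small write
        that fits inside a single stripe."""
        data_disks = n_drives - n_parity
        stripe = chunk * data_disks              # full stripe width
        written_chunks = -(-write_size // chunk) # chunks rewritten (ceil)
        # the rest of the stripe must be read back to recompute parity
        return (data_disks - written_chunks) * chunk, stripe

    write = 8 * 1024                             # an 8 KiB mail message
    for chunk_kib in (16, 64, 512):
        extra, stripe = rmw_read_bytes(write, chunk_kib * 1024, n_drives=6)
        print(f"chunk {chunk_kib:>4} KiB  stripe {stripe // 1024:>5} KiB  "
              f"extra read {extra // 1024:>5} KiB per 8 KiB write")

With a 16KB chunk the array reads roughly 48KB of untouched data to
recompute parity for that 8KB write; with a 512KB chunk it's roughly
1.5MB.  That's the unnecessary read IO I'm referring to.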
If your workload is HPC file serving, where user files are 10s to 100s
of GB, even TBs in size, then you'd want the largest chunk/strip/stripe
your hardware can perform well with.  This may be as low as 512KB or it
may be as large as 2MB.  And it will likely be hardware based RAID, not
Linux md.

-- 
Stan