Re: Awful RAID5 random read performance

On Thu, Jun 04, 2009 at 12:21:02AM +0200, Goswin von Brederlow wrote:
> John Robinson <john.robinson@xxxxxxxxxxxxxxxx> writes:
> 
> > On 03/06/2009 19:38, Bill Davidsen wrote:
> >> John Robinson wrote:
> >>> On 02/06/2009 20:47, Keld Jørn Simonsen wrote:
> > [...]
> >>>> In your case, using 3 disks, raid5 should give about 210 % of the
> >>>> nominal
> >>>> single disk speed for big file reads, and maybe 180 % for big file
> >>>> writes. raid10,f2 should give about 290 % for big file reads and 140%
> >>>> for big file writes. Random reads should be about the same for raid5 and
> >>>> raid10,f2 - raid10,f2 maybe 15 % faster, while random writes should be
> >>>> mediocre for raid5, and good for raid10,f2.
> >>>
> >>> I'd be interested in reading about where you got these figures from
> >>> and/or the rationale behind them; I'd have guessed differently...

See more on our wiki for actual benchmarks,
http://linux-raid.osdl.org/index.php/Performance
http://blog.jamponi.net/2008/07/raid56-and-10-benchmarks-on-26255_10.html
The latter reports on arrays with 4 disks, so scale it down and you get
a good idea of the values to expect for 3 disks.

> >> For small values of N, 10,f2 generally comes quite close to N*Sr,
> >> where N is # of disks and Sr is single drive read speed. This is
> >> assuming fairly large reads and adequate stripe buffer
> >> space. Obviously for larger values of N that saturates something
> >> else in the system, like the bus, before N gets too large. I don't
> >> generally see more than (N/2-1)*Sw for write, at least for large
> >> writes. I came up with those numbers based on testing 3-4-5 drive
> >> arrays which do large file transfers. If you want to read more than
> >> large file speed into them, feel free.
> 
> With far copies reading is like reading raid0 and writing is like
> raid0 but writing twice with a seek between each. So (N/2) and (N/2-a
> bit) are the theoretical maximums and raid10 comes damn close to those.

My take on the theoretical maxima is:
raid10,f2 for sequential reads:  N * Sr
raid10,f2 for sequential writes: N/2 * Sw
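
To put rough numbers on those maxima, here is a back-of-envelope sketch
in Python; the 80 MB/s single-disk speed is just an assumed figure, not
a measurement:

# Theoretical ceilings for raid10,f2 sequential throughput with N
# disks; sr/sw are single-disk streaming read/write speeds in MB/s.
# Real arrays fall somewhat short of these numbers.

def raid10_f2_seq_read(n, sr):
    # far copies are laid out so reads stripe like raid0 over all N disks
    return n * sr

def raid10_f2_seq_write(n, sw):
    # every block exists in two far copies, so it gets written twice
    return n / 2 * sw

n, s = 3, 80                      # 3 disks, 80 MB/s per disk (assumed)
print(raid10_f2_seq_read(n, s))   # 240 MB/s, ~300% of a single disk
print(raid10_f2_seq_write(n, s))  # 120 MB/s, ~150% of a single disk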

> 
> > Actually it was the RAID-5 figures I'd have guessed differently. I'd
> > expect ~290% (rather than 210%) for big 3-disc RAID-5 reads, and ~140%
> > (rather than "mediocre") for random small writes. But of course I
> > haven't tested.
> 
> That kind of depends on the chunk size I think.
> 
> Say you have a raid 5 with chunk size << size of 1 track. Then on each
> disk you read 2 chunks, skip a chunk, read 2 chunks, skip a chunk. But
> skipping a chunk means waiting for the disk to rotate over it. That
> takes as long as reading it. You shouldn't even get 210% speed.
> 
> Only if chunk size >> size of 1 track could you seek over a
> chunk. And you have to hope that by the time you have seeked the start
> of the next chunk hasn't rotated past the head yet.
> 
> Anyone know what the size of a track is on modern disks? How many
> sectors/track do they have?

I believe Goswin's analysis here is valid: skipping sectors is as
expensive as reading them.

Anyway, with somewhat bigger chunk sizes you may get into the regime
where the parity chunks are seeked over rather than read, and thus go
beyond the N-1 mark. As I was trying to report the best obtainable
values, I chose to take this factor into account as well. Some figures
actually show a loss of only 0.50 for sequential reads on raid5 with a
chunk size of 2 MB.
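
A minimal sketch of that chunk-size effect for a 3-disk raid5; the 1 MB
track size and the "a seek hides most of the gap" factor are
assumptions, not measured values:

# raid5 sequential read: per N-chunk cycle each disk delivers N-1 data
# chunks and has to get past 1 parity chunk.  With chunks smaller than
# a track, rotating over the parity chunk costs as much time as reading
# it; only with chunks much larger than a track can a seek hide part of
# that cost.

def raid5_seq_read_factor(n_disks, chunk_mb, track_mb):
    data = n_disks - 1                       # data chunks per cycle, per disk
    skip = 1.0 if chunk_mb <= track_mb else track_mb / chunk_mb
    return n_disks * data / (data + skip)    # speed relative to one disk

track = 1.0                                  # MB, assumed track size
for chunk in (0.064, 0.512, 2.0):            # 64 KB, 512 KB, 2 MB chunks
    print(chunk, round(raid5_seq_read_factor(3, chunk, track), 2))
# -> about 2.0 for small chunks, about 2.4 once the chunk reaches 2 MB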

For sequential writes I was assuming that you write 2 data chunks and 1
parity chunk per stripe, and that the theoretical effective writing
speed would get close to 2 (for a 3 disk raid5). Jon's benchmark does
not support this: his best figure for raid5 is a loss of 2.25 in write
speed, where I would expect something like a little more than 1. Maybe
the fact that the test is on raw partitions, and not on a file system
with an active elevator, is in play here. Or maybe it is because there
is quite some work involved in the parity calculation, and with no
elevator the system has to wait for the parity calculation to complete
before the parity writes can be done.
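
For reference, the full-stripe accounting I was assuming looks like
this (a sketch of the ceiling, not a claim about what md actually
achieves; the 80 MB/s per-disk speed is again an assumed figure):

# raid5 full-stripe sequential write: of every N chunks that hit the
# disks, only N-1 carry user data, so the ceiling is (N-1) * sw,
# i.e. a loss of one disk's worth of bandwidth.

def raid5_seq_write_ceiling(n_disks, sw):
    return (n_disks - 1) * sw

print(raid5_seq_write_ceiling(3, 80))   # 160 MB/s, ~200% of one disk
print(raid5_seq_write_ceiling(4, 80))   # 240 MB/s for the 4-disk case;
                                        # the measured loss of 2.25 falls well short of this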


For random writes on raid5 I reported "mediocre". This is because if
you write randomly on raid5, you first need to read the data chunk and
the parity chunk, update them, and then write the data chunk and the
parity chunk again. And you need to read full chunks. So at most you
will get something like N/4 if your payload size is close to the chunk
size. If you have a big chunk size and a smallish payload size, then a
lot of the reads and writes are spent on uninteresting data. This
probably also goes for other raid types, and the fs elevator may help a
little here, especially for writing.
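
A back-of-envelope sketch of that read-modify-write accounting; the 100
IOPS per disk is an assumed figure, and the four chunk-sized I/Os per
update are the usual single-chunk small-write case:

# raid5 small random write: read the old data chunk and the old parity
# chunk, recompute parity, then write both back -> 4 chunk-sized I/Os
# for at most one chunk of useful payload.

def raid5_random_write_rate(n_disks, disk_iops, payload_kb, chunk_kb):
    ios_per_update = 4                              # 2 reads + 2 writes
    useful = min(payload_kb, chunk_kb) / chunk_kb   # payload fraction of a chunk
    return n_disks * disk_iops / ios_per_update * useful   # roughly N/4 of raw IOPS

def mirrored_random_write_rate(n_disks, disk_iops):
    # raid1/raid10 with 2 copies: every write hits 2 disks
    return n_disks * disk_iops / 2                  # roughly N/2 of raw IOPS

print(raid5_random_write_rate(3, 100, payload_kb=64, chunk_kb=64))   # ~75 updates/s
print(raid5_random_write_rate(3, 100, payload_kb=4, chunk_kb=256))   # ~1.2/s: mostly uninteresting data
print(mirrored_random_write_rate(3, 100))                            # ~150 writes/s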

In general I think raid5 random writes would be on the order of N/4,
whereas mirrored raid types would be N/2 (with 2 copies) - making raid5
half the speed of mirrored raid types like raid1 and raid10. I am not
sure I have data to back that statement up.

best regards
keld
