Re: increasing stripe_cache_size decreases RAID-6 read throughput

On Tue, 27 Apr 2010 10:18:36 -0700
Joe Williams <jwilliams315@xxxxxxxxx> wrote:

> >
> >
> >> Next question, is it normal for md0 to have no queue_depth setting?
> >
> > Yes.  The stripe_cache_size is conceptually a similar thing, but only
> > at a very abstract level.
> >
> >>
> >> Are there any other parameters that are important to performance that
> >> I should be looking at?
> >
> 
> >> I was expecting a little faster sequential reads, but 191 MB/s is not
> >> too bad. I'm not sure why it decreases to 130-131 MB/s at larger
> >> record sizes.
> >
> > I don't know why it would decrease either.  For sequential reads, read-ahead
> > should be scheduling all the read requests and the actual reads should just
> > be waiting for the read-ahead to complete.  So there shouldn't be any
> > variability - clearly there is.  I wonder if it is an XFS thing....
> > care to try a different filesystem for comparison?  ext3?
> 
> I can try ext3. When I run mkfs.ext3, are there any parameters that I
> should set to something other than the default values?
> 

No, the defaults are normally fine.  There might be room for small
improvements through tuning, but for now we are really looking for big
effects.

> >
> 
> > That is very weird, as reads don't use the stripe cache at all - when
> > the array is not degraded and no overlapping writes are happening.
> >
> > And the stripe_cache is measured in pages-per-device.  So 2560 means
> > 2560*4k for each device. There are 3 data devices, so 30720K or 60 stripes.
> >
> > When you set stripe_cache_size to 16384, it would have consumed
> >  16384*5*4K == 320Meg
> > or 1/3 of your available RAM.  This might have affected throughput,
> > I'm not sure.
> 
> Ah, thanks for explaining that! I set the stripe cache much larger
> than I intended to. But I am a little confused about your
> calculations. First you multiply 2560 x 4K x 3 data devices to get the
> total stripe_cache_size. But then you multiply 16384 x 4K x 5 devices
> to get the RAM usage. Why multiply by 3 in the first case, and 5 in
> the second? Does the stripe cache only cache data devices, or does it
> cache all the devices in the array?

I multiply by 3 when I'm calculating storage space in the array.
I multiply by 5 (the total number of devices) when I'm calculating the
amount of RAM consumed.

The cache holds content for each device, whether data or parity.
We do all the parity calculations in the cache, so it has to store everything.
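To make the arithmetic concrete, here is a quick sketch (Python, purely for
illustration - the 5-device layout with 3 data devices is taken from your
description, and the 4K page size is the usual default):

    # Sketch of the stripe cache arithmetic, assuming a 5-device RAID-6
    # (3 data + 2 parity) and 4 KiB pages.  stripe_cache_size is counted
    # in pages per device.

    PAGE_KIB = 4
    TOTAL_DEVICES = 5        # the cache holds pages for parity devices too
    DATA_DEVICES = 3

    def cache_ram_kib(stripe_cache_size):
        # RAM consumed by the cache: one page per device per entry.
        return stripe_cache_size * PAGE_KIB * TOTAL_DEVICES

    def cache_data_kib(stripe_cache_size):
        # How much file data fits: only the data devices count here.
        return stripe_cache_size * PAGE_KIB * DATA_DEVICES

    print(cache_data_kib(2560))            # 30720 KiB of data
    print(cache_ram_kib(16384) // 1024)    # 320 MiB of RAM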

> 
> What stripe_cache_size value or values would you suggest I try to
> optimize write throughput?

No idea.  It is dependent on load and hardware characteristics.
Try lots of different numbers and draw a graph.
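If it helps, something along these lines would do the sweep (only a sketch -
the md0 name, mount point and test size are assumptions, so adjust them for
your setup, and it needs to run as root to write to sysfs):

    #!/usr/bin/env python3
    # Sweep stripe_cache_size and record sequential write throughput.
    import os, subprocess, time

    SYSFS = "/sys/block/md0/md/stripe_cache_size"
    TESTFILE = "/mnt/test/throughput.bin"
    SIZE_MB = 2048

    for size in (256, 512, 1024, 2048, 4096, 8192, 16384):
        with open(SYSFS, "w") as f:
            f.write(str(size))

        start = time.time()
        subprocess.run(["dd", "if=/dev/zero", "of=" + TESTFILE,
                        "bs=1M", "count=%d" % SIZE_MB, "conv=fsync"],
                       check=True, capture_output=True)
        elapsed = time.time() - start
        os.remove(TESTFILE)

        print("stripe_cache_size=%6d  %7.1f MB/s" % (size, SIZE_MB / elapsed))

Plot the numbers and look for where the curve flattens out.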


> 
> The default setting for stripe_cache_size was 256. So 256 x 4K = 1024K
> per device, which would be two stripes, I think (you commented to that
> effect earlier). But somehow the default setting was not optimal for
> sequential write throughput. When I increased stripe_cache_size, the
> sequential write throughput improved. Does that make sense? Why would
> it be necessary to cache more than 2 stripes to get optimal sequential
> write performance?

The individual devices have some optimal write size - possibly one
track or one cylinder (if we pretend those words mean something useful these
days).
To be able to fill that, you really need that much cache for each device.
Maybe your drives work best when they are sent 8M (16 stripes, as you say in
a subsequent email) before expecting the first write to complete.

You say you get about 250MB/sec, so that is about 80MB/sec per drive
(3 drives' worth of data).
Rotational speed is what?  10K?  That is 166 revs per second.
So about 500K per revolution.
I imagine you would need at least 3 revolutions' worth of data in the cache:
one that is currently being written, one that is ready to be written next
(so the drive knows it can just keep writing), and one that you are in the
process of filling up.
You find that you need about 16 revolutions (it seems to be about one
revolution per stripe).  That is more than I would expect... maybe there is
some extra latency somewhere.
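The back-of-envelope version of that, if you want to plug in your own numbers
(the 10K rpm figure and the 250 MB/s aggregate are the assumptions here):

    # Rough check of the figures above: assumed 10K rpm drives and
    # ~250 MB/s aggregate write spread across 3 data drives.
    aggregate_mb_s = 250.0
    data_drives = 3
    rpm = 10000

    per_drive = aggregate_mb_s / data_drives        # ~83 MB/s per drive
    revs_per_sec = rpm / 60.0                       # ~167 revs per second
    kb_per_rev = per_drive * 1000 / revs_per_sec    # ~500 KB per revolution

    print("%.0f MB/s per drive, %.0f revs/s, %.0f KB per revolution"
          % (per_drive, revs_per_sec, kb_per_rev))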

NeilBrown

