Re: question about MD raid rebuild performance degradation even with speed_limit_min/speed_limit_max set.

On Tue, 28 Oct 2014 22:34:07 -0400 Jason Keltz <jas@xxxxxxxxxxxx> wrote:

> On 28/10/2014 6:38 PM, NeilBrown wrote:
> > On Mon, 20 Oct 2014 17:07:38 -0400 Jason Keltz<jas@xxxxxxxxxxxx>  wrote:
> >
> >> On 10/20/2014 12:19 PM, Jason Keltz wrote:
> >>> Hi.
> >>>
> >>> I'm creating a 22 x 2 TB SATA disk MD RAID10 on a new RHEL6 system.
> >>> I've experimented with setting "speed_limit_min" and "speed_limit_max"
> >>> kernel variables so that I get the best balance of performance during
> >>> a RAID rebuild of one of the RAID1 pairs. If, for example, I set
> >>> speed_limit_min AND speed_limit_max to 80000 then fail a disk when
> >>> there is no other disk activity, then I do get a rebuild rate of
> >>> around 80 MB/s. However, if I then start up a write intensive
> >>> operation on the MD array (eg. a dd, or a mkfs on an LVM logical
> >>> volume that is created on that MD), then, my write operation seems to
> >>> get "full power", and my rebuild drops to around 25 MB/s. This means
> >>> that the rebuild of my RAID10 disk is going to take a huge amount of
> >>> time (>12 hours!!!). When I set speed_limit_min and speed_limit_max to
> >>> the same value, am I not guaranteeing the rebuild speed? Is this a bug
> >>> that I should be reporting to Red Hat, or a "feature"?
> >>>
> >>> Thanks in advance for any help that you can provide...
> >>>
> >>> Jason.
> >> I would like to add that I downloaded the latest version of Ubuntu, and
> >> am running it on the same server with the same MD.
> >> When I set speed_limit_min and speed_limit_max to 80000, I was able to
> >> start two large dds on the md array, and the rebuild stuck at around 71
> >> MB/s, which is close enough.  This leads me to believe that the problem
> >> above is probably a RHEL6 issue.  However, after I stopped the two dd
> >> operations,  and raised both speed_limit_min and speed_limit_max to
> >> 120000, the rebuild stayed between 71-73 MB/s for more than 10 minutes
> >> .. now it seems to be at 100 MB/s... but doesn't seem to get any higher
> >> (even though I had 120 MB/s and above on the RHEL system without any
> >> load)... Hmm.
> >>
> > md certainly cannot "guarantee" any speed - it can only deliver what the
> > underlying devices deliver.
> > I know the kernel logs say something about a "guarantee".  That was added
> > before my time and I haven't had occasion to remove it.
> >
> > md will normally just try to recover as fast as it can unless that exceeds
> > one of the limits - then it will back off.
> > What speed it actually achieves depends on other load and the behaviour of
> > the IO scheduler.
> >
> > "RHEL6" and "Ubuntu" don't mean a lot to me.  Specific kernel version might,
> > though in the case of Red Hat I know they backport lots of stuff, so even the
> > kernel version isn't very helpful.  I much prefer having reports against
> > mainline kernels.
> >
> > Rotating drives do get lower transfer speeds at higher addresses.  That might
> > explain the 120 / 100 difference.
> Hi Neil,
> Thanks very much for your response.
> I must say that I'm a little puzzled though. I'm coming from using a 
> 3Ware hardware RAID controller where I could configure how much of the 
> disk bandwidth is to be used for a rebuild versus I/O.   From what I 
> understand, you're saying that MD can only use the disk bandwidth 
> available to it.  It seems that it doesn't take any priority in the I/O 
> chain.  It will only attempt to use no less than min bandwidth, and no 
> more than max bandwidth for the rebuild, but if you're on a busy system, 
> and other system I/O needs that disk bandwidth, then there's nothing it 
> can do about it.  I guess I just don't understand why.  Why can't md be 
> given a priority in the kernel to allow the admin to decide how much 
> bandwidth goes to system I/O versus rebuild I/O?  Even in a busy system, 
> I still want to allocate at least some minimum bandwidth to MD.  In 
> fact, in the event of a disk failure, I want to have a whole lot of the 
> disk bandwidth dedicated to MD.  It's something about short term pain 
> for long term gain? I'd rather not have the users suffer at all, but if 
> they do have to suffer, I'd rather them suffer for a few hours, knowing 
> that after that, the RAID system is in a perfectly good state with no 
> bad disks as opposed to letting a bad disk resync take days because the 
> system is really busy... days during which another failure might occur!
> 
> Jason.
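
For reference, the "kernel variables" being tuned above are the md sysctls
under /proc/sys/dev/raid/ - speed_limit_min and speed_limit_max - and their
values are in KB/sec per device, so 80000 corresponds to the ~80 MB/s figure
quoted.  The usual way to set them is an echo into those files from a shell;
the C sketch below is only illustrative (it needs root, and the paths and
units are all it assumes):

    /* Illustrative sketch: pin both md resync speed limits, in KB/sec
     * per device, by writing to the /proc/sys/dev/raid/ files.  This is
     * equivalent to "echo 80000 > /proc/sys/dev/raid/speed_limit_min"
     * plus the matching write to speed_limit_max.  Requires root. */
    #include <stdio.h>
    #include <stdlib.h>

    static int write_limit(const char *path, long kb_per_sec)
    {
        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            return -1;
        }
        fprintf(f, "%ld\n", kb_per_sec);
        return fclose(f);
    }

    int main(void)
    {
        /* e.g. pin both limits to ~80 MB/s, as in the test described above */
        if (write_limit("/proc/sys/dev/raid/speed_limit_min", 80000) ||
            write_limit("/proc/sys/dev/raid/speed_limit_max", 80000))
            return EXIT_FAILURE;
        return EXIT_SUCCESS;
    }

The resync progress and its current speed can be watched in /proc/mdstat.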

It isn't so much "that MD can only use..." but rather "that MD does only
use ...".

This is how the code has "always" worked and no-one has ever bothered to
change it, or to ask for it to be changed (that I recall).

There are difficulties in guaranteeing a minimum when the array uses
partitions from devices on which other partitions are used for other things.
In that case I don't think it is practical to make guarantees, but that
needn't stop us making guarantees when we can, I guess.

If the configured bandwidth exceeded the physically available bandwidth, I
don't think we would want to exclude non-resync IO completely, so the
guarantee would have to be:
   N MB/sec or M% of available, whichever is less
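
(A worked example with made-up numbers: with N = 80 MB/sec and M = 50%, a
member device currently able to deliver only 100 MB/sec in total would have
min(80, 0.5 * 100) = 50 MB/sec reserved for resync, so ordinary IO is never
squeezed out entirely.)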

We could even implement a different approach in a backwards-compatible way.
Introduce a new setting "max_sync_percent".  By default that is unset and the
current algorithm applies.
If it is set to something below 100, non-resync IO is throttled to
an appropriate fraction of the actual resync throughput whenever that is
below sync_speed_min.

Or something like that.
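
A rough sketch of that rule, per device, in the same KB/sec units the existing
limits use.  The "max_sync_percent" setting does not exist yet; the names and
the exact arithmetic here are only one possible reading of the idea above:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical per-device check for the proposed "max_sync_percent"
     * setting.  All speeds are in KB/sec, like sync_speed_min/max.
     * Returns true if non-resync IO should be delayed so that resync can
     * get closer to its configured share of the device's bandwidth. */
    static bool throttle_non_resync(long resync_kbps,      /* measured resync speed */
                                    long non_resync_kbps,  /* measured other IO */
                                    long sync_speed_min,   /* existing minimum goal */
                                    int  max_sync_percent) /* 0 = unset */
    {
        long allowed_non_resync;

        if (max_sync_percent <= 0 || max_sync_percent > 100)
            return false;        /* unset: the current algorithm applies */
        if (resync_kbps >= sync_speed_min)
            return false;        /* resync is already meeting its minimum */

        /* If resync is promised at least max_sync_percent of the bandwidth,
         * other IO may use at most the complementary share of whatever
         * resync is actually achieving. */
        allowed_non_resync = resync_kbps * (100 - max_sync_percent)
                             / max_sync_percent;
        return non_resync_kbps > allowed_non_resync;
    }

    int main(void)
    {
        /* Example: resync limping along at 25 MB/sec against an 80 MB/sec
         * minimum, other IO doing 150 MB/sec, resync promised 50%. */
        printf("throttle? %s\n",
               throttle_non_resync(25000, 150000, 80000, 50) ? "yes" : "no");
        return 0;
    }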

Some care would be needed in comparing throughputs, as sync throughput is
measured per-device, while non-resync throughput might be measured per-array.
Maybe the throttling would happen per-device??

All we need now is for someone to firm up the design and then write the code.

NeilBrown
